Poogle Search Engine

The repository of the Poogle search engine. The final report is available here.

Getting Started

We use Apache Maven to manage compilation, testing, and execution.

Prerequisites

We use AWS DynamoDB for storage. To run the code, you'll need to create the following DynamoDB tables (a table-creation sketch follows the list):

  • URL: uses url as the partition key and contains a global secondary index md5-weight-index which uses md5 as the partition key and weight as the sort key. The table also has attributes date and outboundLinks.
  • DOCUMENT: uses md5 as the partition key and contains attributes date and document.
  • INVIDX: uses word as the partition key and md5 as the sort key.
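
For reference, a minimal sketch of creating the URL table with the AWS SDK for Java (v1) is shown below. The table, index, and attribute names follow the description above; the key types and provisioned throughput values are assumptions and may need adjusting. The DOCUMENT and INVIDX tables can be created the same way.

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.*;

public class CreateUrlTableSketch {
    public static void main(String[] args) {
        AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();

        // Only key attributes need definitions; date and outboundLinks are schemaless.
        CreateTableRequest request = new CreateTableRequest()
                .withTableName("URL")
                .withAttributeDefinitions(
                        new AttributeDefinition("url", ScalarAttributeType.S),
                        new AttributeDefinition("md5", ScalarAttributeType.S),
                        new AttributeDefinition("weight", ScalarAttributeType.N))
                .withKeySchema(new KeySchemaElement("url", KeyType.HASH))
                .withGlobalSecondaryIndexes(new GlobalSecondaryIndex()
                        .withIndexName("md5-weight-index")
                        .withKeySchema(
                                new KeySchemaElement("md5", KeyType.HASH),
                                new KeySchemaElement("weight", KeyType.RANGE))
                        .withProjection(new Projection().withProjectionType(ProjectionType.ALL))
                        .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)))
                .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L));

        client.createTable(request);
    }
}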

Installation

To run each part of the code, add the corresponding command below to a Run Configuration and apply it:

clean install exec:java@crawler   # Run the crawler and update the database

clean install exec:java@pagerank  # Run the pagerank MapReduce job and update the database

clean install exec:java@indexer   # Run the indexer MapReduce job and update the database

clean install exec:java@server    # Start the search engine server

For the client-side frontend, go to the ./client directory and run npm start, then open http://localhost:3000 to view the web page in a browser.

Features

Crawler

We have two major versions of the crawler: one implemented with a thread pool, and another implemented with Apache Storm and Kafka.

The main entry point is edu.upenn.cis.cis455.crawler.Crawler; all crawler files currently live on the crawler-k8s-threadpool branch. The crawler uses a Bloom filter to skip duplicate and already-seen URLs, keeps an LRU cache of robots.txt files, and groups URLs for batch database updates. Non-HTML content is parsed with Apache Tika. All web metadata and documents are stored in Amazon DynamoDB.
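
As an illustration of the de-duplication and robots.txt caching described above (not the actual crawler code), the sketch below assumes Guava's BloomFilter and a LinkedHashMap-based LRU cache; the capacity values are hypothetical.

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class CrawlerStateSketch {
    // Bloom filter for URLs we have (probably) already seen; sizing is illustrative.
    private final BloomFilter<CharSequence> seenUrls =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

    // LRU cache mapping host -> robots.txt contents, capped at 1,000 entries.
    private static final int MAX_ROBOTS_ENTRIES = 1_000;
    private final Map<String, String> robotsCache =
            new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > MAX_ROBOTS_ENTRIES;
                }
            };

    // Returns true only the first time a URL is offered; later offers are skipped.
    public boolean markIfNew(String url) {
        if (seenUrls.mightContain(url)) {
            return false;
        }
        seenUrls.put(url);
        return true;
    }

    public void cacheRobots(String host, String robotsTxt) {
        robotsCache.put(host, robotsTxt);
    }

    public String getCachedRobots(String host) {
        return robotsCache.get(host);  // null if evicted or never fetched
    }
}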

We intend to deploy the thread-pool version of the distributed crawler with Kubernetes: each crawler runs as a separate program, but all of them share a common URL queue hosted on Amazon SQS. We will also explore the more powerful distributed crawler implemented with Apache Storm and Kafka; our plan is to host the Storm and Kafka components (nimbus, ZooKeeper, supervisors, etc.) on a Kubernetes cluster.

PageRank

We have implemented an EMR-based PageRank.

The main function is located in edu.upenn.cis.cis455.pagerank.PageRankInterface; it takes three arguments:

  1. The input file location containing URLs and their outbound links.
  2. The desired output directory.
  3. A boolean: true to give less weight to intra-domain links, false to treat intra-domain and inter-domain links the same (see the sketch after this list).
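
To illustrate the third argument, here is a hypothetical helper (not the actual MapReduce job) showing one way a page's rank could be split across its outbound links, with intra-domain links receiving a smaller share when the flag is set; the 0.5 discount factor is an assumption.

import java.net.URI;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PageRankWeightSketch {
    // Illustrative discount applied to intra-domain links when the flag is true.
    private static final double INTRA_DOMAIN_DISCOUNT = 0.5;

    // Splits a page's rank among its outbound links, optionally down-weighting intra-domain ones.
    static Map<String, Double> distribute(String pageUrl, double rank,
                                          List<String> outboundLinks,
                                          boolean penalizeIntraDomain) {
        String pageHost = hostOf(pageUrl);
        Map<String, Double> rawWeights = new HashMap<>();
        double total = 0.0;
        for (String link : outboundLinks) {
            boolean sameDomain = pageHost != null && pageHost.equals(hostOf(link));
            double weight = (penalizeIntraDomain && sameDomain) ? INTRA_DOMAIN_DISCOUNT : 1.0;
            rawWeights.merge(link, weight, Double::sum);
            total += weight;
        }
        // Normalize so all contributions sum to the page's rank.
        Map<String, Double> contributions = new HashMap<>();
        for (Map.Entry<String, Double> e : rawWeights.entrySet()) {
            contributions.put(e.getKey(), rank * e.getValue() / total);
        }
        return contributions;
    }

    private static String hostOf(String url) {
        try {
            return URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return null;
        }
    }
}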

Indexer

We have implemented an EMR-based indexer that builds an inverted index over the crawled document corpus.
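
For illustration only, a minimal Hadoop MapReduce sketch of building an inverted index is shown below. It assumes each input line has the form "md5<TAB>document text", which is not necessarily the project's actual input format.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexSketch {
    // Map: emit (word, md5) for every word in the document body.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length < 2) return;
            Text md5 = new Text(parts[0]);
            for (String word : parts[1].toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    ctx.write(new Text(word), md5);
                }
            }
        }
    }

    // Reduce: collect all document hashes for a word into one postings list.
    public static class PostingsReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> md5s, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder postings = new StringBuilder();
            for (Text md5 : md5s) {
                if (postings.length() > 0) postings.append(',');
                postings.append(md5.toString());
            }
            ctx.write(word, new Text(postings.toString()));
        }
    }
}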

Search Engine

We used React.js to develop the frontend of the search engine, following this Medium article as a reference. Code was adapted from https://github.com/5ebs/Google-Clone with substantial modification. Users see the URL and a preview snippet of each web page. The search engine caches search results in BerkeleyDB, so when a user repeats a query, it responds quickly from the cache.
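
As a rough sketch of that caching layer (assuming Berkeley DB Java Edition; the class, database, and method names here are hypothetical), a query string can be mapped to its serialized result set like this:

import java.io.File;
import java.nio.charset.StandardCharsets;
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;

public class QueryCacheSketch {
    private final Environment env;
    private final Database db;

    public QueryCacheSketch(File dir) {
        EnvironmentConfig envCfg = new EnvironmentConfig();
        envCfg.setAllowCreate(true);
        env = new Environment(dir, envCfg);

        DatabaseConfig dbCfg = new DatabaseConfig();
        dbCfg.setAllowCreate(true);
        db = env.openDatabase(null, "queryCache", dbCfg);
    }

    // Store the serialized results (e.g. JSON) for a query.
    public void put(String query, String resultsJson) {
        db.put(null,
               new DatabaseEntry(query.getBytes(StandardCharsets.UTF_8)),
               new DatabaseEntry(resultsJson.getBytes(StandardCharsets.UTF_8)));
    }

    // Return the cached results, or null on a cache miss.
    public String get(String query) {
        DatabaseEntry value = new DatabaseEntry();
        OperationStatus status = db.get(null,
                new DatabaseEntry(query.getBytes(StandardCharsets.UTF_8)),
                value, LockMode.DEFAULT);
        return status == OperationStatus.SUCCESS
                ? new String(value.getData(), StandardCharsets.UTF_8)
                : null;
    }
}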

Extra Credit

  1. The crawler can handle non-HTML data.
  2. The crawler can store partial metadata about web documents.
  3. The indexer uses web page metadata to improve rankings.

Source Files

  • Crawler: edu.upenn.cis.cis455.crawler
  • PageRank: edu.upenn.cis.cis455.pagerank
  • Indexer: edu.upenn.cis.cis455.indexer
  • Search Engine: edu.upenn.cis.cis455.searchengine

About

A Google-style distributed search engine.
