GitHub - LaoLiulaoliu/udacity-search-engine

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.gitignore		.gitignore
README		README
cache.py		cache.py
confparse.py		confparse.py
crawler.py		crawler.py
engine.ini		engine.ini
engine_search.py		engine_search.py
engine_spider.py		engine_spider.py
get_url.py		get_url.py
html_text.py		html_text.py
indexing.py		indexing.py
log.py		log.py
md5.py		md5.py
pickle_file.py		pickle_file.py
porter_stemming.py		porter_stemming.py
rank.py		rank.py
search.py		search.py
thread_pool.py		thread_pool.py
tools.py		tools.py

Repository files navigation

udacity-search-engine

##################################


This search engine is separated into two parts.
1. engine_spider.py
2. engine_search.py
We can crawl new things to enlarge the data, and search word without a crawling process.


engine_spider:
    Crawling web pages using thread pool model. At the same time, indexing all the words with crawled page(word stemming in it). After crawling and indexing, based on all pages' outgoing link, compute every page's ranking.
    Dump the ranking and indexing variables into the hard disk.
    Load the ranking and indexing variables from old data on hard disk.
    Whenever need to stop the spider,use Ctrl+C signal, program will dump the variables from memory to disk automatically.
    Auto detecte the webpages' encoding, and decode them.

engine_search:
    Load ranking and indexing variables from hard disk.
    Search the words or a sentence.
    Inputing words from the console, then stemming and searching.
    Showing the result in smart way.(quick sort and count the NO. of different word in one url)
    Caching the query result.