
Built a search engine from scratch for a Wikipedia corpus of over 21 million articles (85 Gb) to give search results within 4 seconds. Parsed Wikipedia pages into tokens by applying appropriate techniques and built an inverted index structure for the corpus. Used NLP techniques to implement page ranking to get top search results according to rel…


sushant-09/Wikipedia-Search-Engine


Wikipedia Search Engine

Problem Statement

Design and implement a search engine from scratch for an English Wikipedia corpus of over 21 million articles (~85 GB) that returns search results within 3-4 seconds.

Overview
  • The Wikipedia XML dump is parsed to extract the individual documents.
  • Stop words are removed and the remaining tokens are stemmed using NLTK.
  • An inverted index is built that stores a posting list for each word.
  • At search time, the posting lists for the query terms are retrieved to get candidate document IDs, which are then ranked by TF-IDF to return the most relevant results.
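
The pipeline above can be illustrated with a minimal, in-memory sketch. This is not the repository's code: the function names, the tiny stop-word set, and the crude `stem` helper are all hypothetical stand-ins (the project itself uses NLTK for stop words and stemming), and the log-scaled TF-IDF weighting is one common variant chosen here as an assumption.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, tiny stop-word set; the project uses NLTK's list instead.
STOP_WORDS = {"the", "a", "an", "of", "is", "in", "and", "to"}


def stem(token):
    """Crude suffix stripper standing in for an NLTK stemmer (assumption)."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs):
    """Map each term to a posting list of the form {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query, k=10):
    """Return up to k document IDs ranked by summed TF-IDF over query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the corpus
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]


docs = {
    1: "Python is a programming language",
    2: "The python is a large snake",
    3: "Java is a programming language",
}
index = build_index(docs)
print(search(index, len(docs), "python snake"))  # [2, 1]
```

Document 2 ranks first because it matches both query terms, and "snake" is rarer (higher IDF) than "python". At the scale described here, the posting lists would live on disk rather than in a Python dictionary.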
System requirements
  • Python 3
  • NLTK
  • 85 GB of free space for the corpus and 21 GB for the inverted index
To run

Create inverted index

python3 indexer.py <path to wikipedia dump>
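
An 85 GB dump cannot be loaded into memory at once, so an indexer of this kind typically streams the XML one page at a time. The sketch below shows one way to do that with the standard library's `xml.etree.ElementTree.iterparse`. It is an illustration under assumptions, not the contents of `indexer.py`: the `iter_pages` and `local_name` helpers are hypothetical, and the code assumes the usual MediaWiki export layout of `<page>` elements containing `<title>` and `<revision><text>`.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, clearing each page after use.

    `source` may be a file path or a binary file object.
    """
    title, text = "", ""
    for _, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = "", ""
            elem.clear()  # drop the page's children and text once consumed


# Tiny stand-in for a dump file (hypothetical sample data).
sample = (
    b"<mediawiki>"
    b"<page><title>A</title><revision><text>alpha</text></revision></page>"
    b"<page><title>B</title><revision><text>beta</text></revision></page>"
    b"</mediawiki>"
)
for page_title, page_text in iter_pages(io.BytesIO(sample)):
    print(page_title, page_text)
```

Each yielded pair can then be tokenized and added to the inverted index before the next page is read.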

Search

python3 search <search query>
