Natural Language Processing Project: Utilizing NLTK and Python to process and analyze the Reuters-21578 dataset, enhancing text retrieval through advanced tokenization, stemming, and stop word removal, along with implementing query processing and ranking mechanisms.

sgalawar/nlp-data-processing-retrieval-system

Natural Language Processing Project

Text processing & data retrieval system

The Reuters-21578 dataset is a collection of news articles. The original corpus has 10,369 documents and a vocabulary of 29,930 words.

The goal of this project is to experiment with text processing using NLTK and Python.

The project is divided into 3 parts:

Part 1 - Developed a pipeline that reads the data, extracts the text, tokenizes it, lowercases it, applies the Porter Stemmer algorithm (which reduces words to their stems, e.g. jumping -> jump), and removes stop words. After each step of the pipeline, the results are exported to a .txt file so the intermediate output can be inspected. Each step is also a separate function, since this modularity makes debugging easier.
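
The steps of Part 1 could be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names, the regex tokenizer, and the tiny `STOP_WORDS` set are all assumptions made here, and the per-step .txt export is left out for brevity. The sketch uses NLTK's `PorterStemmer` when NLTK is installed and falls back to a crude suffix stripper (not the Porter algorithm) otherwise, so that it runs either way.

```python
import re

try:
    # NLTK's Porter stemmer needs no extra corpus downloads.
    from nltk.stem import PorterStemmer
    _stem_word = PorterStemmer().stem
except ImportError:
    # Stand-in only: a crude suffix stripper, NOT the Porter algorithm.
    def _stem_word(word):
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

# Hypothetical, deliberately tiny stop word list for illustration.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "it", "of", "over", "the", "to"})


def tokenize(text):
    """Split raw text into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def stem(tokens):
    """Reduce every token to its stem."""
    return [_stem_word(t) for t in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def run_pipeline(text):
    """Apply the steps in the order the README lists them."""
    tokens = tokenize(text)
    tokens = lowercase(tokens)
    tokens = stem(tokens)
    return remove_stop_words(tokens)


print(run_pipeline("The fox is jumping over the dogs"))
```

Keeping each step as its own function means any intermediate list can be inspected, or written to a file, between calls.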

Part 2 - Implemented a naive indexer (stores words and their locations), and a single-term query processing system (handles search for individual words).
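
One way to picture a naive indexer with single-term lookup is the sketch below. It assumes documents are already preprocessed into token lists and that "locations" means a document ID plus token positions; the names `build_index` and `query_term` are hypothetical and not taken from the repository.

```python
from collections import defaultdict


def build_index(docs):
    """Build a naive inverted index.

    docs maps a document ID to its list of preprocessed tokens.
    The result maps each term to {doc_id: [positions in that document]}.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


def query_term(index, term):
    """Return the sorted IDs of all documents that contain the term."""
    return sorted(index.get(term, {}))


# Toy documents, invented for illustration.
docs = {
    1: ["oil", "price", "rise"],
    2: ["gold", "price"],
    3: ["oil", "oil", "export"],
}
index = build_index(docs)
print(query_term(index, "oil"))
print(index["oil"])
```

A query term should go through the same lowercasing and stemming as the documents before lookup, otherwise a search for "Jumping" would miss the indexed stem "jump".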

Part 3 - Refined the indexing procedure. Implemented ranking of the returned results.
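
The README does not say which ranking function is used, so the sketch below picks TF-IDF purely as a stand-in to show what ranking returned documents can look like. The function name `rank` and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def rank(docs, query_terms):
    """Rank documents for a query with a simple TF-IDF score.

    docs maps a document ID to its list of preprocessed tokens.
    Returns (doc_id, score) pairs, best first; ties are broken by doc ID.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term_freq[term]:
                tf = 1 + math.log(term_freq[term])
                idf = math.log(n_docs / doc_freq[term])
                score += tf * idf
        if score > 0:
            scores[doc_id] = score

    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy documents, invented for illustration.
docs = {
    1: ["oil", "price", "rise"],
    2: ["gold", "price"],
    3: ["oil", "oil", "export"],
}
for doc_id, score in rank(docs, ["oil"]):
    print(doc_id, round(score, 3))
```

With this weighting, a term that occurs in every document has an IDF of zero and contributes nothing to the score, so documents matching only such terms are dropped from the results.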
