A comprehensive implementation of advanced NLP techniques: question answering with Retrieval-Augmented Generation (RAG), toxicity detection, topic modeling, and large language model evaluation.
This project was developed by a team of students from Politecnico di Milano.
This repository contains an end-to-end NLP project built on the neural-bridge/rag-dataset-12000 dataset. Our work covers the full data science pipeline, from exploratory data analysis to model deployment, with a special focus on question answering using state-of-the-art language models and retrieval techniques.
- Exploratory data analysis of question-answer-context triples
- Text cleaning with stopword removal, lemmatization, and normalization
- Statistical analysis of question types and document lengths
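The question-type analysis above can be sketched as follows. The sample questions are illustrative stand-ins for the dataset, and the wh-word classifier is a minimal assumption about how question types were bucketed:

```python
import re
from collections import Counter

# Toy sample standing in for the dataset's questions (hypothetical examples).
questions = [
    "What is the capital of France?",
    "What causes tides?",
    "Who wrote the novel?",
    "How does photosynthesis work?",
    "Why is the sky blue?",
]

def question_type(q: str) -> str:
    """Classify a question by its leading wh-word (or 'other')."""
    match = re.match(r"\s*(what|who|how|why|when|where|which)\b", q.lower())
    return match.group(1) if match else "other"

type_counts = Counter(question_type(q) for q in questions)
print(type_counts.most_common())  # 'what' dominates, as in the full dataset
```

On the full dataset, the same counting reproduces the "what"-heavy distribution reported in the key findings below.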
- Word2Vec embeddings with skip-gram and negative sampling
- K-means clustering with optimal cluster selection using silhouette analysis
- TF-IDF vectorization with visual word highlighting
- Sentence embeddings using all-MiniLM-L6-v2
- BERTopic implementation for context clustering
- Topic visualization with interactive maps and hierarchical clustering
- Toxicity detection using weakly supervised labeling
- BERT-based binary classifier for content moderation
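The weak-supervision step can be illustrated as follows. The lexicon and the any-match rule are assumptions for the sketch, not the project's actual labeling rules; the resulting noisy labels are what the BERT classifier is then fine-tuned on:

```python
# Weak labeling: a keyword lexicon votes on a toxicity label, which serves
# as (noisy) training data for the downstream BERT classifier.
TOXIC_TERMS = {"hate", "stupid", "idiot", "kill"}  # illustrative lexicon

def weak_label(text: str) -> int:
    """Return 1 (toxic) if any lexicon term appears, else 0 (non-toxic)."""
    tokens = set(text.lower().split())
    return int(bool(tokens & TOXIC_TERMS))

contexts = [
    "I hate this stupid product",
    "The weather was pleasant all week",
]
labels = [weak_label(c) for c in contexts]
print(labels)  # [1, 0]
```

Labels produced this way are cheap but imperfect, which is why a learned classifier trained on them can generalize beyond the lexicon itself.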
- Zero-shot question answering with Google's Gemma-2b and T5 models
- Context-enhanced generation evaluation
- Performance assessment using BERTScore, ROUGE, and F1 metrics
- Comparative analysis between models with/without context
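The zero-shot versus context-enhanced comparison boils down to how the generation prompt is built. The exact prompt wording here is an assumption; the project feeds these prompts to Gemma-2b and T5 via the `transformers` generation API:

```python
from typing import Optional

def build_prompt(question: str, context: Optional[str] = None) -> str:
    """Build the generation prompt, optionally prepending retrieved context."""
    if context is None:  # zero-shot: the model must answer from its own memory
        return f"Question: {question}\nAnswer:"
    # Context-enhanced: the model can ground its answer in the passage.
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

q = "When was the bridge completed?"
ctx = "The bridge, finished in 1937, spans the strait."
print(build_prompt(q))       # zero-shot prompt
print(build_prompt(q, ctx))  # context-enhanced prompt
```

Running both variants through the same model and scoring the outputs against the reference answers yields the with/without-context comparison above.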
- Dense retrieval using sentence transformers
- Sparse retrieval with TF-IDF
- Context-aware answer generation with Gemma
- Performance evaluation against reference answers
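The sparse-retrieval branch can be sketched with scikit-learn as below; the mini-corpus is hypothetical. The dense branch has the same shape but swaps the TF-IDF vectorizer for sentence-transformer embeddings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus of contexts; the project indexes the full dataset.
contexts = [
    "The Eiffel Tower was completed in 1889 in Paris.",
    "Photosynthesis converts sunlight into chemical energy.",
    "The Great Wall of China stretches thousands of kilometres.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(contexts)

def retrieve(query, k=1):
    """Return indices of the top-k contexts by TF-IDF cosine similarity."""
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    return scores.argsort()[::-1][:k].tolist()

print(retrieve("When was the Eiffel Tower built?"))  # → [0]
```

The top-ranked context is then spliced into Gemma's prompt to produce the context-aware answer.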
- Gradio-based UI for search and question answering
- Text-to-speech synthesis for answer verbalization
- Automatic evaluation with BERTScore integration
- Comparative answer visualization
The project utilizes the neural-bridge/rag-dataset-12000 dataset from Hugging Face, which contains:
- Context passages that provide the information needed to answer each question
- Natural language questions related to each context
- Reference answers generated by GPT-4
- Train/test split with 9,600/2,400 samples
A copy of the dataset is also available on Kaggle.
Our analysis revealed several key insights:
- Question distribution: The dataset is heavily skewed toward "what" questions (78%), with far fewer "who," "how," and "why" questions
- Context length: Most contexts contain 200-450 tokens, making them substantial but manageable for retrieval systems
- Model performance: Gemma consistently outperforms T5 across all metrics, with context-enhanced answers showing substantial improvements over zero-shot generation
- BERTScore results: High semantic similarity (0.91 F1) between context-enhanced Gemma answers and reference answers, despite lower exact match rates
- Topic diversity: BERTopic successfully identified distinct thematic clusters within the dataset, with some containing potentially toxic content
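The gap between high BERTScore and lower exact match noted above is easiest to see with the standard token-overlap F1 metric, sketched here (the example strings are illustrative):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A paraphrased answer scores well on overlap even without an exact match.
print(token_f1("completed in 1937", "the bridge was completed in 1937"))  # ≈ 0.667
```

BERTScore goes one step further, comparing contextual embeddings instead of surface tokens, which is why semantically equivalent paraphrases score near 1.0 even when token F1 is modest.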
The project is implemented as a Jupyter notebook with comprehensive documentation and interactive components. To reproduce our results:
- Clone this repository
- Install the required dependencies: `pip install -r requirements.txt`
- Run the notebook `NLP-project.ipynb`
- Explore the interactive demonstrations and analysis
This project was completed as part of the Natural Language Processing course at Politecnico di Milano (a.y. 2024/25).