A comprehensive implementation of advanced NLP techniques: question answering with Retrieval-Augmented Generation (RAG), toxicity detection, topic modeling, and large language model evaluation.
This project was developed by a team of students from Politecnico di Milano.
This repository contains an end-to-end NLP project built on the neural-bridge/rag-dataset-12000 dataset. Our work covers the full data science pipeline, from exploratory data analysis to model deployment, with a special focus on question answering using state-of-the-art language models and retrieval techniques.
- Exploratory data analysis of question-answer-context triples
- Text cleaning with stopword removal, lemmatization, and normalization
- Statistical analysis of question types and document lengths
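The question-type analysis above can be sketched as follows. The sample questions are illustrative stand-ins for the dataset, and the wh-word classifier is a minimal assumption about how question types were bucketed:

```python
import re
from collections import Counter

# Toy sample standing in for the dataset's questions (hypothetical examples).
questions = [
    "What is the capital of France?",
    "What causes tides?",
    "Who wrote the novel?",
    "How does photosynthesis work?",
    "Why is the sky blue?",
]

def question_type(q: str) -> str:
    """Classify a question by its leading wh-word (or 'other')."""
    match = re.match(r"\s*(what|who|how|why|when|where|which)\b", q.lower())
    return match.group(1) if match else "other"

type_counts = Counter(question_type(q) for q in questions)
print(type_counts.most_common())  # 'what' dominates, as in the full dataset
```

On the full dataset, the same counting reproduces the "what"-heavy distribution reported in the key findings below.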
- Word2Vec embeddings with skip-gram and negative sampling
- K-means clustering with optimal cluster selection using silhouette analysis
- TF-IDF vectorization with visual word highlighting
- Sentence embeddings using all-MiniLM-L6-v2
- BERTopic implementation for context clustering
- Topic visualization with interactive maps and hierarchical clustering
- Toxicity detection using weakly supervised labeling
- BERT-based binary classifier for content moderation
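The weak-supervision step can be illustrated as follows. The lexicon and the any-match rule are assumptions for the sketch, not the project's actual labeling rules; the resulting noisy labels are what the BERT classifier is then fine-tuned on:

```python
# Weak labeling: a keyword lexicon votes on a toxicity label, which serves
# as (noisy) training data for the downstream BERT classifier.
TOXIC_TERMS = {"hate", "stupid", "idiot", "kill"}  # illustrative lexicon

def weak_label(text: str) -> int:
    """Return 1 (toxic) if any lexicon term appears, else 0 (non-toxic)."""
    tokens = set(text.lower().split())
    return int(bool(tokens & TOXIC_TERMS))

contexts = [
    "I hate this stupid product",
    "The weather was pleasant all week",
]
labels = [weak_label(c) for c in contexts]
print(labels)  # [1, 0]
```

Labels produced this way are cheap but imperfect, which is why a learned classifier trained on them can generalize beyond the lexicon itself.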
- Zero-shot question answering with Google's Gemma-2b and T5 models
- Context-enhanced generation evaluation
- Performance assessment using BERTScore, ROUGE, and F1 metrics
- Comparative analysis between models with/without context
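The zero-shot versus context-enhanced comparison boils down to how the generation prompt is built. The exact prompt wording here is an assumption; the project feeds these prompts to Gemma-2b and T5 via the `transformers` generation API:

```python
from typing import Optional

def build_prompt(question: str, context: Optional[str] = None) -> str:
    """Build the generation prompt, optionally prepending retrieved context."""
    if context is None:  # zero-shot: the model must answer from its own memory
        return f"Question: {question}\nAnswer:"
    # Context-enhanced: the model can ground its answer in the passage.
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

q = "When was the bridge completed?"
ctx = "The bridge, finished in 1937, spans the strait."
print(build_prompt(q))       # zero-shot prompt
print(build_prompt(q, ctx))  # context-enhanced prompt
```

Running both variants through the same model and scoring the outputs against the reference answers yields the with/without-context comparison above.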
- Dense retrieval using sentence transformers
- Sparse retrieval with TF-IDF
- Context-aware answer generation with Gemma
- Performance evaluation against reference answers
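The sparse-retrieval branch can be sketched with scikit-learn as below; the mini-corpus is hypothetical. The dense branch has the same shape but swaps the TF-IDF vectorizer for sentence-transformer embeddings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus of contexts; the project indexes the full dataset.
contexts = [
    "The Eiffel Tower was completed in 1889 in Paris.",
    "Photosynthesis converts sunlight into chemical energy.",
    "The Great Wall of China stretches thousands of kilometres.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(contexts)

def retrieve(query, k=1):
    """Return indices of the top-k contexts by TF-IDF cosine similarity."""
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    return scores.argsort()[::-1][:k].tolist()

print(retrieve("When was the Eiffel Tower built?"))  # → [0]
```

The top-ranked context is then spliced into Gemma's prompt to produce the context-aware answer.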
- Gradio-based UI for search and question answering
- Text-to-speech synthesis for answer verbalization
- Automatic evaluation with BERTScore integration
- Comparative answer visualization
The project utilizes the neural-bridge/rag-dataset-12000 dataset from Hugging Face, which contains:
- Context passages that provide the information needed to answer each question
- Natural language questions related to each context
- Reference answers generated by GPT-4
- Train/test split with 9,600/2,400 samples
A copy of the dataset is also available on Kaggle.
Our analysis revealed several key insights:
- Question distribution: The dataset is heavily skewed toward "what" questions (78%), with far fewer "who," "how," and "why" questions
- Context length: Most contexts contain 200-450 tokens, making them substantial but manageable for retrieval systems
- Model performance: Gemma consistently outperforms T5 across all metrics, with context-enhanced answers showing substantial improvements over zero-shot generation
- BERTScore results: High semantic similarity (0.91 F1) between context-enhanced Gemma answers and reference answers, despite lower exact match rates
- Topic diversity: BERTopic successfully identified distinct thematic clusters within the dataset, with some containing potentially toxic content
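The gap between high BERTScore and lower exact match noted above is easiest to see with the standard token-overlap F1 metric, sketched here (the example strings are illustrative):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a prediction and a reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # multiset overlap
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# A paraphrased answer scores well on overlap even without an exact match.
print(token_f1("completed in 1937", "the bridge was completed in 1937"))  # ≈ 0.667
```

BERTScore goes one step further, comparing contextual embeddings instead of surface tokens, which is why semantically equivalent paraphrases score near 1.0 even when token F1 is modest.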
The project is implemented as a Jupyter notebook with comprehensive documentation and interactive components. To reproduce our results:
- Clone this repository
- Install the required dependencies: `pip install -r requirements.txt`
- Run the notebook `NLP-project.ipynb`
- Explore the interactive demonstrations and analysis
This project was completed as part of the Natural Language Processing course at Politecnico di Milano (a.y. 2024/25).