Advanced NLP implementation showcasing Retrieval-Augmented Generation (RAG), topic modeling, and transformer-based question answering using Gemma and T5 models on the neural-bridge dataset. Includes interactive demos, toxicity detection, and comprehensive performance evaluation.


🚀 NLP Project: RAG-based Question Answering System 🤖

Comprehensive implementation of advanced NLP techniques for question answering using Retrieval-Augmented Generation (RAG), toxicity detection, topic modeling, and large language model evaluation.

👨‍👩‍👧‍👦 Team Members

This project was developed by a team of students from Politecnico di Milano.

📋 Project Overview

This repository contains a comprehensive NLP project built on the neural-bridge/rag-dataset-12000 dataset. Our work encompasses the entire data science pipeline from exploratory data analysis to model deployment, with a special focus on question answering capabilities using state-of-the-art language models and retrieval techniques.

✨ Key Features & Components

📊 Data Analysis & Preprocessing

  • Exploratory data analysis of question-answer-context triples
  • Text cleaning with stopword removal, lemmatization, and normalization
  • Statistical analysis of question types and document lengths
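The cleaning step above can be sketched in pure Python. This is a minimal illustration with a tiny stopword set; the notebook presumably uses a full stopword list and a proper lemmatizer (e.g. NLTK's WordNetLemmatizer), which are omitted here to keep the sketch self-contained.

```python
import re
import string

# Tiny illustrative stopword set; a real pipeline would use a full list
# such as NLTK's English stopwords.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on"}

def clean_text(text: str) -> list[str]:
    """Lowercase, strip punctuation (normalization), tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

print(clean_text("What is the capital of France? The capital is Paris."))
# → ['what', 'capital', 'france', 'capital', 'paris']
```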

🔤 Embedding & Semantic Analysis

  • Word2Vec embeddings with skip-gram and negative sampling
  • K-means clustering with optimal cluster selection using silhouette analysis
  • TF-IDF vectorization with visual word highlighting
  • Sentence embeddings using all-MiniLM-L6-v2
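The silhouette-based cluster selection can be sketched with scikit-learn. The random blobs below are a toy stand-in for the project's actual sentence embeddings (an assumption made to keep the example self-contained and fast):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Toy stand-in for sentence embeddings: three well-separated Gaussian blobs.
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(50, 8)) for c in (0.0, 1.0, 2.0)])

def best_k(X, k_range=range(2, 7)):
    """Pick the cluster count that maximizes the silhouette coefficient."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)

print(best_k(X))  # → 3 for these well-separated blobs
```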

🧩 Topic Modeling & Classification

  • BERTopic implementation for context clustering
  • Topic visualization with interactive maps and hierarchical clustering
  • Toxicity detection using weakly supervised labeling
  • BERT-based binary classifier for content moderation
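The weak-supervision idea can be illustrated with a keyword-based labeling function. The lexicon below is hypothetical, not the project's actual rule set; the point is that such noisy labels can bootstrap training data for the BERT classifier:

```python
# Hypothetical keyword lexicon for weak supervision; a real pipeline would use
# a much larger lexicon or an off-the-shelf toxicity scorer.
TOXIC_LEXICON = {"idiot", "stupid", "hate"}

def weak_label(text: str) -> int:
    """Return 1 (toxic) if any lexicon word appears in the text, else 0."""
    tokens = set(text.lower().split())
    return int(bool(tokens & TOXIC_LEXICON))

corpus = ["I hate this stupid example", "The capital of France is Paris"]
labels = [weak_label(t) for t in corpus]
print(labels)  # → [1, 0]
```

The resulting (noisy) labels would then serve as training targets for fine-tuning the binary classifier.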

🤖 Language Model Evaluation

  • Zero-shot question answering with Google's Gemma-2b and T5 models
  • Context-enhanced generation evaluation
  • Performance assessment using BERTScore, ROUGE, and F1 metrics
  • Comparative analysis between models with/without context
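BERTScore and ROUGE come from dedicated libraries, but the token-level F1 metric can be sketched in pure Python (a SQuAD-style bag-of-tokens overlap, assumed here as the F1 variant used):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generated answer and a reference answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Paris is the capital", "The capital is Paris"))  # → 1.0
```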

🔍 Retrieval-Augmented Generation (RAG)

  • Dense retrieval using sentence transformers
  • Sparse retrieval with TF-IDF
  • Context-aware answer generation with Gemma
  • Performance evaluation against reference answers
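The sparse-retrieval leg can be sketched with scikit-learn. The three contexts below are illustrative placeholders; the project indexes the dataset's context passages instead:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy context store standing in for the dataset's context passages.
contexts = [
    "Paris is the capital and largest city of France.",
    "The Transformer architecture relies on self-attention.",
    "Mount Everest is the highest mountain above sea level.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(contexts)

def retrieve(question: str, k: int = 1) -> list[str]:
    """Rank contexts by cosine similarity to the question's TF-IDF vector."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    top = scores.argsort()[::-1][:k]
    return [contexts[i] for i in top]

print(retrieve("What is the capital of France?"))
# → ['Paris is the capital and largest city of France.']
```

The retrieved passage would then be prepended to the prompt given to Gemma; the dense leg follows the same pattern with sentence-transformer embeddings in place of TF-IDF vectors.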

🎮 Interactive Demonstrations

  • Gradio-based UI for search and question answering
  • Text-to-speech synthesis for answer verbalization
  • Automatic evaluation with BERTScore integration
  • Comparative answer visualization

📚 Dataset

The project utilizes the neural-bridge/rag-dataset-12000 dataset from Hugging Face, which contains:

  • Context passages that provide the information needed to answer each question
  • Natural language questions related to each context
  • Reference answers generated by GPT-4
  • Train/test split with 9,600/2,400 samples

A copy of the dataset is also available on Kaggle.

📝 Results & Findings

Our analysis revealed several key insights:

  • Question distribution: The dataset is heavily skewed toward "what" questions (78%), with far fewer "who," "how," and "why" questions
  • Context length: Most contexts contain 200-450 tokens, making them substantial but manageable for retrieval systems
  • Model performance: Gemma consistently outperforms T5 across all metrics, with context-enhanced answers showing substantial improvements over zero-shot generation
  • BERTScore results: High semantic similarity (0.91 F1) between context-enhanced Gemma answers and reference answers, despite lower exact match rates
  • Topic diversity: BERTopic successfully identified distinct thematic clusters within the dataset, with some containing potentially toxic content

💻 Usage

The project is implemented as a Jupyter notebook with comprehensive documentation and interactive components. To reproduce our results:

  1. Clone this repository
  2. Install required dependencies: `pip install -r requirements.txt`
  3. Run the notebook `NLP-project.ipynb`
  4. Explore the interactive demonstrations and analysis

🔗 Additional Resources


This project was completed as part of the Natural Language Processing course at Politecnico di Milano (a.y. 2024/25).
