Welcome! This repository contains a Natural Language Processing application that processes a text file to cluster sentences into topics, tracks the evolution of entities across clusters, and generates multiple-choice questions based on the clustered content. The project uses advanced NLP techniques, including sentence embeddings, entity recognition, and integration with the Groq API for question generation, to analyze and summarize historical or textual data (e.g., from history.txt).
https://docs.google.com/document/d/1EjMQ2aIJb6qrNYKEzcZV0XeclVjE8ovlYg8W-gP1IIU/edit?usp=sharing
- Topic Clustering: Groups sentences from a text file into topics using K-Means clustering and sentence embeddings (via
SentenceTransformer). - Entity Evolution Tracking: Analyzes entities in each cluster, capturing their properties (e.g., adjectives) and relationships (e.g., subject-verb-object triples) using SpaCy.
- MCQ Generation: Generates multiple-choice questions for selected clusters using the Groq API, with correct answers, distractors, and explanations based on entity data.
- Output Files: Saves clustering results (
labels.txt,topics.txt), entity evolution (entity_evolution.txt), and generated MCQs (generated_mcqs.txt). - Modular code structure for easy extension and customization.
- Python 3.8 or higher
- Git (to clone the repository)
- A virtual environment tool (e.g.,
venvorvirtualenv) - A Groq API key (sign up at https://groq.com and set it as an environment variable)
- Internet access for downloading NLTK data and accessing the Groq API
To set up the project, follow these steps. I strongly recommend using a virtual environment to manage dependencies and avoid conflicts with other projects.
-
Clone the repository:
git clone https://github.com/NathanP9000/NLP_Project.git cd NLP_Project -
Create and activate a virtual environment:
- On Windows (powershell):
python -m venv venv .\venv\Scripts\activate
- On macOS/Linux:
python3 -m venv venv source venv/bin/activate
- On Windows (powershell):
-
Install dependencies: The project includes a
requirements.txtfile listing all required packages. Install them using:pip install -r requirements.txt
This installs key libraries such as
numpy,spacy,sentence-transformers,scikit-learn,nltk, andrequests. -
Set up SpaCy model: Download the SpaCy English model:
python -m spacy download en_core_web_sm
-
Set up LLM API key: Obtain a Groq API key from https://groq.com and set it as an environment variable:
- On Windows:
set GROQ_API_KEY=your-api-key - On macOS/Linux:
- Option 1: Use
exportcommand (temporary, lasts for the current terminal session):export GROQ_API_KEY=your-api-key - Option 2: Use an
env.shfile (persistent across sessions): Create a file namedenv.shin the project root:Source the file to apply the environment variable:echo "export GROQ_API_KEY=your-api-key" > env.sh
To make it persistent, sourcesource env.shenv.shin your shell configuration file (e.g.,~/.bashrcor~/.zshrc) by addingsource /path/to/NLP_Project/env.sh.
- Option 1: Use
- On Windows:
-
Prepare input data:
- Place your input text file (e.g., historical or narrative text) in the project root as
history.txt. Ensure it is encoded in UTF-8. - The text will be tokenized into sentences and processed for clustering and MCQ generation.
- Place your input text file (e.g., historical or narrative text) in the project root as
-
Run the project:
- Execute the main script to process
history.txt, cluster topics, track entity evolution, and generate MCQs:python main.py
- The script performs the following:
- Clusters sentences into topics using K-Means (default: 30 clusters).
- Extracts entity properties and relationships for each cluster.
- Generates MCQs for a random subset of clusters (default: 5 clusters, 3 questions each) using the Groq API.
- Saves results to output files.
- Execute the main script to process
-
View results:
- Clustering outputs:
labels.txt: Cluster labels for each sentence.topics.txt: Sentences grouped by cluster.
- Entity evolution:
entity_evolution.txt: Properties and relationships of entities in each cluster.
- MCQs:
generated_mcqs.txt: Generated multiple-choice questions with answers and explanations.
- All output files are saved in the project root.
- Clustering outputs:
NLP_Project/
├── history.txt # Input text file (e.g., historical or narrative data)
├── labels.txt # Output: Cluster labels for sentences
├── topics.txt # Output: Sentences grouped by cluster
├── entity_evolution.txt # Output: Entity properties and relationships per cluster
├── generated_mcqs.txt # Output: Generated multiple-choice questions
├── main.py # Main script to run the project
├── requirements.txt # List of dependencies
└── README.md # Project documentation