This project demonstrates how to build a Retrieval-Augmented Generation (RAG) system. It combines generative AI with vector similarity search to retrieve car data.
The system uses the following components:
- Generated Data: Simulates a dataset of car details using OpenAI's GPT-3.5 Turbo model.
- Vectorization: Converts car descriptions into vector representations using OpenAI's Embedding model.
- Vector Storage: Stores and retrieves vectors with Chroma, a vector database used here in memory.
The generated dataset includes the following attributes:
- Name: Car name
- Price: Car price
- Engine: Engine type
- Description: Detailed car description
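As a rough illustration of the record shape described above, one row of the dataset could be modeled as a small dataclass. This is a minimal sketch, not the notebook's actual code: the class name `CarRecord` and the helper `embedding_text` are hypothetical, and storing the price as a string is an assumption based on the attribute list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CarRecord:
    """One row of the generated dataset; fields mirror the attribute list."""

    name: str
    price: str  # kept as text (e.g. "$25,000"); an assumption, not confirmed
    engine: str
    description: str

    def embedding_text(self) -> str:
        # Hypothetical helper: builds a single string that could be passed
        # to an embedding model. The project itself embeds the description.
        return f"{self.name} ({self.engine}, {self.price}): {self.description}"


car = CarRecord(
    name="Toyota Camry",
    price="$25,000",
    engine="Hybrid",
    description="A reliable and fuel-efficient sedan with advanced features.",
)
print(car.embedding_text())
```

Freezing the dataclass keeps records immutable once generated, which is one reasonable choice when the same rows are later embedded and stored.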
- Install Python 3.8 or later.
- Ensure you have pip installed.
Run the following command to install all necessary dependencies:
pip install -r requirements.txt
Create a .env file in the project directory with the following content:
OPENAI_API_KEY=<your-openai-api-key>
Replace <your-openai-api-key> with your actual OpenAI API key.
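To show how a key stored this way might reach the code, here is a minimal standard-library sketch that reads KEY=VALUE lines from a .env file into the process environment. The function name `load_env_file` is hypothetical; the notebook may well rely on a library such as python-dotenv instead, which this sketch does not assume.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file and export them to os.environ."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        loaded[key] = value
        # Values already set in the real environment take precedence.
        os.environ.setdefault(key, value)
    return loaded
```

A typical call would be `load_env_file()` followed by `os.environ.get("OPENAI_API_KEY")`. Keeping the key in the environment rather than in the notebook avoids committing it to version control.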
- Clone the Repository:
git clone https://github.com/thiago-grabe/rag-example.git
cd rag-example
- Set Up the Environment: Install the required packages by running:
pip install -r requirements.txt
- Run the Jupyter Notebook: Launch the notebook with:
jupyter notebook
- Open the notebook file and follow the instructions within.
- Data Augmentation: Generates a car dataset programmatically using OpenAI's GPT model.
- Vectorization: Embeds car descriptions into vector space for similarity search.
- Retrieval: Uses Chroma as the backend to fetch relevant vectors efficiently.
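The idea behind similarity search can be sketched without any external service. The example below ranks a few made-up three-dimensional vectors by cosine similarity to a query vector. It is a plain-Python stand-in for illustration only: the real project uses OpenAI embeddings and Chroma, whose index structure and distance metric may differ, and the names `cosine_similarity`, `top_k`, and the sample vectors are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the k (label, score) pairs most similar to the query."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors; real embeddings have far more dimensions.
index = {
    "hybrid sedan": [0.9, 0.1, 0.0],
    "diesel truck": [0.1, 0.9, 0.2],
    "electric hatchback": [0.7, 0.0, 0.6],
}
query = [1.0, 0.0, 0.1]
results = top_k(query, index, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

This brute-force scan compares the query against every stored vector, which is fine for a handful of cars. A vector database exists to make the same kind of lookup practical at larger scale.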
The following dependencies are used in the project (also listed in requirements.txt):
- langchain: For handling LLM-based workflows.
- chromadb: Vector database for efficient retrieval.
- transformers: Hugging Face library for models and tokenizers.
- sentence-transformers: For embedding generation.
- openai: For GPT-based generation and embedding creation.
- numpy: Numerical computations.
- ipywidgets and ipykernel: For interactive notebook features.
An example entry in the dataset:
Name: Toyota Camry
Price: $25,000
Engine: Hybrid
Description: A reliable and fuel-efficient sedan with advanced features.
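An entry in this "Key: value" layout can be turned into a dictionary with a few lines of Python, which is handy when attaching fields such as price or engine as metadata next to the embedded description. The sketch below is illustrative only: `parse_entry` and `price_to_int` are hypothetical helpers, and the price conversion assumes whole-dollar amounts like the one shown.

```python
def parse_entry(text):
    """Turn 'Key: value' lines into a dict with lowercase keys."""
    entry = {}
    for line in text.strip().splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # ignore lines without a 'Key: value' shape
        entry[key.strip().lower()] = value.strip()
    return entry


def price_to_int(price):
    """Convert a whole-dollar price string such as '$25,000' to an int."""
    digits = "".join(ch for ch in price if ch.isdigit())
    return int(digits)


sample = """
Name: Toyota Camry
Price: $25,000
Engine: Hybrid
Description: A reliable and fuel-efficient sedan with advanced features.
"""
entry = parse_entry(sample)
print(entry["name"], price_to_int(entry["price"]))
```

Storing the price as a number as well as text would allow numeric filtering of results, should the project need it.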
- Enhance the dataset generation pipeline.
- Implement advanced ranking algorithms for retrieved results.
- Integrate with external APIs for real-world datasets.