RAG Engine with Pinecone and OpenAI

This project implements a Retrieval-Augmented Generation (RAG) system that indexes equipment manuals, processes PDF documents (both searchable and scanned), and answers questions from the indexed content. It uses Pinecone for vector storage and search, OpenAI embeddings for indexing and querying, and Optical Character Recognition (OCR) via Tesseract to process non-searchable PDFs.

Resources

Miro Board
Medium Article

Prerequisites

Python 3.12
Poetry (for managing dependencies)
Pinecone API Key
OpenAI API Key

Setup

1. Clone the repository

git clone <repository_url>
cd <repository_directory>

2. Install dependencies

This project uses Poetry for dependency management. Install Poetry and then install the project dependencies:

pip install poetry
poetry install

3. Create `.env` file

Create a .env file in the project root directory with the following content:

OPENAI_API_KEY=<your_openai_api_key>
PINECONE_API_KEY=<your_pinecone_api_key>

Ensure that the .env file is listed in .gitignore so that your API keys are never committed to version control.
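As a rough illustration of how these two keys could be picked up at startup, here is a minimal standard-library sketch. The helper names `load_env_file` and `require_keys` are hypothetical and are not taken from the project's scripts, which may load the file differently (for example through a dotenv package).

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper: the project itself may load its keys differently.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip().strip("'\"")
        # Variables already set in the environment take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded


def require_keys(names=("OPENAI_API_KEY", "PINECONE_API_KEY")):
    """Return the names of required variables that are missing or empty."""
    return [name for name in names if not os.environ.get(name)]
```

Calling `load_env_file()` followed by `require_keys()` before any API call gives an early, readable failure if a key was left out of the file.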

4. Install Tesseract and other dependencies

For OCR functionality, you will need Tesseract installed on your system.

On Ubuntu:

sudo apt-get install tesseract-ocr

On macOS (using Homebrew):

brew install tesseract

The pdf2image library, which converts PDF pages to images for OCR, additionally requires Poppler:

On Ubuntu:

sudo apt-get install poppler-utils  # Required for pdf2image

On macOS (using Homebrew):

brew install poppler
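To confirm that both system tools are reachable before indexing, a small check along the following lines may help. This is a sketch, not part of the project: `find_ocr_tools` is a hypothetical name, and it assumes that Poppler exposes the `pdftoppm` command, which is the binary pdf2image typically calls.

```python
import shutil


def find_ocr_tools(commands=("tesseract", "pdftoppm")):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in commands}


if __name__ == "__main__":
    for name, path in find_ocr_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```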

Running the Project

1. Indexing PDFs in Pinecone

To index PDFs into Pinecone, run the indexing_pinecone.py script. This script extracts text from the provided PDFs (handling both searchable and scanned PDFs), generates embeddings using OpenAI, and upserts the vectors into Pinecone.

python indexing_pinecone.py

Ensure you have the necessary PDFs (e.g., Caterpillar-3500-generator-sets-operation-and-maintenance-manual.pdf and Waukesha_VGF_f18g.pdf) in the manuals/ directory.
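The indexing step can be pictured as turning each page into one record made of an ID, an embedding vector, and metadata. The sketch below shows that shape only. `build_page_records` and `batched` are hypothetical helpers, the metadata keys `source_doc` and `page_number` are assumed from the parameter names used in RAG.py, and `embed` stands in for whatever OpenAI embeddings call the script actually makes.

```python
def build_page_records(source_doc, page_texts, embed):
    """Turn per-page text into upsert-ready records.

    `embed` is any callable mapping a string to a list of floats; in the real
    script this would presumably wrap an OpenAI embeddings call (assumption).
    """
    records = []
    for page_number, text in enumerate(page_texts, start=1):
        text = text.strip()
        if not text:
            continue  # nothing to embed for blank pages
        records.append({
            "id": f"{source_doc}-p{page_number}",  # hypothetical ID scheme
            "values": embed(text),
            "metadata": {"source_doc": source_doc, "page_number": page_number},
        })
    return records


def batched(items, size=100):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each batch could then be handed to the Pinecone index's upsert call. Keeping one record per page means a query result can point back to an exact page of a manual.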

2. Running the RAG Pipeline

Once your PDFs are indexed, run the RAG.py script to query the indexed manuals. The script prompts you for a question, retrieves the most relevant pages from the manuals using Pinecone, and displays their text.

python RAG.py

If the results span more than one manual (e.g., Caterpillar and Waukesha), you will be prompted to choose which manual to read the relevant pages from.
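The selection step could look roughly like the sketch below. It assumes each match is a plain dict with a `score` and a `metadata` entry holding `source_doc` and `page_number`. Those names, and the three helper functions, are illustrative assumptions, not the actual code in RAG.py.

```python
def manuals_in_matches(matches):
    """Return distinct source documents in order of first appearance."""
    seen = []
    for match in matches:
        doc = match["metadata"]["source_doc"]
        if doc not in seen:
            seen.append(doc)
    return seen


def pages_for_manual(matches, chosen_doc):
    """Keep only matches from the chosen manual, highest score first."""
    selected = [m for m in matches if m["metadata"]["source_doc"] == chosen_doc]
    return sorted(selected, key=lambda m: m["score"], reverse=True)


def prompt_for_manual(manuals, ask=input):
    """Ask until the user enters a valid 1-based option number."""
    while True:
        for number, name in enumerate(manuals, start=1):
            print(f"{number}. {name}")
        answer = ask("Select a manual: ").strip()
        if answer.isdecimal() and 1 <= int(answer) <= len(manuals):
            return manuals[int(answer) - 1]
```

Passing `ask` as a parameter keeps the prompt testable without a terminal; in interactive use the default `input` applies.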

Functions

`indexing_pinecone.py`

initialize_pinecone(): Initializes Pinecone, creating an index if necessary.
clear_pinecone_index(index): Clears the Pinecone index before inserting new data.
extract_text_from_pdf(pdf_path): Extracts text from a PDF, using OCR if necessary for scanned documents.
main: Upserts each page with the relevant metadata to Pinecone.
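The "OCR if necessary" behaviour of extract_text_from_pdf can be sketched as a simple fallback rule: use the PDF's embedded text layer when it yields enough characters, otherwise run OCR. The function name `extract_with_ocr_fallback` and the `min_chars` threshold are assumptions made for illustration; the project's actual criterion for detecting a scanned page is not documented here.

```python
def extract_with_ocr_fallback(native_text, run_ocr, min_chars=20):
    """Return the embedded text layer if it looks usable, else the OCR result.

    native_text: text pulled from the PDF's text layer (may be empty or None).
    run_ocr: zero-argument callable that performs OCR only when invoked.
    min_chars: hypothetical threshold below which a page is treated as scanned.
    """
    cleaned = (native_text or "").strip()
    if len(cleaned) >= min_chars:
        return cleaned, "text-layer"
    # Too little embedded text: treat the page as scanned and fall back to OCR.
    return run_ocr().strip(), "ocr"
```

Because `run_ocr` is only called when the text layer is too short, the comparatively slow OCR path (rendering the page to an image and passing it to Tesseract) is skipped for searchable pages.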

`RAG.py`

get_text_from_pdf_page(source_doc, page_number): Extracts text from a specific page of a PDF, using OCR if necessary.
query_pinecone(question): Queries Pinecone for the most relevant pages based on a question.
choose_manual(manuals_retrieved): Allows the user to select the relevant manual from the results.
main: Retrieves chunks (pages) relevant to the selected manual.
