This system processes PDF files, generates embeddings with a local AI model, and stores them in PostgreSQL for semantic search. It features advanced OCR, semantic chunking, and hierarchical document understanding.
- Bulk PDF processing with progress tracking
- Automatic text and image extraction from PDFs
- Advanced OCR with image preprocessing:
  - Automatic image format detection and conversion
  - Image enhancement for better OCR quality
  - Support for multiple image formats (JPEG, PNG, GIF, BMP, TIFF)
  - Confidence scoring for OCR results
  - Automatic image resizing and contrast enhancement
- AI-powered text embeddings generation for semantic understanding
- Hierarchical document processing:
  - Document-level embeddings
  - Section-level chunking
  - Paragraph-level analysis
  - Element-level processing (tables, images, forms)
- Semantic metadata extraction:
  - Entity recognition
  - Key phrase extraction
  - Sentiment analysis
- Vector-based semantic search across all processed documents
- Detailed processing summaries and error handling
- Progress tracking with rich console output
- Efficient storage of embeddings in PostgreSQL using JSON format
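As a minimal illustration of the JSON storage format (a sketch of the general idea, not the project's exact schema), an embedding vector can be round-tripped through a PostgreSQL JSON column like this:

```python
import json

def embedding_to_json(vector):
    """Serialize an embedding vector to a JSON string for a JSON column."""
    return json.dumps([float(x) for x in vector])

def embedding_from_json(payload):
    """Restore the vector from its stored JSON representation."""
    return json.loads(payload)

stored = embedding_to_json([0.12, -0.5, 0.33])
restored = embedding_from_json(stored)
```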
- Python 3.8+
- PostgreSQL 16+ with pgvector extension
- Local AI embedding service running on port 1234
- Tesseract OCR engine
- `pip` and `venv` modules
- Clone the repository:
```bash
git clone https://github.com/kundu/doc-rag.git
cd doc-rag
```
- Create and activate virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Install Tesseract OCR:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-eng  # English language pack

# macOS
brew install tesseract

# Windows
# Download the installer from https://github.com/UB-Mannheim/tesseract/wiki
```
- Install PostgreSQL and pgvector extension:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install postgresql-16 postgresql-16-pgvector

# After installation, create the database and enable the extension
sudo -u postgres createdb pdf_storage
sudo -u postgres psql -d pdf_storage -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
- Set up environment variables:
```bash
# Copy the example environment file
cp .env.example .env

# Edit the .env file with your configurations
nano .env  # or use any text editor
```
Update the following variables in your `.env` file:
```env
# Database Configuration
DB_USER=postgres          # Your PostgreSQL username
DB_PASSWORD=your_password # Your PostgreSQL password
DB_HOST=localhost         # Database host
DB_PORT=5432              # Database port
DB_NAME=pdf_storage       # Database name

# AI Embedding Service Configuration
AI_API_URL=http://127.0.0.1:1234/v1/embeddings    # Your embedding service URL
AI_MODEL=text-embedding-nomic-embed-text-v1.5@f32 # Model name

# QA System Configuration
QA_API_URL=http://localhost:1234/v1/chat/completions
QA_MODEL=qwen2-0.5b-instruct
```
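For orientation, these variables might be read at runtime along the following lines (an illustrative stdlib-only helper, not the project's actual configuration code):

```python
import os

def load_db_config():
    """Read database settings from the environment, falling back to the README defaults."""
    return {
        "user": os.getenv("DB_USER", "postgres"),
        "password": os.getenv("DB_PASSWORD", ""),
        "host": os.getenv("DB_HOST", "localhost"),
        "port": int(os.getenv("DB_PORT", "5432")),
        "name": os.getenv("DB_NAME", "pdf_storage"),
    }

config = load_db_config()
```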
- Place your PDF files in the `pdf` directory:
```bash
mkdir -p pdf
cp your_pdfs/*.pdf pdf/
```
- Run the processor:
```bash
bash -c 'source venv/bin/activate && python pdf_processor.py'
```
- Run the QA system:
```bash
bash -c 'source venv/bin/activate && python qa_system.py'
```
The script will:
- Process all new PDF files in the `pdf` directory
- Extract and OCR text from both document content and images
- Generate hierarchical embeddings
- Extract semantic metadata
- Store all information in the database
- Show detailed progress and processing summaries
- Search through processed PDFs:
```python
from pdf_processor import search_similar_content

# Search for specific content
results = search_similar_content("your search query")
```
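Under the hood, a vector search like this typically ranks stored chunks by cosine similarity between the query embedding and each stored embedding. The snippet below is a hypothetical illustration of that ranking step, not the actual implementation in `pdf_processor.py`:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_chunks(query_embedding, chunks):
    """Sort (content, embedding) pairs by similarity to the query, best first."""
    return sorted(chunks, key=lambda c: cosine_similarity(query_embedding, c[1]), reverse=True)

chunks = [("intro", [1.0, 0.0]), ("methods", [0.0, 1.0])]
best = rank_chunks([0.9, 0.1], chunks)[0]
```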
```
.
├── pdf/              # Directory for PDF files
├── models.py         # Database models and enums
├── pdf_processor.py  # Main processing script
├── qa_system.py      # Question answering system
├── requirements.txt  # Python dependencies
├── .env              # Environment variables
└── README.md         # This file
```
- `pdf_files`: Stores PDF file metadata and document-level embeddings
  - id (Primary Key)
  - filename
  - file_path
  - upload_date
  - file_size
  - total_pages
  - pdf_metadata (JSON)
  - document_embedding (JSON)
- `pdf_embeddings`: Stores content embeddings and metadata
  - id (Primary Key)
  - pdf_file_id (Foreign Key)
  - page_number
  - hierarchy_level (DOCUMENT, SECTION, PARAGRAPH, ELEMENT)
  - content_type (TEXT, TABLE, IMAGE, FORM)
  - page_content
  - embedding (JSON)
  - position (JSON)
  - content_format (JSON)
  - context
  - semantic_metadata (JSON)
  - confidence
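For orientation, the `pdf_embeddings` columns map roughly onto a record shaped like the following. This is a plain-Python sketch of the row shape only; the actual models presumably live in `models.py`:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PdfEmbeddingRow:
    """Illustrative shape of one pdf_embeddings row (not the real ORM model)."""
    id: int
    pdf_file_id: int                    # foreign key into pdf_files
    page_number: int
    hierarchy_level: str                # DOCUMENT, SECTION, PARAGRAPH, ELEMENT
    content_type: str                   # TEXT, TABLE, IMAGE, FORM
    page_content: str
    embedding: list                     # stored as JSON in the database
    confidence: Optional[float] = None  # e.g. OCR confidence, when applicable
    semantic_metadata: dict = field(default_factory=dict)

row = PdfEmbeddingRow(1, 1, 3, "PARAGRAPH", "TEXT", "some text", [0.1, 0.2])
```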
The system includes comprehensive error handling:
- Graceful handling of corrupted PDFs
- Recovery from OCR failures
- Fallback chunking for large documents
- Image format conversion and validation
- Detailed error reporting and logging
- Efficient memory usage through streaming processing
- Image preprocessing for optimal OCR results
- Chunked processing of large documents
- Configurable chunk sizes and limits
- Background processing capabilities
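The chunked-processing idea above can be sketched as a simple sliding-window splitter. This is illustrative only; the actual chunk sizes and the semantic chunking logic live in `pdf_processor.py`, and the parameters here are assumptions:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Yield overlapping character windows so no chunk exceeds chunk_size."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # each window starts where the previous one ends, minus the overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]

chunks = list(chunk_text("a" * 1200, chunk_size=500, overlap=50))
```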
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a new Pull Request
- **Database Connection Issues**
  - Ensure PostgreSQL is running
  - Verify database credentials in `.env`
  - Check if the pgvector extension is installed
- **OCR Issues**
  - Verify Tesseract OCR is installed
  - Check image quality and format
  - Adjust preprocessing parameters if needed
- **Embedding Service Issues**
  - Verify the embedding service is running
  - Check the API URL in `.env`
  - Ensure the model name is correct
- **Memory Issues**
  - Adjust chunk sizes in the configuration
  - Process fewer files simultaneously
  - Check available system resources
For support, please create an issue or contact Sauvik Kundu.