This system processes PDF files, generates embeddings with a local AI model, and stores them in PostgreSQL for semantic search. It features advanced OCR, semantic chunking, and hierarchical document understanding.
- Bulk PDF processing with progress tracking
- Automatic text and image extraction from PDFs
- Advanced OCR with image preprocessing:
  - Automatic image format detection and conversion
  - Image enhancement for better OCR quality
  - Support for multiple image formats (JPEG, PNG, GIF, BMP, TIFF)
  - Confidence scoring for OCR results
  - Automatic image resizing and contrast enhancement
- AI-powered text embeddings generation for semantic understanding
- Hierarchical document processing:
  - Document-level embeddings
  - Section-level chunking
  - Paragraph-level analysis
  - Element-level processing (tables, images, forms)
- Semantic metadata extraction:
  - Entity recognition
  - Key phrase extraction
  - Sentiment analysis
- Vector-based semantic search across all processed documents
- Detailed processing summaries and error handling
- Progress tracking with rich console output
- Efficient storage of embeddings in PostgreSQL using JSON format
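As a minimal illustration of the JSON storage format (a sketch of the general idea, not the project's exact schema), an embedding vector can be round-tripped through a PostgreSQL JSON column like this:

```python
import json

def embedding_to_json(vector):
    """Serialize an embedding vector to a JSON string for a JSON column."""
    return json.dumps([float(x) for x in vector])

def embedding_from_json(payload):
    """Restore the vector from its stored JSON representation."""
    return json.loads(payload)

stored = embedding_to_json([0.12, -0.5, 0.33])
restored = embedding_from_json(stored)
```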
- Python 3.8+
- PostgreSQL 16+ with pgvector extension
- Local AI embedding service running on port 1234
- Tesseract OCR engine
- `pip` and `venv` modules
- Clone the repository:
```bash
git clone https://github.com/kundu/doc-rag.git
cd doc-rag
```
- Create and activate virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```
- Install dependencies:
```bash
pip install -r requirements.txt
```
- Install Tesseract OCR:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install tesseract-ocr
sudo apt-get install tesseract-ocr-eng  # English language pack

# macOS
brew install tesseract

# Windows
# Download the installer from https://github.com/UB-Mannheim/tesseract/wiki
```
- Install PostgreSQL and pgvector extension:
```bash
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install postgresql-16 postgresql-16-pgvector

# After installation, create the database and enable the extension
sudo -u postgres createdb pdf_storage
sudo -u postgres psql -d pdf_storage -c "CREATE EXTENSION IF NOT EXISTS vector;"
```
- Set up environment variables:
```bash
# Copy the example environment file
cp .env.example .env

# Edit the .env file with your configurations
nano .env  # or use any text editor
```
Update the following variables in your `.env` file:
```env
# Database Configuration
DB_USER=postgres          # Your PostgreSQL username
DB_PASSWORD=your_password # Your PostgreSQL password
DB_HOST=localhost         # Database host
DB_PORT=5432              # Database port
DB_NAME=pdf_storage       # Database name

# AI Embedding Service Configuration
AI_API_URL=http://127.0.0.1:1234/v1/embeddings    # Your embedding service URL
AI_MODEL=text-embedding-nomic-embed-text-v1.5@f32 # Model name

# QA System Configuration
QA_API_URL=http://localhost:1234/v1/chat/completions
QA_MODEL=qwen2-0.5b-instruct
```
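For orientation, these variables might be read at runtime along the following lines (an illustrative stdlib-only helper, not the project's actual configuration code):

```python
import os

def load_db_config():
    """Read database settings from the environment, falling back to the README defaults."""
    return {
        "user": os.getenv("DB_USER", "postgres"),
        "password": os.getenv("DB_PASSWORD", ""),
        "host": os.getenv("DB_HOST", "localhost"),
        "port": int(os.getenv("DB_PORT", "5432")),
        "name": os.getenv("DB_NAME", "pdf_storage"),
    }

config = load_db_config()
```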
- Place your PDF files in the `pdf` directory:
```bash
mkdir -p pdf
cp your_pdfs/*.pdf pdf/
```
- Run the processor:
```bash
bash -c 'source venv/bin/activate && python pdf_processor.py'
```
- Run the QA system:
```bash
bash -c 'source venv/bin/activate && python qa_system.py'
```
The script will:
- Process all new PDF files in the `pdf` directory
- Extract and OCR text from both document content and images
- Generate hierarchical embeddings
- Extract semantic metadata
- Store all information in the database
- Show detailed progress and processing summaries
- Search through processed PDFs:
```python
from pdf_processor import search_similar_content

# Search for specific content
results = search_similar_content("your search query")
```
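Under the hood, a vector search like this typically ranks stored chunks by cosine similarity between the query embedding and each stored embedding. The snippet below is a hypothetical illustration of that ranking step, not the actual implementation in `pdf_processor.py`:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_chunks(query_embedding, chunks):
    """Sort (content, embedding) pairs by similarity to the query, best first."""
    return sorted(chunks, key=lambda c: cosine_similarity(query_embedding, c[1]), reverse=True)

chunks = [("intro", [1.0, 0.0]), ("methods", [0.0, 1.0])]
best = rank_chunks([0.9, 0.1], chunks)[0]
```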
```
.
├── pdf/              # Directory for PDF files
├── models.py         # Database models and enums
├── pdf_processor.py  # Main processing script
├── qa_system.py      # Question answering system
├── requirements.txt  # Python dependencies
├── .env              # Environment variables
└── README.md         # This file
```
- `pdf_files`: Stores PDF file metadata and document-level embeddings
  - id (Primary Key)
  - filename
  - file_path
  - upload_date
  - file_size
  - total_pages
  - pdf_metadata (JSON)
  - document_embedding (JSON)
- `pdf_embeddings`: Stores content embeddings and metadata
  - id (Primary Key)
  - pdf_file_id (Foreign Key)
  - page_number
  - hierarchy_level (DOCUMENT, SECTION, PARAGRAPH, ELEMENT)
  - content_type (TEXT, TABLE, IMAGE, FORM)
  - page_content
  - embedding (JSON)
  - position (JSON)
  - content_format (JSON)
  - context
  - semantic_metadata (JSON)
  - confidence
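For orientation, the `pdf_embeddings` columns map roughly onto a record shaped like the following. This is a plain-Python sketch of the row shape only; the actual models presumably live in `models.py`:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class PdfEmbeddingRow:
    """Illustrative shape of one pdf_embeddings row (not the real ORM model)."""
    id: int
    pdf_file_id: int                    # foreign key into pdf_files
    page_number: int
    hierarchy_level: str                # DOCUMENT, SECTION, PARAGRAPH, ELEMENT
    content_type: str                   # TEXT, TABLE, IMAGE, FORM
    page_content: str
    embedding: list                     # stored as JSON in the database
    confidence: Optional[float] = None  # e.g. OCR confidence, when applicable
    semantic_metadata: dict = field(default_factory=dict)

row = PdfEmbeddingRow(1, 1, 3, "PARAGRAPH", "TEXT", "some text", [0.1, 0.2])
```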
The system includes comprehensive error handling:
- Graceful handling of corrupted PDFs
- Recovery from OCR failures
- Fallback chunking for large documents
- Image format conversion and validation
- Detailed error reporting and logging
- Efficient memory usage through streaming processing
- Image preprocessing for optimal OCR results
- Chunked processing of large documents
- Configurable chunk sizes and limits
- Background processing capabilities
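The chunked-processing idea above can be sketched as a simple sliding-window splitter. This is illustrative only; the actual chunk sizes and the semantic chunking logic live in `pdf_processor.py`, and the parameters here are assumptions:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Yield overlapping character windows so no chunk exceeds chunk_size."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # each window starts where the previous one ends, minus the overlap
    for start in range(0, len(text), step):
        yield text[start:start + chunk_size]

chunks = list(chunk_text("a" * 1200, chunk_size=500, overlap=50))
```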
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a new Pull Request
- **Database Connection Issues**
  - Ensure PostgreSQL is running
  - Verify database credentials in `.env`
  - Check if the pgvector extension is installed
- **OCR Issues**
  - Verify Tesseract OCR is installed
  - Check image quality and format
  - Adjust preprocessing parameters if needed
- **Embedding Service Issues**
  - Verify the embedding service is running
  - Check the API URL in `.env`
  - Ensure the model name is correct
- **Memory Issues**
  - Adjust chunk sizes in the configuration
  - Process fewer files simultaneously
  - Check available system resources
For support, please create an issue or contact Sauvik Kundu.