An AI-powered PDF search and question-answering system that combines large language models, semantic vector search, and persistent conversation memory to deliver intelligent document analysis and natural-language interaction.
Author: Muhammad Husnain Ali
- Python - Primary programming language
- LangChain - Framework for building LLM applications
- OpenAI GPT-3.5 - Large Language Model for text processing
- Streamlit - Web application framework
- Pinecone - Vector database for similarity search
- Supabase - PostgreSQL database for conversation history
- PyPDF2 - PDF processing library
- OpenAI Embeddings - text-embedding-3-small model (512 dimensions)
- Vector Search - Semantic similarity matching
- Conversation Memory - Context-aware chat history
- Python Virtual Environment - Dependency isolation
- Environment Variables - Secure configuration management
- SQL - Database schema management
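The 512-dimension figure above comes from the `dimensions` parameter that the text-embedding-3 models support in the OpenAI API. As a minimal sketch of producing such an embedding (assuming the OpenAI Python SDK v1+; not necessarily how this project wires it up):

```python
# Minimal sketch: generating a 512-dimension embedding with the OpenAI SDK.
# Assumes OPENAI_API_KEY is set in the environment. The text-embedding-3
# models accept a `dimensions` parameter that shortens the output vector.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What does chapter 3 say about revenue?",
    dimensions=512,  # must match the Pinecone index dimension
)
vector = response.data[0].embedding
print(len(vector))  # 512
```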
- **Advanced PDF Processing**
  - Automatic text extraction and semantic chunking
  - Support for multiple PDF uploads
  - Intelligent document metadata preservation
  - OCR support for scanned documents
- **AI-Powered Question Answering**
  - Natural language understanding
  - Context-aware responses
  - Multi-document correlation
  - Source attribution with page numbers
- **Enterprise-Grade Vector Search**
  - High-performance similarity matching (see the query sketch after this list)
  - Scalable document indexing
  - Real-time search capabilities
  - Configurable search parameters
- **Smart Conversation Management**
  - Persistent chat history
  - Context retention across sessions
  - Conversation summarization
  - Multi-user support
- **Modern Web Interface**
  - Responsive design
  - Dark/Light mode support
  - Real-time updates
  - Mobile-friendly layout
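To make the vector-search feature concrete, here is a hypothetical similarity query using the current Pinecone Python client; the `top_k` value and metadata fields are illustrative assumptions, not taken from this project's `vector_store.py`:

```python
# Hypothetical similarity query against a Pinecone index. The metadata
# field names ("page") are assumptions for illustration only.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

question_embedding = [0.0] * 512  # stand-in; use a real 512-dim embedding here

results = index.query(
    vector=question_embedding,
    top_k=5,                 # number of chunks to retrieve
    include_metadata=True,   # return stored chunk text / page numbers
)
for match in results.matches:
    print(match.score, (match.metadata or {}).get("page"))
```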
- **Frontend**: Streamlit web interface
- **LLM**: OpenAI GPT-3.5-turbo for intelligent responses
- **Embeddings**: OpenAI text-embedding-3-small (512 dimensions)
- **Vector Store**: Pinecone for document similarity search
- **Memory**: Supabase PostgreSQL for conversation persistence
- **PDF Processing**: PyPDF2 with intelligent text chunking
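For the answer-generation step, here is a sketch of how retrieved chunks might be passed to GPT-3.5-turbo; the prompt wording is illustrative and `qa_system.py` may differ:

```python
# Sketch only: stuffing retrieved chunks into a GPT-3.5-turbo prompt.
# The system prompt is an assumption, not this project's actual prompt.
from openai import OpenAI

client = OpenAI()

def answer(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)  # text of the top-k retrieved chunks
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "Cite page numbers when the context includes them."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```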
- Python 3.8+
- OpenAI API key
- Pinecone account and API key
- Supabase project (for conversation memory)
```bash
# Clone the repository
git clone <repository-url>
cd ai-pdf-search-engine

# Create and activate virtual environment
# For Windows
python -m venv venv
.\venv\Scripts\activate

# For macOS/Linux
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
Create a `.env` file in the project root:
```env
OPENAI_API_KEY=your_openai_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=your_index_name
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSION=512
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
```
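These values are typically loaded once at startup. A minimal sketch of what `config.py` plausibly does, assuming the `python-dotenv` package (the actual file may differ):

```python
# Sketch of loading the .env values above; variable names mirror the
# template, but this is an assumption about config.py, not its contents.
import os
from dotenv import load_dotenv

load_dotenv()  # pull values from the .env file into the environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]        # required
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]    # required
PINECONE_INDEX_NAME = os.environ["PINECONE_INDEX_NAME"]
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
EMBEDDING_DIMENSION = int(os.getenv("EMBEDDING_DIMENSION", "512"))
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]
```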
- Navigate to your Supabase project dashboard
- Go to the SQL Editor
- Open the provided `setup_supabase.sql` file in the project root
- Execute the SQL commands to:
  - Create chat sessions and messages tables
  - Set up appropriate indexes
  - Enable Row Level Security (RLS)
  - Configure access policies
The SQL file includes all necessary table definitions, indexes, and security policies for the chat system.
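Once the schema exists, persisting and replaying a conversation from Python is straightforward with `supabase-py`. The snippet below is illustrative only; the table and column names (`chat_messages`, `session_id`, `role`, `content`, `created_at`) are assumptions, and the authoritative schema lives in `setup_supabase.sql`:

```python
# Illustrative only: storing and fetching chat turns with supabase-py.
# Table/column names are assumptions; see setup_supabase.sql for the
# actual schema used by this project.
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

# Persist one user turn
supabase.table("chat_messages").insert({
    "session_id": "demo-session",
    "role": "user",
    "content": "What does the contract say about termination?",
}).execute()

# Replay the session history in chronological order
history = (
    supabase.table("chat_messages")
    .select("role, content")
    .eq("session_id", "demo-session")
    .order("created_at")
    .execute()
)
print(history.data)
```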
```bash
# Make sure your virtual environment is activated
# For Windows
.\venv\Scripts\activate
# For macOS/Linux
source venv/bin/activate

# Run the application
streamlit run app.py
```

When you're done working on the project, you can deactivate the virtual environment:

```bash
deactivate
```
```
ai-pdf-search-engine/
├── app.py                 # Streamlit web interface
├── config.py              # Configuration and environment variables
├── pdf_processor.py       # PDF text extraction and chunking
├── vector_store.py        # Pinecone vector database integration
├── qa_system.py           # Question-answering logic
├── pdf_search_engine.py   # Main orchestration class
├── supabase_memory.py     # Conversation memory with Supabase
├── requirements.txt       # Python dependencies
├── .env.example           # Environment variables template
├── setup_supabase.sql     # Database schema for memory
├── .gitignore             # Git ignore configuration
└── README.md              # This file
```
```python
# config.py
CHUNK_SIZE = 1000     # Adjust based on document complexity
CHUNK_OVERLAP = 200   # Increase for better context preservation
MAX_DOCUMENTS = 100   # Limit concurrent documents
CACHE_TTL = 3600      # Cache lifetime in seconds
```
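As a rough illustration of how `CHUNK_SIZE` and `CHUNK_OVERLAP` drive chunking, here is a sketch assuming LangChain's `RecursiveCharacterTextSplitter`; `pdf_processor.py` may chunk differently:

```python
# Sketch only: how CHUNK_SIZE / CHUNK_OVERLAP plausibly feed LangChain's
# text splitter. The splitter choice is an assumption, not taken from
# this project's pdf_processor.py.
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,       # max characters per chunk
    chunk_overlap=CHUNK_OVERLAP, # characters shared between adjacent chunks
)

page_text = "Extracted PDF text goes here. " * 200  # stand-in input
chunks = splitter.split_text(page_text)
print(f"{len(chunks)} chunks of <= {CHUNK_SIZE} characters")
```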
- Recommended Pinecone tier: Standard or Enterprise
- Minimum RAM: 4GB
- Recommended CPU: 4 cores
- Storage: 10GB+ for document cache
- **PDF Processing Fails**
  - Ensure PDF is not password protected
  - Check file permissions
  - Verify PDF is not corrupted
- **Vector Store Errors**
  - Confirm Pinecone API key is valid
  - Check index dimensions match configuration (see the dimension check after this list)
  - Verify network connectivity
- **Memory Issues**
  - Clear browser cache
  - Restart application
  - Check Supabase connection
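For the index-dimension issue above, a quick sanity check (assuming the v3+ `pinecone` client) confirms the index was created with 512 dimensions to match `EMBEDDING_DIMENSION`:

```python
# Quick check that the Pinecone index dimension matches EMBEDDING_DIMENSION.
# A mismatch between the index and the embedding size causes upsert/query
# errors at runtime.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
desc = pc.describe_index(os.environ["PINECONE_INDEX_NAME"])
print("index dimension:", desc.dimension)  # should print 512
```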
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit changes (`git commit -m 'Add AmazingFeature'`)
- Push to branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- OpenAI team for their powerful language models
- Pinecone for vector search capabilities
- Supabase team for the excellent database platform
- LangChain community for the framework
- All contributors and users of this project
- 📧 Email: [email protected]
- 📝 Issues: GitHub Issues
Made with ❤️ by Muhammad Husnain Ali