An AI-powered PDF search and question-answering system that combines large language models, semantic vector search, and persistent conversation memory to deliver intelligent document analysis and natural-language interaction.
Author: Muhammad Husnain Ali
- Python - Primary programming language
- LangChain - Framework for building LLM applications
- OpenAI GPT-3.5 - Large Language Model for text processing
- Streamlit - Web application framework
- Pinecone - Vector database for similarity search
- Supabase - PostgreSQL database for conversation history
- PyPDF2 - PDF processing library
- OpenAI Embeddings - text-embedding-3-small model (512 dimensions)
- Vector Search - Semantic similarity matching
- Conversation Memory - Context-aware chat history
- Python Virtual Environment - Dependency isolation
- Environment Variables - Secure configuration management
- SQL - Database schema management
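The 512-dimension figure above comes from the `dimensions` parameter that the text-embedding-3 models support in the OpenAI API. As a minimal sketch of producing such an embedding (assuming the OpenAI Python SDK v1+; not necessarily how this project wires it up):

```python
# Minimal sketch: generating a 512-dimension embedding with the OpenAI SDK.
# Assumes OPENAI_API_KEY is set in the environment. The text-embedding-3
# models accept a `dimensions` parameter that shortens the output vector.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What does chapter 3 say about revenue?",
    dimensions=512,  # must match the Pinecone index dimension
)
vector = response.data[0].embedding
print(len(vector))  # 512
```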
- **Advanced PDF Processing**
  - Automatic text extraction and semantic chunking
  - Support for multiple PDF uploads
  - Intelligent document metadata preservation
  - OCR support for scanned documents
- **AI-Powered Question Answering**
  - Natural language understanding
  - Context-aware responses
  - Multi-document correlation
  - Source attribution with page numbers
- **Enterprise-Grade Vector Search**
  - High-performance similarity matching (see the query sketch after this list)
  - Scalable document indexing
  - Real-time search capabilities
  - Configurable search parameters
- **Smart Conversation Management**
  - Persistent chat history
  - Context retention across sessions
  - Conversation summarization
  - Multi-user support
- **Modern Web Interface**
  - Responsive design
  - Dark/Light mode support
  - Real-time updates
  - Mobile-friendly layout
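To make the vector-search feature concrete, here is a hypothetical similarity query using the current Pinecone Python client; the `top_k` value and metadata fields are illustrative assumptions, not taken from this project's `vector_store.py`:

```python
# Hypothetical similarity query against a Pinecone index. The metadata
# field names ("page") are assumptions for illustration only.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(os.environ["PINECONE_INDEX_NAME"])

question_embedding = [0.0] * 512  # stand-in; use a real 512-dim embedding here

results = index.query(
    vector=question_embedding,
    top_k=5,                 # number of chunks to retrieve
    include_metadata=True,   # return stored chunk text / page numbers
)
for match in results.matches:
    print(match.score, (match.metadata or {}).get("page"))
```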
- **Frontend**: Streamlit web interface
- **LLM**: OpenAI GPT-3.5-turbo for intelligent responses
- **Embeddings**: OpenAI text-embedding-3-small (512 dimensions)
- **Vector Store**: Pinecone for document similarity search
- **Memory**: Supabase PostgreSQL for conversation persistence
- **PDF Processing**: PyPDF2 with intelligent text chunking
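For the answer-generation step, here is a sketch of how retrieved chunks might be passed to GPT-3.5-turbo; the prompt wording is illustrative and `qa_system.py` may differ:

```python
# Sketch only: stuffing retrieved chunks into a GPT-3.5-turbo prompt.
# The system prompt is an assumption, not this project's actual prompt.
from openai import OpenAI

client = OpenAI()

def answer(question: str, chunks: list[str]) -> str:
    context = "\n\n".join(chunks)  # text of the top-k retrieved chunks
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "Cite page numbers when the context includes them."},
            {"role": "user",
             "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```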
- Python 3.8+
- OpenAI API key
- Pinecone account and API key
- Supabase project (for conversation memory)
```bash
# Clone the repository
git clone <repository-url>
cd ai-pdf-search-engine

# Create and activate virtual environment
# For Windows
python -m venv venv
.\venv\Scripts\activate

# For macOS/Linux
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```
Create a `.env` file in the project root:
```env
OPENAI_API_KEY=your_openai_api_key_here
PINECONE_API_KEY=your_pinecone_api_key_here
PINECONE_INDEX_NAME=your_index_name
EMBEDDING_MODEL=text-embedding-3-small
EMBEDDING_DIMENSION=512
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
```
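These values are typically loaded once at startup. A minimal sketch of what `config.py` plausibly does, assuming the `python-dotenv` package (the actual file may differ):

```python
# Sketch of loading the .env values above; variable names mirror the
# template, but this is an assumption about config.py, not its contents.
import os
from dotenv import load_dotenv

load_dotenv()  # pull values from the .env file into the environment

OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]        # required
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]    # required
PINECONE_INDEX_NAME = os.environ["PINECONE_INDEX_NAME"]
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-3-small")
EMBEDDING_DIMENSION = int(os.getenv("EMBEDDING_DIMENSION", "512"))
SUPABASE_URL = os.environ["SUPABASE_URL"]
SUPABASE_KEY = os.environ["SUPABASE_KEY"]
```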
- Navigate to your Supabase project dashboard
- Go to the SQL Editor
- Open the provided `setup_supabase.sql` file in the project root
- Execute the SQL commands to:
  - Create chat sessions and messages tables
  - Set up appropriate indexes
  - Enable Row Level Security (RLS)
  - Configure access policies
The SQL file includes all necessary table definitions, indexes, and security policies for the chat system.
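Once the schema exists, persisting and replaying a conversation from Python is straightforward with `supabase-py`. The snippet below is illustrative only; the table and column names (`chat_messages`, `session_id`, `role`, `content`, `created_at`) are assumptions, and the authoritative schema lives in `setup_supabase.sql`:

```python
# Illustrative only: storing and fetching chat turns with supabase-py.
# Table/column names are assumptions; see setup_supabase.sql for the
# actual schema used by this project.
import os
from supabase import create_client

supabase = create_client(os.environ["SUPABASE_URL"], os.environ["SUPABASE_KEY"])

# Persist one user turn
supabase.table("chat_messages").insert({
    "session_id": "demo-session",
    "role": "user",
    "content": "What does the contract say about termination?",
}).execute()

# Replay the session history in chronological order
history = (
    supabase.table("chat_messages")
    .select("role, content")
    .eq("session_id", "demo-session")
    .order("created_at")
    .execute()
)
print(history.data)
```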
```bash
# Make sure your virtual environment is activated
# For Windows
.\venv\Scripts\activate
# For macOS/Linux
source venv/bin/activate

# Run the application
streamlit run app.py
```

When you're done working on the project, you can deactivate the virtual environment:

```bash
deactivate
```
```
ai-pdf-search-engine/
├── app.py                 # Streamlit web interface
├── config.py              # Configuration and environment variables
├── pdf_processor.py       # PDF text extraction and chunking
├── vector_store.py        # Pinecone vector database integration
├── qa_system.py           # Question-answering logic
├── pdf_search_engine.py   # Main orchestration class
├── supabase_memory.py     # Conversation memory with Supabase
├── requirements.txt       # Python dependencies
├── .env.example           # Environment variables template
├── setup_supabase.sql     # Database schema for memory
├── .gitignore             # Git ignore configuration
└── README.md              # This file
```
```python
# config.py
CHUNK_SIZE = 1000     # Adjust based on document complexity
CHUNK_OVERLAP = 200   # Increase for better context preservation
MAX_DOCUMENTS = 100   # Limit concurrent documents
CACHE_TTL = 3600      # Cache lifetime in seconds
```
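As a rough illustration of how `CHUNK_SIZE` and `CHUNK_OVERLAP` drive chunking, here is a sketch assuming LangChain's `RecursiveCharacterTextSplitter`; `pdf_processor.py` may chunk differently:

```python
# Sketch only: how CHUNK_SIZE / CHUNK_OVERLAP plausibly feed LangChain's
# text splitter. The splitter choice is an assumption, not taken from
# this project's pdf_processor.py.
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,       # max characters per chunk
    chunk_overlap=CHUNK_OVERLAP, # characters shared between adjacent chunks
)

page_text = "Extracted PDF text goes here. " * 200  # stand-in input
chunks = splitter.split_text(page_text)
print(f"{len(chunks)} chunks of <= {CHUNK_SIZE} characters")
```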
- Recommended Pinecone tier: Standard or Enterprise
- Minimum RAM: 4GB
- Recommended CPU: 4 cores
- Storage: 10GB+ for document cache
- **PDF Processing Fails**
  - Ensure PDF is not password protected
  - Check file permissions
  - Verify PDF is not corrupted
- **Vector Store Errors**
  - Confirm Pinecone API key is valid
  - Check index dimensions match configuration (see the dimension check after this list)
  - Verify network connectivity
- **Memory Issues**
  - Clear browser cache
  - Restart application
  - Check Supabase connection
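For the index-dimension issue above, a quick sanity check (assuming the v3+ `pinecone` client) confirms the index was created with 512 dimensions to match `EMBEDDING_DIMENSION`:

```python
# Quick check that the Pinecone index dimension matches EMBEDDING_DIMENSION.
# A mismatch between the index and the embedding size causes upsert/query
# errors at runtime.
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
desc = pc.describe_index(os.environ["PINECONE_INDEX_NAME"])
print("index dimension:", desc.dimension)  # should print 512
```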
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit changes (`git commit -m 'Add AmazingFeature'`)
- Push to branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- OpenAI team for their powerful language models
- Pinecone for vector search capabilities
- Supabase team for the excellent database platform
- LangChain community for the framework
- All contributors and users of this project
- 📧 Email: [email protected]
- 📝 Issues: GitHub Issues
Made with ❤️ by Muhammad Husnain Ali