Lightweight PDF Q&A tool powered by RAG (Retrieval-Augmented Generation) with MCP (Model Context Protocol) Support.

MIMICLab/DocsRay

DocsRay

🌐 Live Demo (Base Model)

DocsRay is a universal document question-answering system that combines embedding models and multimodal LLMs with a Coarse-to-Fine search (RAG) approach. It integrates with Claude Desktop through MCP (Model Context Protocol) and includes directory management, visual content analysis, and a hybrid OCR system.

🚀 Quick Start

DocsRay now features automatic setup! Simply install and it will handle dependencies and download the lite model automatically.

# Install DocsRay
pip install docsray

That's it! DocsRay will automatically:

  • Install system dependencies
  • Download the lite model (~3GB)
  • Configure the environment

Manual Setup (if automatic setup fails)

If the automatic setup doesn't work properly, you can run the setup manually:

# 1. Install DocsRay
pip install docsray

# 2. Run setup (REQUIRED)
docsray setup
# This will:
# - Detect your GPU (NVIDIA CUDA, Apple Metal, or CPU)
# - Install the optimized llama-cpp-python for your platform
# - Install ffmpeg for audio/video processing
# - Show additional recommendations for your OS (including LibreOffice)

# 3. Download models (default: lite)
docsray download-models --model-type lite   # 4b model (~3GB)
# docsray download-models --model-type base  # 12b model (~8GB) 
# docsray download-models --model-type pro   # 27b model (~16GB)

# 4. (Optional but Recommended) Install LibreOffice for better office document conversion
# The setup command above will show you the appropriate installation command for your OS:
# Ubuntu/Debian: sudo apt-get install libreoffice libreoffice-l10n-ko
# macOS: brew install libreoffice
# Windows: Download from https://www.libreoffice.org/download/
# For HWP support, install h2orestart extension: https://extensions.libreoffice.org/en/extensions/show/27504

Optional Components

Audio/Video Processing (Optional)

# FFmpeg for audio/video processing
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# CentOS/RHEL
sudo yum install epel-release
sudo yum install ffmpeg

# Windows (via Chocolatey)
choco install ffmpeg

Additional Format Support

# For pandoc-based conversions
# Ubuntu/Debian
sudo apt-get install pandoc

# macOS
brew install pandoc

# For Korean fonts (better HWP rendering)
# Ubuntu/Debian
sudo apt-get install fonts-nanum fonts-nanum-coding fonts-nanum-extra

Tesseract OCR (for enhanced OCR performance)

# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-kor

# macOS
brew install tesseract tesseract-lang

Start Using DocsRay

docsray web                                 # Launch Web UI
docsray api                                 # Start API server
docsray configure-claude                    # MCP for Claude Desktop

📋 Core Features

  • 🧠 Advanced RAG System: Coarse-to-Fine search for accurate document retrieval
  • πŸ‘οΈ Multimodal AI: Visual content analysis using Gemma-3 vision capabilities
  • πŸ”„ Hybrid OCR: Intelligent selection between AI-powered OCR and Pytesseract
  • ⚑ Adaptive Performance: Automatically optimizes based on system resources
  • 🎯 Flexible Model Selection: Choose between lite (4b), base (12b), and pro (27b) models
  • πŸ”Œ MCP Integration: Seamless integration with Claude Desktop
  • 🌐 Multiple Interfaces: Web UI, API server, CLI, and MCP server
  • πŸ“ Universal Document Support: 30+ file formats with automatic conversion
  • 🌍 Multi-Language: Korean, English, and other languages supported

🎯 What's New

v1.9.0: Enhanced Document Conversion

  • LibreOffice Integration: Better quality conversions for Office documents when LibreOffice is installed
  • Improved Format Support: Enhanced handling of DOCX, XLSX, PPTX, ODT, ODS, ODP, HWP/HWPX

v1.8.0: Multimedia Support

  • Video/Audio Processing: Extract and analyze content from video and audio files
  • Automatic Setup: DocsRay now automatically installs dependencies and downloads models

Recent Updates

  • Auto-restart capability for all servers
  • Enhanced embedding method (v1.7.0) - requires reindexing existing documents

For a detailed changelog, see CHANGELOG.md.

📖 Usage Guide

Model Management

# Download specific model type
docsray download-models --model-type lite   # Fast, lower quality
docsray download-models --model-type base   # Balanced performance
docsray download-models --model-type pro    # Best quality, slower

# Force re-download existing models
docsray download-models --model-type base --force

# Check model status
docsray download-models --check

Document Processing

# Process any document type
docsray process document.pdf --model-type base
docsray process report.docx --timeout 300
docsray process spreadsheet.xlsx --no-visuals

# Ask questions about processed documents
docsray ask document.pdf "What are the key findings?"
docsray ask report.docx "Summarize the conclusions" --model-type pro
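
When the CLI is driven from a script, it can help to build the argument list in one place. The following is a minimal sketch that only uses the `docsray process` flags shown above (`--model-type`, `--timeout`, `--no-visuals`); the helper name `process_command` is hypothetical, and combining several flags in one invocation is an assumption, since the examples above show them separately.

```python
import shlex


def process_command(path, model_type=None, timeout=None, visuals=True):
    """Build the argument list for `docsray process` (hypothetical helper)."""
    cmd = ["docsray", "process", path]
    if model_type is not None:
        cmd += ["--model-type", model_type]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if not visuals:
        cmd.append("--no-visuals")
    return cmd


cmd = process_command("report.docx", timeout=300, visuals=False)
print(shlex.join(cmd))

# To actually run it (requires DocsRay to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` instead of a shell string avoids quoting problems with file names that contain spaces.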

Web Interface

# Basic web interface
docsray web

# Advanced options
docsray web --model-type base --port 8080
docsray web --auto-restart                    # Auto-restart with unlimited retries
docsray web --auto-restart --max-retries 5    # Auto-restart with 5 retry limit
docsray web --timeout 300 --pages 10          # Process max 10 pages, 5min timeout

API Server

# Start API server
docsray api --port 8000

# With auto-restart and timeout
docsray api --auto-restart                     # Unlimited retries
docsray api --auto-restart --timeout 600       # 10min timeout per request

# API accepts document paths per request
curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{
    "document_path": "/path/to/document.pdf",
    "question": "What is the main topic?",
    "use_coarse_search": true
  }'

# Check cache info and clear if needed
curl http://localhost:8000/cache/info
curl -X POST http://localhost:8000/cache/clear
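
The same `/ask` request can be prepared from Python with the standard library. This sketch mirrors the curl call above (same endpoint, header, and JSON fields); the helper name `build_ask_request` is hypothetical, and the shape of the server's response is not documented here, so the code prints it without assuming any keys.

```python
import json
import urllib.request


def build_ask_request(document_path, question, use_coarse_search=True,
                      base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /ask endpoint."""
    payload = {
        "document_path": document_path,
        "question": question,
        "use_coarse_search": use_coarse_search,
    }
    return urllib.request.Request(
        f"{base_url}/ask",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_ask_request("/path/to/document.pdf", "What is the main topic?")

# To send it against a running `docsray api` server:
# with urllib.request.urlopen(req, timeout=600) as resp:
#     print(json.load(resp))
```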

Performance Testing

# Basic performance test
docsray perf-test document.pdf "What is this about?"

# Advanced testing
docsray perf-test document.pdf "Analyze key points" \
  --iterations 5 --port 8000 --host localhost

# With custom timeout
docsray perf-test document.pdf "What is this?" --timeout 600

MCP Integration (Claude Desktop)

# Configure Claude Desktop
docsray configure-claude

# Start MCP server
docsray mcp --auto-restart

📁 Supported File Formats

Office Documents: Word (.docx, .doc), Excel (.xlsx, .xls), PowerPoint (.pptx, .ppt)
Text Formats: Plain Text (.txt), Markdown (.md), HTML (.html)
Images: JPEG, PNG, GIF, BMP, TIFF, WebP
Korean Documents: HWP (.hwp, .hwpx)
PDFs: Native PDF support with visual analysis
Audio: MP3, WAV, M4A, FLAC, OGG, WMA, AAC (requires ffmpeg)
Video: MP4, AVI, MOV, WMV, FLV, MKV, WebM, M4V, MPG, MPEG (requires ffmpeg)

🛠️ Advanced Configuration

Environment Variables

export DOCSRAY_MODEL_TYPE=base           # Set default model type
export DOCSRAY_DISABLE_VISUALS=1         # Disable visual analysis
export DOCSRAY_DEBUG=1                   # Enable debug mode
export DOCSRAY_HOME=/custom/path         # Custom data directory
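
If DocsRay is launched from a Python script rather than a shell, the same variables can be set on the child process only. This is a minimal sketch: the variable names come from the list above, while the helper `docsray_env` and its parameters are hypothetical, and it assumes DocsRay reads these variables at startup.

```python
import os

VALID_MODEL_TYPES = ("lite", "base", "pro")


def docsray_env(model_type=None, disable_visuals=False, debug=False,
                home=None, base=None):
    """Return an environment dict with the DocsRay variables applied.

    `base` defaults to a copy of the current environment; pass {} to start empty.
    """
    env = dict(os.environ if base is None else base)
    if model_type is not None:
        if model_type not in VALID_MODEL_TYPES:
            raise ValueError(f"unknown model type: {model_type!r}")
        env["DOCSRAY_MODEL_TYPE"] = model_type
    if disable_visuals:
        env["DOCSRAY_DISABLE_VISUALS"] = "1"
    if debug:
        env["DOCSRAY_DEBUG"] = "1"
    if home is not None:
        env["DOCSRAY_HOME"] = str(home)
    return env


env = docsray_env(model_type="base", debug=True, base={})
print(env)

# To launch the web UI with these settings (requires DocsRay to be installed):
# import subprocess
# subprocess.run(["docsray", "web"], env=docsray_env(model_type="base", debug=True))
```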

Python API

from docsray import PDFChatBot
from docsray.scripts import pdf_extractor, chunker, build_index, section_rep_builder

# Process document
extracted = pdf_extractor.extract_content("document.pdf", analyze_visuals=True)
chunks = chunker.process_extracted_file(extracted)
chunk_index = build_index.build_chunk_index(chunks)
sections = section_rep_builder.build_section_reps(extracted["sections"], chunk_index)

# Create chatbot and ask questions
chatbot = PDFChatBot(sections, chunk_index)
answer, references = chatbot.answer("What are the key points?")

🔧 System Requirements

Hardware Requirements

  • CPU Mode: Any system with 4GB+ RAM
  • GPU Acceleration: CUDA-compatible GPU or Apple Silicon (MPS)
  • Storage: 3-16GB depending on model type chosen
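
To check the storage requirement before downloading, free disk space can be compared with the approximate model sizes from the Quick Start (lite ~3GB, base ~8GB, pro ~16GB). This is a rough sketch: the helper `models_that_fit` and the 1GB headroom are hypothetical, and the sizes are the approximate figures quoted above, not exact download sizes.

```python
import shutil

# Approximate sizes in GB, taken from the download-models comments above.
MODEL_SIZES_GB = {"lite": 3, "base": 8, "pro": 16}


def models_that_fit(free_gb, headroom_gb=1.0):
    """Return the model types whose approximate size plus headroom fits."""
    return [name for name, size in MODEL_SIZES_GB.items()
            if size + headroom_gb <= free_gb]


# Free space on the drive that holds the current directory.
free_gb = shutil.disk_usage(".").free / 1e9
print(f"{free_gb:.1f} GB free, candidates: {models_that_fit(free_gb)}")
```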

Performance Modes (Auto-detected)

System Memory | Mode         | Models        | Max Tokens
< 16GB        | FAST         | Q4 quantized  | 8K
16-32GB       | STANDARD     | Q8 quantized  | 16K
> 32GB        | FULL_FEATURE | F16 precision | 32K
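
The table above can be read as a simple threshold lookup. The sketch below encodes it as written; it is not DocsRay's detection code. The function name `performance_mode` is hypothetical, and treating exactly 16GB and exactly 32GB as STANDARD is an assumption based on the "16-32GB" row.

```python
def performance_mode(memory_gb):
    """Map system memory in GB to (mode, models, max tokens) per the table."""
    if memory_gb < 16:
        return ("FAST", "Q4 quantized", "8K")
    if memory_gb <= 32:
        return ("STANDARD", "Q8 quantized", "16K")
    return ("FULL_FEATURE", "F16 precision", "32K")


for gb in (8, 16, 32, 64):
    print(gb, performance_mode(gb))
```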

🐛 Troubleshooting

Common Issues

# Check system status
docsray download-models --check

# Re-download corrupted models
docsray download-models --force

# Debug mode for detailed logs
DOCSRAY_DEBUG=1 docsray web

Performance Issues

  • Use --model-type lite for faster processing
  • Pass --no-visuals to skip visual analysis for text-only documents
  • Increase --timeout for large documents
  • Use auto-restart for stability: --auto-restart

📊 Performance Benchmarks

Run your own benchmarks:

# Test API performance
docsray perf-test document.pdf "test question" --iterations 10

# Compare model types
docsray perf-test document.pdf "test question" --model-type lite
docsray perf-test document.pdf "test question" --model-type base

🀝 Contributing

We welcome contributions! Use our GitHub repository for:

  • Bug reports and feature requests
  • Code contributions and pull requests
  • Documentation improvements

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Open Source Dependencies

DocsRay is built on top of these excellent open source projects:

  • llama.cpp - GGML/GGUF model inference (MIT License)
  • PyMuPDF - PDF processing (AGPL-3.0 License)
  • pdfplumber - PDF text extraction (MIT License)
  • FastAPI - Web framework (MIT License)
  • Gradio - Web UI components (Apache-2.0 License)
  • OpenCV - Image processing (Apache-2.0 License)
  • faster-whisper - Audio transcription (MIT License)
  • Pandas - Data manipulation (BSD-3-Clause License)
  • NumPy - Numerical computing (BSD-3-Clause License)
  • scikit-learn - Machine learning utilities (BSD-3-Clause License)
