This tool crawls a specific web path (and its subdirectories only), downloads all PDFs, extracts searchable text (even from scanned PDFs using OCR), and lets you search contents via CLI using full-text search (Whoosh).
- 🌐 Crawls only the given path and subdirectories — never leaks upward.
- 📥 Downloads `.pdf` files concurrently (async + multithreaded).
- 🧠 Extracts text from:
  - Standard digital PDFs (using PyMuPDF).
  - Scanned/image PDFs (using Tesseract OCR + pdf2image).
- 🔎 Indexes all extracted text with Whoosh for fast full-text search.
- 🖥️ Command-line search interface that shows the filename, path, and a matching snippet for each hit.
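
The "never leaks upward" rule can be illustrated with a small check. This is only a sketch, not the code in `crawler.py`: the function name `is_within_base` is hypothetical, and it assumes the base URL carries no query string.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def is_within_base(base_url: str, link: str) -> bool:
    """Return True if `link` resolves to the base path or something below it.

    Hypothetical helper for illustration; assumes base_url has no query string.
    """
    base = urlparse(base_url)
    # Treat the base as a directory so relative links resolve underneath it.
    base_dir = base_url if base.path.endswith("/") else base_url + "/"
    target = urlparse(urljoin(base_dir, link))

    # Normalise both paths so "a/../b" tricks cannot escape the base path.
    prefix = posixpath.normpath(base.path or "/").rstrip("/") + "/"
    target_path = posixpath.normpath(target.path or "/")

    same_site = (target.scheme, target.netloc) == (base.scheme, base.netloc)
    # The appended "/" lets the base directory itself match its own prefix
    # while keeping "/files-old" from matching "/files/".
    return same_site and (target_path + "/").startswith(prefix)
```

A crawler could call this on every discovered link and skip anything that returns `False`, for example `is_within_base("https://example.com/files/", "../secret.pdf")`.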
- Python 3.8+
- OS: macOS / Linux / Windows
git clone https://github.com/koushikEng/web-ocr2.git
cd web-ocr2
pip install -r requirements.txt

Or manually:

pip install aiohttp beautifulsoup4 pytesseract pdf2image PyMuPDF whoosh

Tesseract OCR:
- macOS: `brew install tesseract`
- Ubuntu: `sudo apt install tesseract-ocr`
- Windows: Download installer → Add to PATH

Poppler (needed by pdf2image):
- macOS: `brew install poppler`
- Ubuntu: `sudo apt install poppler-utils`
- Windows: Download binaries → Add `bin/` to PATH
python main.py

- Prompts you for a website path like: https://example.com/files/
- Crawls only within `/files/` and its subfolders.
- Downloads all `.pdf` files into `./downloads/`
- Extracts text using:
  - Native text extraction (PyMuPDF)
  - Fallback OCR (Tesseract)
- Indexes the files for fast search using Whoosh.
- Launches an interactive CLI:

🔍 Type your search queries below (type 'exit' to quit):
Search > climate report 2023
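
The extraction step (native text first, OCR only as a fallback) might look roughly like the sketch below. The function names and the `min_chars` threshold are assumptions for illustration, not the actual contents of `ocr_engine.py`. The PyMuPDF, pdf2image and pytesseract imports are done lazily so the fallback logic can be exercised on its own.

```python
def native_text(pdf_path: str) -> str:
    """Read the embedded text layer of a digital PDF with PyMuPDF."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


def ocr_text(pdf_path: str) -> str:
    """Render each page to an image (pdf2image, needs Poppler) and OCR it."""
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(image) for image in pages)


def extract_text(pdf_path, native=native_text, ocr=ocr_text, min_chars=1):
    """Return native text if present, otherwise fall back to OCR.

    `min_chars` is a hypothetical knob: raise it to treat PDFs that contain
    only a few stray characters as scanned documents.
    """
    text = native(pdf_path)
    if len(text.strip()) >= min_chars:
        return text
    return ocr(pdf_path)
```

Passing the two extractors as parameters keeps the decision logic separate from the heavy dependencies, which also makes it easy to swap in a different OCR backend.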
web-ocr2/
├── main.py # Orchestrates everything
├── crawler.py # Async crawler (restricted to base path)
├── downloader.py # Async downloader with concurrency
├── ocr_engine.py # Extracts text from PDFs (with OCR fallback)
├── indexer.py # Whoosh indexing & search
├── search_cli.py # Command-line interface
├── downloads/ # All downloaded files stored here
├── index/ # Whoosh index files stored here
├── README.md
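
As an illustration of what "async downloader with concurrency" can mean, here is a minimal sketch of bounded concurrency using `asyncio.Semaphore`. It is not the code in `downloader.py`: `download_all`, the `fetch` coroutine and the `limit` parameter are hypothetical names, and `fetch` stands in for whatever actually performs the HTTP request (for example an aiohttp call).

```python
import asyncio


async def download_all(urls, fetch, limit=5):
    """Run `fetch(url)` for every URL with at most `limit` requests in flight.

    `fetch` is an assumed coroutine function that returns the file's bytes.
    Returns a dict mapping each URL to whatever `fetch` returned.
    """
    semaphore = asyncio.Semaphore(limit)

    async def worker(url):
        # The semaphore blocks here once `limit` downloads are running.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(worker(url) for url in urls))
    return dict(pairs)
```

Capping the number of simultaneous requests keeps the crawler polite toward the target server while still overlapping network waits.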
- OCR is only used if native PDF text is missing.
- You can adjust file types (`.pdf`) in `main.py` and `crawler.py` for `.jpg`, `.png`, etc.
- Modify `search_index()` in `indexer.py` to improve query fuzziness or scoring.
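
Widening the accepted file types could be done with a small extension filter like the one below. This is a sketch under assumptions: `ALLOWED_EXTENSIONS` and `has_allowed_extension` are hypothetical names, and the real check in `main.py` and `crawler.py` may be written differently.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical setting: extend this set to pick up image files as well.
ALLOWED_EXTENSIONS = {".pdf", ".jpg", ".png"}


def has_allowed_extension(url, allowed=ALLOWED_EXTENSIONS):
    """Return True if the URL path ends with one of the allowed extensions."""
    # Only the path is inspected, so query strings and fragments are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in allowed
```

Image files have no text layer, so they would need to go straight to the OCR step rather than through native PDF extraction.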
MIT — use freely, modify wildly, credit kindly.
Built by someone who prefers tools that just work, fast.