🕵️‍♂️ Web OCR + Search CLI Tool

This tool crawls a specific web path (and its subdirectories only), downloads all PDFs, extracts searchable text (even from scanned PDFs, using OCR), and lets you search their contents from the CLI with full-text search (Whoosh).


⚙️ Features

  • 🌐 Crawls only the given path and its subdirectories — never leaks upward (see the sketch after this list).
  • 📥 Downloads .pdf files concurrently (async + multithreaded).
  • 🧠 Extracts text from:
    • Standard digital PDFs (using PyMuPDF).
    • Scanned/image PDFs (using Tesseract OCR + pdf2image).
  • 🔎 Indexes all content using Whoosh for lightning-fast text search.
  • 🖥️ Command-line search interface with filename, path, and snippet.
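
The path restriction in the first feature boils down to a prefix check on resolved URLs. A minimal sketch of the idea, assuming aiohttp + BeautifulSoup; the function and variable names below are illustrative and not the actual API of crawler.py:

from typing import List
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup

def within_base(base_url: str, candidate: str) -> bool:
    """Keep a URL only if it lives under the base path (never leaks upward)."""
    base, cand = urlparse(base_url), urlparse(candidate)
    return cand.netloc == base.netloc and cand.path.startswith(base.path)

async def collect_links(session: aiohttp.ClientSession, base_url: str, page_url: str) -> List[str]:
    """Fetch one page and return only the links that stay within the base path."""
    async with session.get(page_url) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    links = (urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True))
    return [url for url in links if within_base(base_url, url)]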

🧪 Tested On

  • Python 3.8+
  • OS: macOS / Linux / Windows

📦 Installation

1. Clone the repo

git clone https://github.com/koushikEng/web-ocr2.git
cd web-ocr2

2. Install Python dependencies

pip install -r requirements.txt

Or manually:

pip install aiohttp beautifulsoup4 pytesseract pdf2image PyMuPDF whoosh

3. Install system dependencies

🧠 Tesseract OCR

  • macOS: brew install tesseract
  • Ubuntu: sudo apt install tesseract-ocr
  • Windows: Download installer → Add to PATH

📄 Poppler (for PDF to image)

  • macOS: brew install poppler
  • Ubuntu: sudo apt install poppler-utils
  • Windows: Download binaries → Add bin/ to PATH
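
Both binaries must end up on your PATH. A quick way to verify them from Python before running the tool (a convenience snippet, not part of this repo):

import shutil
import pytesseract

# Tesseract: raises TesseractNotFoundError if the binary is missing.
print("Tesseract version:", pytesseract.get_tesseract_version())

# Poppler: pdf2image shells out to pdftoppm/pdftocairo under the hood.
for tool in ("pdftoppm", "pdftocairo"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")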

🛠️ Usage

Run the app

python main.py

What it does:

  1. Prompts you for a website path like:

    https://example.com/files/
    
  2. Crawls only within /files/ and its subfolders.

  3. Downloads all .pdf files into ./downloads/

  4. Extracts text using:

    • Native text extraction (PyMuPDF)
    • Fallback OCR (Tesseract), sketched after this list
  5. Indexes the files for fast search using Whoosh.

  6. Launches an interactive CLI:

    🔍 Type your search queries below (type 'exit' to quit):
    Search > climate report 2023
    

📂 Project Structure

web-ocr2/
├── main.py             # Orchestrates everything
├── crawler.py          # Async crawler (restricted to base path)
├── downloader.py       # Async downloader with concurrency (sketch below)
├── ocr_engine.py       # Extracts text from PDFs (with OCR fallback)
├── indexer.py          # Whoosh indexing & search
├── search_cli.py       # Command-line interface
├── downloads/          # All downloaded files stored here
├── index/              # Whoosh index files stored here
└── README.md
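
For the concurrent download step handled by downloader.py, the core idea is bounded async fetching. An illustrative sketch only, assuming aiohttp and a simple semaphore cap; this is not the repo's actual code:

import asyncio
import os
from typing import List

import aiohttp

async def download_all(urls: List[str], dest_dir: str = "downloads", limit: int = 8) -> None:
    """Download every URL concurrently, capping the number of in-flight requests."""
    os.makedirs(dest_dir, exist_ok=True)
    sem = asyncio.Semaphore(limit)

    async def fetch(session: aiohttp.ClientSession, url: str) -> None:
        async with sem, session.get(url) as resp:
            resp.raise_for_status()
            path = os.path.join(dest_dir, os.path.basename(url))
            with open(path, "wb") as fh:
                fh.write(await resp.read())

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, u) for u in urls))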

✅ Notes & Tips

  • OCR is only used if native PDF text is missing.
  • You can adjust file types (.pdf) in main.py and crawler.py for .jpg, .png, etc.
  • Modify search_index() in indexer.py to improve query fuzziness or scoring.
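
For the last tip, here is one hedged example of a fuzzier Whoosh query. The index directory and the "content" field name are assumptions for illustration and may not match what indexer.py actually uses:

from whoosh.index import open_dir
from whoosh.qparser import QueryParser, FuzzyTermPlugin

ix = open_dir("index")  # assumed index directory
parser = QueryParser("content", schema=ix.schema)  # assumed field name
parser.add_plugin(FuzzyTermPlugin())  # enables fuzzy terms like "climate~2"

with ix.searcher() as searcher:
    query = parser.parse("climate~1 report 2023")
    for hit in searcher.search(query, limit=10):
        print(dict(hit))  # stored fields for each match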

🤝 License

MIT — use freely, modify wildly, credit kindly.


🙋‍♂️ Author

Built by someone who prefers tools that just work, fast.
