🕵️‍♂️ Web OCR + Search CLI Tool

This tool crawls a specific web path (and its subdirectories only), downloads all PDFs, extracts searchable text (even from scanned PDFs, using OCR), and lets you search their contents from the CLI with full-text search (Whoosh).


⚙️ Features

  • 🌐 Crawls only the given path and its subdirectories — never leaks upward (see the sketch after this list).
  • 📥 Downloads .pdf files concurrently (async + multithreaded).
  • 🧠 Extracts text from:
    • Standard digital PDFs (using PyMuPDF).
    • Scanned/image PDFs (using Tesseract OCR + pdf2image).
  • 🔎 Indexes all content using Whoosh for lightning-fast text search.
  • 🖥️ Command-line search interface with filename, path, and snippet.
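
The path restriction in the first feature boils down to a prefix check on resolved URLs. A minimal sketch of the idea, assuming aiohttp + BeautifulSoup; the function and variable names below are illustrative and not the actual API of crawler.py:

from typing import List
from urllib.parse import urljoin, urlparse

import aiohttp
from bs4 import BeautifulSoup

def within_base(base_url: str, candidate: str) -> bool:
    """Keep a URL only if it lives under the base path (never leaks upward)."""
    base, cand = urlparse(base_url), urlparse(candidate)
    return cand.netloc == base.netloc and cand.path.startswith(base.path)

async def collect_links(session: aiohttp.ClientSession, base_url: str, page_url: str) -> List[str]:
    """Fetch one page and return only the links that stay within the base path."""
    async with session.get(page_url) as resp:
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    links = (urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True))
    return [url for url in links if within_base(base_url, url)]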

🧪 Tested On

  • Python 3.8+
  • OS: macOS / Linux / Windows

📦 Installation

1. Clone the repo

git clone https://github.com/koushikEng/web-ocr2.git
cd web-ocr2

2. Install Python dependencies

pip install -r requirements.txt

Or manually:

pip install aiohttp beautifulsoup4 pytesseract pdf2image PyMuPDF whoosh

3. Install system dependencies

🧠 Tesseract OCR

  • macOS: brew install tesseract
  • Ubuntu: sudo apt install tesseract-ocr
  • Windows: Download installer → Add to PATH

📄 Poppler (for PDF to image)

  • macOS: brew install poppler
  • Ubuntu: sudo apt install poppler-utils
  • Windows: Download binaries → Add bin/ to PATH
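
Both binaries must end up on your PATH. A quick way to verify them from Python before running the tool (a convenience snippet, not part of this repo):

import shutil
import pytesseract

# Tesseract: raises TesseractNotFoundError if the binary is missing.
print("Tesseract version:", pytesseract.get_tesseract_version())

# Poppler: pdf2image shells out to pdftoppm/pdftocairo under the hood.
for tool in ("pdftoppm", "pdftocairo"):
    print(tool, "->", shutil.which(tool) or "NOT FOUND")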

🛠️ Usage

Run the app

python main.py

What it does:

  1. Prompts you for a website path like:

    https://example.com/files/
    
  2. Crawls only within /files/ and its subfolders.

  3. Downloads all .pdf files into ./downloads/

  4. Extracts text using:

    • Native text extraction (PyMuPDF)
    • Fallback OCR (Tesseract), sketched after this list
  5. Indexes the files for fast search using Whoosh.

  6. Launches an interactive CLI:

    🔍 Type your search queries below (type 'exit' to quit):
    Search > climate report 2023
    

📂 Project Structure

web-ocr2/
├── main.py             # Orchestrates everything
├── crawler.py          # Async crawler (restricted to base path)
├── downloader.py       # Async downloader with concurrency (sketch below)
├── ocr_engine.py       # Extracts text from PDFs (with OCR fallback)
├── indexer.py          # Whoosh indexing & search
├── search_cli.py       # Command-line interface
├── downloads/          # All downloaded files stored here
├── index/              # Whoosh index files stored here
└── README.md
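
For the concurrent download step handled by downloader.py, the core idea is bounded async fetching. An illustrative sketch only, assuming aiohttp and a simple semaphore cap; this is not the repo's actual code:

import asyncio
import os
from typing import List

import aiohttp

async def download_all(urls: List[str], dest_dir: str = "downloads", limit: int = 8) -> None:
    """Download every URL concurrently, capping the number of in-flight requests."""
    os.makedirs(dest_dir, exist_ok=True)
    sem = asyncio.Semaphore(limit)

    async def fetch(session: aiohttp.ClientSession, url: str) -> None:
        async with sem, session.get(url) as resp:
            resp.raise_for_status()
            path = os.path.join(dest_dir, os.path.basename(url))
            with open(path, "wb") as fh:
                fh.write(await resp.read())

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, u) for u in urls))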

✅ Notes & Tips

  • OCR is only used if native PDF text is missing.
  • You can adjust file types (.pdf) in main.py and crawler.py for .jpg, .png, etc.
  • Modify search_index() in indexer.py to improve query fuzziness or scoring.
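
For the last tip, here is one hedged example of a fuzzier Whoosh query. The index directory and the "content" field name are assumptions for illustration and may not match what indexer.py actually uses:

from whoosh.index import open_dir
from whoosh.qparser import QueryParser, FuzzyTermPlugin

ix = open_dir("index")  # assumed index directory
parser = QueryParser("content", schema=ix.schema)  # assumed field name
parser.add_plugin(FuzzyTermPlugin())  # enables fuzzy terms like "climate~2"

with ix.searcher() as searcher:
    query = parser.parse("climate~1 report 2023")
    for hit in searcher.search(query, limit=10):
        print(dict(hit))  # stored fields for each match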

🤝 License

MIT — use freely, modify wildly, credit kindly.


🙋‍♂️ Author

Built by someone who prefers tools that just work, fast.
