Website Downloader CLI is a tiny, pure-Python site-mirroring tool that lets you grab a complete, browsable offline copy of any publicly reachable website:
- Recursively crawls every same-origin link (including “pretty” `/about/` URLs)
- Downloads all assets (images, CSS, JS, …)
- Rewrites internal links so pages open flawlessly from your local disk
- Streams files concurrently with automatic retry / back-off
- Generates a clean, flat directory tree (`example_com/index.html`, `example_com/about/index.html`, …); see the path-mapping sketch below
- Handles extremely long filenames safely via hashing and graceful fallbacks
Perfect for web archiving, pentesting labs, long flights, or just poking around a site without an internet connection.
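Under the hood, every crawled URL has to be mapped onto a local file before links are rewritten. The helper below is a minimal sketch of that mapping, assuming the naming scheme described above (pretty-URL folders get an `index.html` stub, extension-less paths gain a `.html` suffix); the function name `url_to_local_path` is illustrative, not the script's actual API.

```python
from pathlib import Path
from urllib.parse import urlparse


def url_to_local_path(url: str, root: Path) -> Path:
    """Map a remote URL onto the flat local tree (illustrative only).

    https://example.com/        -> example_com/index.html
    https://example.com/about/  -> example_com/about/index.html
    https://example.com/blog    -> example_com/blog.html
    https://example.com/a.css   -> example_com/a.css
    """
    parsed = urlparse(url)
    site_dir = root / parsed.netloc.replace(".", "_").replace(":", "_")
    path = parsed.path

    if path in ("", "/"):
        return site_dir / "index.html"
    if path.endswith("/"):
        # "Pretty" URLs become folders containing an index.html stub.
        return site_dir / path.strip("/") / "index.html"
    if "." in path.rsplit("/", 1)[-1]:
        # Paths that already carry an extension are kept as-is.
        return site_dir / path.lstrip("/")
    # Extension-less pages gain a .html suffix so they open offline.
    return site_dir / (path.lstrip("/") + ".html")
```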
# 1. Grab the code
git clone https://github.com/PKHarsimran/website-downloader.git
cd website-downloader
# 2. Install dependencies (only two runtime libs!)
pip install -r requirements.txt
# 3. Mirror a site – no prompts needed
python website-downloader.py \
    --url https://harsim.ca \
    --destination harsim_ca_backup \
    --max-pages 100 \
    --threads 8

| Library | Emoji | Purpose in this project |
|---|---|---|
| `requests` + `urllib3.Retry` | 🌐 | High-level HTTP client with automatic retry / back-off for flaky hosts (see the session sketch after this table) |
| BeautifulSoup (`bs4`) | 🍜 | Parses downloaded HTML and extracts every `<a>`, `<img>`, `<script>`, and `<link>` |
| `argparse` | 🛠️ | Powers the modern CLI (`--url`, `--destination`, `--max-pages`, `--threads`, …) |
| logging | 📝 | Dual console / file logging with colour + crawl-time stats | 
| threading & queue | ⚙️ | Lightweight thread-pool that streams images/CSS/JS concurrently | 
| `pathlib` & `os` | 📂 | Cross-platform file-system helpers (`Path` magic, directory creation, etc.) |
| time | ⏱️ | Measures per-page latency and total crawl duration | 
| urllib.parse | 🔗 | Safely joins / analyses URLs and rewrites them to local relative paths | 
| `sys` | 🖥️ | Directs log output to `stdout` and handles graceful interrupts (Ctrl-C) |
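As a concrete illustration of the first table row, here is a minimal sketch of a `requests.Session` wired up with `urllib3`'s `Retry`. The retry budget, back-off factor, and status list are assumptions chosen for the example, not necessarily the values used by the script.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session() -> requests.Session:
    """Create a Session that retries transient failures with back-off.

    The numbers below are illustrative defaults, not the script's settings.
    """
    retry = Retry(
        total=5,                                  # overall retry budget per request
        backoff_factor=0.5,                       # 0.5s, 1s, 2s, ... between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```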
| Path | What it is | Key features | 
|---|---|---|
| website_downloader.py | Single-entry CLI that performs the entire crawl and link-rewriting pipeline. | • Persistent `requests.Session` with automatic retries • Breadth-first crawl capped by `--max-pages` (default = 50) • Thread-pool (configurable via `--threads`, default = 6) to fetch images/CSS/JS in parallel • Robust link rewriting so every internal URL works offline (pretty-URL folders ➜ `index.html`, plain paths ➜ `.html`) • Smart output folder naming (`example.com` → `example_com`) • Colourised console + file logging with per-page latency and crawl summary (see the crawl sketch after this table) |
| requirements.txt | Minimal dependency pin-list. Only `requests` and `beautifulsoup4` are third-party; everything else is Python ≥ 3.10 std-lib. | |
| web_scraper.log | Auto-generated run log (rotates/overwrites on each invocation). Useful for troubleshooting or audit trails. | |
| README.md | The document you’re reading – quick-start, flags, and architecture notes. | |
| (output folder) | Created at runtime (`example_com/` …) – mirrors the remote directory tree with `index.html` stubs and all static assets. | |
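The bullet points in the `website_downloader.py` row describe a breadth-first crawl feeding a small asset thread-pool. The sketch below shows one way that control flow could be structured with `threading` and `queue`; the helpers `fetch_page`, `extract`, and `save_asset` are placeholders standing in for the script's real functions.

```python
import queue
import threading
from collections import deque


def crawl(start_url, fetch_page, extract, save_asset, max_pages=50, threads=6):
    """Breadth-first crawl capped at max_pages, with a thread-pool for assets.

    fetch_page / extract / save_asset stand in for the real helpers inside
    website_downloader.py; only the control flow is shown here.
    """
    seen, frontier = {start_url}, deque([start_url])
    assets: queue.Queue = queue.Queue()

    def asset_worker():
        while True:
            url = assets.get()
            if url is None:              # sentinel: shut this worker down
                assets.task_done()
                return
            try:
                save_asset(url)          # images / CSS / JS stream in parallel
            finally:
                assets.task_done()

    workers = [threading.Thread(target=asset_worker, daemon=True) for _ in range(threads)]
    for w in workers:
        w.start()

    pages = 0
    while frontier and pages < max_pages:
        page_url = frontier.popleft()
        html = fetch_page(page_url)
        pages += 1
        links, asset_urls = extract(html, page_url)
        for url in asset_urls:
            assets.put(url)
        for url in links:                # same-origin pages join the BFS frontier
            if url not in seen:
                seen.add(url)
                frontier.append(url)

    assets.join()                        # wait for in-flight asset downloads
    for _ in workers:
        assets.put(None)
```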
Removed: The old `check_download.py` verifier is no longer required because the new downloader performs integrity checks (missing files, broken internal links) during the crawl and reports any issues directly in the log summary.
✅ Type Conversion Fix: Fixed a `TypeError` caused by `int(..., 10)` when non-string arguments were passed.
✅ Safer Path Handling: Added intelligent path shortening and hashing for long filenames to prevent `OSError: [Errno 36] File name too long` errors.
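As context for this fix, here is a hedged sketch of hash-based filename shortening; the length cap, hash algorithm, and helper name `safe_filename` are assumptions for illustration only, not the script's actual implementation.

```python
import hashlib

MAX_NAME = 150  # conservative cap, well under the usual 255-byte limit


def safe_filename(name: str) -> str:
    """Shorten over-long filenames by keeping a readable prefix plus a hash.

    Illustrative only -- the cap and hash length in the real script may differ.
    """
    if len(name.encode("utf-8")) <= MAX_NAME:
        return name
    stem, dot, ext = name.rpartition(".")
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:12]
    base = (stem or name)[:80]
    return f"{base}-{digest}{dot}{ext}" if dot else f"{base}-{digest}"
```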
✅ Improved CLI Experience: Rebuilt argument parsing with `argparse` for cleaner syntax and validation.
✅ Code Quality & Linting: Applied Black + Flake8 formatting; the project now passes all CI lint checks.
✅ Logging & Stability: Improved error handling, logging, and fallback mechanisms for failed writes.
✅ Skip Non-Fetchable Schemes: The crawler now safely skips `mailto:`, `tel:`, `javascript:`, and `data:` links instead of trying to download them. This prevents `requests.exceptions.InvalidSchema: No connection adapters were found` errors and keeps those links intact in the saved HTML.
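A minimal sketch of such a scheme check; the constant and function names below are illustrative, not taken from the script.

```python
from urllib.parse import urlparse

# Schemes the crawler should leave untouched rather than try to fetch.
SKIP_SCHEMES = {"mailto", "tel", "javascript", "data"}


def is_fetchable(url: str) -> bool:
    """Return True only for links that requests can actually download."""
    scheme = urlparse(url).scheme.lower()
    return scheme in ("http", "https", "")  # "" = relative link, resolved later
```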
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.
This project is licensed under the MIT License.