A flexible documentation crawler that can scrape and process documentation from any website.
First install dependencies:

```bash
pip install -r requirements.txt
```

Then install the package in editable mode:

```bash
pip install -e .
```
The `-e` flag installs the package in "editable" mode, which means:

- The package is installed in your Python environment
- Python looks for the package in your current directory instead of copying files
- Changes to the source code take effect immediately without reinstalling
- Required for running the package as a module with `python -m`
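You can confirm that the editable install resolves to your working tree with a quick check. This is a minimal sketch; `docs_crawler` is a placeholder for the package name, which is defined in this project's setup files:

```python
import importlib.util

# "docs_crawler" is a placeholder; use the package name from this project's setup files.
spec = importlib.util.find_spec("docs_crawler")
if spec is None:
    print("Package not found; rerun `pip install -e .` from the project root")
else:
    print(spec.origin)  # an editable install resolves to your source tree, not site-packages
```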
Create a `.env` file in the project root:

```
OPENAI_API_KEY=your_api_key_here
```
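The key is then available to the scraper as an environment variable. Below is a minimal sketch of how it can be loaded, assuming the project uses python-dotenv (check `requirements.txt` for the actual mechanism):

```python
import os

from dotenv import load_dotenv  # python-dotenv; an assumption, not confirmed by this README

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["OPENAI_API_KEY"]
```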
Run the scraper with a URL from the `src` directory:

```bash
cd src
python main.py https://docs.example.com
```
- `-o, --output`: Output directory (default: `output_docs`)
- `-m, --max-pages`: Maximum pages to scrape (default: 1000)
- `-c, --concurrent`: Number of concurrent pages to scrape (default: 1)
Example with all options:

```bash
python main.py https://docs.example.com -o my_docs -m 500 -c 2
```
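For reference, a minimal `argparse` setup consistent with the flags above might look like the following sketch; the actual parser in `main.py` may be structured differently:

```python
import argparse

parser = argparse.ArgumentParser(description="Flexible documentation crawler")
parser.add_argument("base_url", help="The starting URL to crawl")
parser.add_argument("-o", "--output", default="output_docs",
                    help="Output directory (default: output_docs)")
parser.add_argument("-m", "--max-pages", type=int, default=1000,
                    help="Maximum pages to scrape (default: 1000)")
parser.add_argument("-c", "--concurrent", type=int, default=1,
                    help="Number of concurrent pages to scrape (default: 1)")
args = parser.parse_args()  # e.g. args.max_pages, args.concurrent
```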
If you get a `ModuleNotFoundError`, make sure you:

- Have run `pip install -e .` from the project root
- Are running the command from the `src` directory
The crawler accepts the following parameters:

- `base_url`: The starting URL to crawl
- `output_dir`: Directory where scraped docs will be saved
- `max_pages`: Maximum number of pages to crawl
- `max_concurrent_pages`: Number of concurrent pages to process
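Used programmatically, these parameters map to something like the sketch below. The class name `DocsCrawler`, the import path, and the `run()` method are hypothetical names for illustration; check the source in `src` for the real ones:

```python
# Hypothetical names throughout; only the parameters are documented above.
from crawler import DocsCrawler  # assumed module and class name

crawler = DocsCrawler(
    base_url="https://docs.example.com",
    output_dir="output_docs",
    max_pages=1000,
    max_concurrent_pages=1,
)
crawler.run()  # assumed entry-point method
```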
Requirements:

- Python 3.8+
- Chrome/Chromium browser (for Selenium)
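Since the crawler drives Chrome through Selenium, you can verify the browser setup independently with a short smoke test (Selenium 4+; the `--headless=new` flag requires a recent Chrome):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)  # Selenium 4 manages the chromedriver itself
try:
    driver.get("https://docs.example.com")
    print(driver.title)
finally:
    driver.quit()
```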