A flexible documentation crawler that can scrape and process documentation from any website.
First install dependencies:

```bash
pip install -r requirements.txt
```

Then install the package in editable mode:

```bash
pip install -e .
```
The `-e` flag installs the package in "editable" mode, which means:

- The package is installed in your Python environment
- Python looks for the package in your current directory instead of copying files
- Changes to the source code take effect immediately without reinstalling
- Required for running the package as a module with `python -m`
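You can confirm that the editable install resolves to your working tree with a quick check. This is a minimal sketch; `docs_crawler` is a placeholder for the package name, which is defined in this project's setup files:

```python
import importlib.util

# "docs_crawler" is a placeholder; use the package name from this project's setup files.
spec = importlib.util.find_spec("docs_crawler")
if spec is None:
    print("Package not found; rerun `pip install -e .` from the project root")
else:
    print(spec.origin)  # an editable install resolves to your source tree, not site-packages
```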
Create a `.env` file in the project root:

```
OPENAI_API_KEY=your_api_key_here
```
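The key is then available to the scraper as an environment variable. Below is a minimal sketch of how it can be loaded, assuming the project uses python-dotenv (check `requirements.txt` for the actual mechanism):

```python
import os

from dotenv import load_dotenv  # python-dotenv; an assumption, not confirmed by this README

load_dotenv()  # reads .env from the current working directory
api_key = os.environ["OPENAI_API_KEY"]
```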
Run the scraper with a URL from the `src` directory:

```bash
cd src
python main.py https://docs.example.com
```
- `-o, --output`: Output directory (default: `output_docs`)
- `-m, --max-pages`: Maximum pages to scrape (default: 1000)
- `-c, --concurrent`: Number of concurrent pages to scrape (default: 1)
Example with all options:

```bash
python main.py https://docs.example.com -o my_docs -m 500 -c 2
```
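For reference, a minimal `argparse` setup consistent with the flags above might look like the following sketch; the actual parser in `main.py` may be structured differently:

```python
import argparse

parser = argparse.ArgumentParser(description="Flexible documentation crawler")
parser.add_argument("base_url", help="The starting URL to crawl")
parser.add_argument("-o", "--output", default="output_docs",
                    help="Output directory (default: output_docs)")
parser.add_argument("-m", "--max-pages", type=int, default=1000,
                    help="Maximum pages to scrape (default: 1000)")
parser.add_argument("-c", "--concurrent", type=int, default=1,
                    help="Number of concurrent pages to scrape (default: 1)")
args = parser.parse_args()  # e.g. args.max_pages, args.concurrent
```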
If you get a `ModuleNotFoundError`, make sure you:

- Have run `pip install -e .` from the project root
- Are running the command from the `src` directory
The crawler accepts the following parameters:

- `base_url`: The starting URL to crawl
- `output_dir`: Directory where scraped docs will be saved
- `max_pages`: Maximum number of pages to crawl
- `max_concurrent_pages`: Number of concurrent pages to process
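Used programmatically, these parameters map to something like the sketch below. The class name `DocsCrawler`, the import path, and the `run()` method are hypothetical names for illustration; check the source in `src` for the real ones:

```python
# Hypothetical names throughout; only the parameters are documented above.
from crawler import DocsCrawler  # assumed module and class name

crawler = DocsCrawler(
    base_url="https://docs.example.com",
    output_dir="output_docs",
    max_pages=1000,
    max_concurrent_pages=1,
)
crawler.run()  # assumed entry-point method
```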
Requirements:

- Python 3.8+
- Chrome/Chromium browser (for Selenium)
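Since the crawler drives Chrome through Selenium, you can verify the browser setup independently with a short smoke test (Selenium 4+; the `--headless=new` flag requires a recent Chrome):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)  # Selenium 4 manages the chromedriver itself
try:
    driver.get("https://docs.example.com")
    print(driver.title)
finally:
    driver.quit()
```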