This tool crawls a documentation website and converts the pages into a single Markdown document. To avoid duplication, it intelligently removes common sections that appear across multiple pages and includes them once at the end of the document; for example, a footer repeated on every page appears only once in the output.
Version 1.0.0 introduces significant improvements, including support for JavaScript-rendered pages using Playwright and a fully asynchronous implementation.
- JavaScript Rendering: Utilizes Playwright to accurately render pages that rely on JavaScript, ensuring complete and up-to-date content capture.
- Asynchronous Operation: Fully asynchronous methods enhance performance and scalability during the crawling process.

- Crawls documentation websites and combines pages into a single Markdown file.
- Removes common sections that appear across many pages, including them once at the end of the document.
- Customizable similarity threshold to control deduplication sensitivity.
- Configurable selectors to remove specific elements from pages.
- Respects robots.txt rules by default, with an option to ignore them.
- JavaScript rendering that waits for the page to stabilize before scraping.
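As a minimal sketch of a typical run (the URL and path are placeholders; see the Usage section below for all options):

crawl-docs https://example.com /docs/ -o documentation.md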
- Python 3.7 or higher is required.
- (Optional) It is recommended to use a virtual environment to avoid dependency conflicts with other projects.
If you have already cloned the repository or downloaded the source code, you can install the package using pip:
pip install .
This will install the package in your current Python environment.
If you are a developer or want to modify the source code and see your changes reflected immediately, you can install the package in editable mode. This allows you to edit the source files and test the changes without needing to reinstall the package:
pip install -e .
It is recommended to use a virtual environment to isolate the package and its dependencies. Follow these steps to set up a virtual environment and install the package:
- Create a virtual environment (e.g., named venv):
  python -m venv venv
- Activate the virtual environment:
  - On macOS/Linux:
    source venv/bin/activate
  - On Windows:
    .\venv\Scripts\activate
- Install the package inside the virtual environment:
  pip install .
This ensures that all dependencies are installed within the virtual environment.
After installing the package, you need to install the necessary Playwright browser binaries:
playwright install
This command downloads the required browser binaries (Chromium, Firefox, and WebKit) used by Playwright for rendering pages.
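If you only need a single browser engine, Playwright can also install browsers individually, for example:

playwright install chromium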
Once the package is published on PyPI, you can install it directly using:
pip install libcrawler
To upgrade the package to the latest version, use:
pip install --upgrade libcrawler
This will upgrade the package to the newest version available.
You can verify that the package has been installed correctly by running:
pip show libcrawler
This will display information about the installed package, including the version, location, and dependencies.
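The output looks roughly like the following (a sketch; the version, location, and dependency list will reflect your environment):

Name: libcrawler
Version: 1.0.0
Location: /path/to/your/site-packages
Requires: aiofiles, beautifulsoup4, markdownify, playwright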
crawl-docs BASE_URL STARTING_POINT [OPTIONS]
Positional arguments:

- BASE_URL: The base URL of the documentation site (e.g., https://example.com).
- STARTING_POINT: The starting path of the documentation (e.g., /docs/).

Options:

- -o, --output OUTPUT: Output filename (default: documentation.md).
- --no-robots: Ignore robots.txt rules.
- --delay DELAY: Delay between requests in seconds (default: 1.0).
- --delay-range DELAY_RANGE: Range for random delay variation (default: 0.5).
- --remove-selectors SELECTOR [SELECTOR ...]: Additional CSS selectors to remove from pages.
- --similarity-threshold SIMILARITY_THRESHOLD: Similarity threshold for section comparison (default: 0.8).
- --allowed-paths PATH [PATH ...]: List of URL paths to include during crawling.
- --ignore-paths PATH [PATH ...]: List of URL paths to skip during crawling, either before or after fetching content.
- --user-agent USER_AGENT: Specify a custom User-Agent string (which will be harmonized with any additional headers).
- --headers-file FILE: Path to a JSON file containing optional headers. Only one of --headers-file or --headers-json can be used.
- --headers-json JSON: Optional headers as a JSON string.
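For instance, a custom User-Agent and inline headers can be combined as follows (the header names and values here are purely illustrative):

crawl-docs https://example.com /docs/ -o output.md \
    --user-agent "MyDocsBot/1.0" \
    --headers-json '{"Accept-Language": "en-US"}'

The same headers could instead be placed in a JSON file and passed via --headers-file.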
Basic usage:

crawl-docs https://example.com /docs/ -o output.md

Adjusting the similarity threshold and delay range:

crawl-docs https://example.com /docs/ -o output.md \
    --similarity-threshold 0.7 \
    --delay-range 0.3

Removing additional page elements:

crawl-docs https://example.com /docs/ -o output.md \
    --remove-selectors ".sidebar" ".ad-banner"

Restricting the crawl to specific paths:

crawl-docs https://example.com / -o output.md \
    --allowed-paths "/docs/" "/api/"

Skipping specific paths:

crawl-docs https://example.com /docs/ -o output.md \
    --ignore-paths "/old/" "/legacy/"
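Request pacing and robots.txt handling can be tuned the same way (the delay value below is arbitrary):

Slowing down requests and ignoring robots.txt:

crawl-docs https://example.com /docs/ -o output.md \
    --no-robots \
    --delay 2.0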
- Python 3.7 or higher
- BeautifulSoup4 for HTML parsing.
- markdownify for converting HTML to Markdown.
- Playwright for headless browser automation and JavaScript rendering.
- aiofiles for asynchronous file operations.
- Additional dependencies are listed in requirements.txt.
After setting up your environment, install all required dependencies using:
pip install -r requirements.txt
Note: Ensure you have installed the Playwright browsers by running playwright install, as mentioned in the Installation section.
This project is licensed under the LGPLv3. See the [LICENSE](LICENSE) file for details.
Contributions are welcome! Please follow these steps to contribute:
- Fork the repository on GitHub.
- Clone your fork to your local machine:
git clone https://github.com/your-username/libcrawler.git
- Create a new branch for your feature or bugfix:
git checkout -b feature-name
- Make your changes and commit them with clear messages:
git commit -m "Add feature X"
- Push your changes to your fork:
git push origin feature-name
- Open a Pull Request on the original repository describing your changes.
Please ensure your code adheres to the project's coding standards and includes appropriate tests.