This repository embarks on an expedition to discover and explore content from two major Python-related websites: python.org and wiki.python.org. We leverage two powerful tools to achieve this:
- **Safarnama:** Inspired by Nasir Khusraw’s timeless travelogue, Safarnama is a web crawling tool that journeys through websites, cleans up HTML content, and uses a language model to generate summaries and extract key tags. It works both from the command line and programmatically in Python.
- **Woodsman:** Named after Kino’s formidable weapon, Woodsman is a generic SQL database viewer built with Streamlit. It lets you navigate and conquer your SQL databases by reflecting schemas, interactively exploring table data, filtering results, and exporting data in various formats.
We also extend our thanks to jinaai for providing the readerlm-v2 model. This model is used to summarize content and extract tags. Please note that the model is licensed under a CC Non-Commercial license.
```text
.
├── README.md
├── config_python.yaml    # Configuration for crawling python.org
├── config_wiki.yaml      # Configuration for crawling wiki.python.org (using depth 4)
├── data
│   ├── python_org.db     # SQLite DB generated by Safarnama (crawling python.org)
│   ├── python_org.log    # Log file for the python.org crawl
│   ├── python_org.md     # Markdown generated from the DB by Woodsman; detailed info (URLs, summaries, tags) for python.org (depth 2)
│   ├── sitemap_python.xml  # Sitemap generated for python.org
│   ├── wiki.db           # SQLite DB generated by Safarnama (crawling wiki.python.org)
│   ├── wiki.log          # Log file for the wiki.python.org crawl
│   └── wiki.md           # Markdown in progress for wiki.python.org (depth 4 crawl & analysis)
├── hello.py              # Python script that loads the configurations and runs the crawlers sequentially
├── pyproject.toml        # Project configuration for dependency management
└── uv.lock               # Lock file for uv package management
```
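For reference, a crawl configuration might look roughly like the sketch below. Only `max_depth` is actually named in this README; the remaining keys (start URL, output paths) are guesses at Safarnama's config schema, so check the Safarnama documentation for the real key names.

```yaml
# config_wiki.yaml (hypothetical sketch: only max_depth: 4 is
# confirmed by this README; the other keys are assumptions)
start_url: https://wiki.python.org
max_depth: 4
database: data/wiki.db
log_file: data/wiki.log
```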
**Crawling with Safarnama (Python 3.10+):**

- **python.org Crawl:** The crawler uses `config_python.yaml` with a depth setting of 2. Results are stored in the SQLite database `data/python_org.db` and are subsequently processed to generate `data/python_org.md`. This markdown file contains detailed information (URLs, summaries, and tags) extracted from the database. Users can later use Woodsman to filter and refine this data, tailoring it to their specific needs.
- **wiki.python.org Crawl:** The crawler for wiki.python.org is configured via `config_wiki.yaml` and uses a deeper crawl (depth 4). Data from this crawl is stored in `data/wiki.db` and logged to `data/wiki.log`. The output is being consolidated into `data/wiki.md` as the crawl progresses and analysis is refined.
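Once a crawl finishes, you can sanity-check the resulting database with nothing but the standard library. This sketch lists the tables Safarnama created; the actual table names depend on Safarnama's schema, which this README does not specify.

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables in a SQLite database."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]
    finally:
        conn.close()


# Example: list_tables("data/python_org.db")
```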
**Exploring Data with Woodsman:**

- Woodsman lets you navigate the data stored in the SQLite databases (e.g., `python_org.db` and `wiki.db`). You can interactively filter the detailed information (such as URLs, summaries, and tags) provided in the markdown files (e.g., `data/python_org.md`) and export your filtered, targeted version in formats like JSON, CSV, or Markdown.
**Running the Code:**

- The `hello.py` script demonstrates how to load the configurations and run the Safarnama crawlers sequentially. It first processes python.org and then continues with wiki.python.org.
**Prerequisites:**

- Python 3.10+
- Install Safarnama and Woodsman via pip, Poetry, or your preferred package manager. For example, using pip:

  ```shell
  pip install safarnama woodsman
  ```
**Update the Configuration:**

- Ensure `config_python.yaml` points to `https://python.org`.
- Ensure `config_wiki.yaml` points to `https://wiki.python.org` and set `max_depth: 4`.
**Execute the Script:**

Run the `hello.py` script to start the crawling process:

```shell
python hello.py
```

This script will:

- Crawl python.org and store results in the `data` folder.
- Crawl wiki.python.org with a deeper exploration (depth 4) while processing the data as the crawl progresses.
**Explore with Woodsman:**

Launch Woodsman to interactively filter and navigate the crawled data:

```shell
woodsman
```

Use Woodsman to load your SQLite databases (e.g., `python_org.db` or `wiki.db`), apply filters, and export a targeted version of your data.
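Woodsman handles filtering and export interactively, but the same filter-and-export step can be scripted with the standard library. The sketch below assumes a hypothetical `pages` table with `url`, `summary`, and `tags` columns; this is a guess at Safarnama's schema, not a documented interface, so adjust the query to the tables you actually find in the database.

```python
import json
import sqlite3


def export_pages_by_tag(db_path: str, tag: str, out_path: str) -> int:
    """Export rows whose tags mention `tag` to a JSON file.

    Assumes a pages(url, summary, tags) table, which is an
    assumption about Safarnama's output, not a documented schema.
    Returns the number of exported records.
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT url, summary, tags FROM pages WHERE tags LIKE ?",
            (f"%{tag}%",),
        ).fetchall()
    finally:
        conn.close()
    records = [{"url": u, "summary": s, "tags": t} for (u, s, t) in rows]
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    return len(records)


# Example: export_pages_by_tag("data/python_org.db", "tutorial", "tutorial_pages.json")
```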
Contributions to this project, as well as to Safarnama and Woodsman, are welcome; please refer to their respective repositories for contribution guidelines.
Feel free to fork this repository, make improvements, and submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
The readerlm-v2 model is provided by jinaai under a CC Non-Commercial license.
- Safarnama: Developed by Ali Tavallaie and inspired by Nasir Khusraw’s travelogue.
- Woodsman: Also developed by Ali Tavallaie, drawing inspiration from Kino’s powerful weapon.
- jinaai: Thanks for providing the readerlm-v2 model for summarizing content and extracting tags.
Enjoy your journey through the digital landscapes of python.org and wiki.python.org, and happy exploring!