
Python.org & Wiki.python.org Adventure

This repository embarks on an expedition to discover and explore content from two major Python-related websites: python.org and wiki.python.org. We leverage two powerful tools to achieve this:

  • Safarnama:
    Inspired by Nasir Khusraw’s timeless travelogue, Safarnama is a web crawling tool that journeys through websites, cleans up HTML content, and uses a language model to generate summaries and extract key tags. It works both from the command line and programmatically in Python.

  • Woodsman:
    Named after Kino’s formidable weapon, Woodsman is a generic SQL database viewer built with Streamlit. It lets you navigate and conquer your SQL databases by reflecting schemas, interactively exploring table data, filtering results, and exporting data in various formats.

We also extend our thanks to jinaai for providing the readerlm-v2 model, which is used to summarize content and extract tags. Please note that the model is released under a CC Non-Commercial license.


Repository Structure

.
├── README.md
├── config_python.yaml       # Configuration for crawling python.org
├── config_wiki.yaml         # Configuration for crawling wiki.python.org (using depth 4)
├── data
│   ├── python_org.db        # SQLite DB generated by Safarnama (crawling python.org)
│   ├── python_org.log       # Log file for python.org crawl
│   ├── python_org.md        # Markdown file generated from the DB by Woodsman; includes detailed information (URLs, summaries, tags) for python.org (depth 2)
│   ├── sitemap_python.xml   # Sitemap generated for python.org
│   ├── wiki.db              # SQLite DB generated by Safarnama (crawling wiki.python.org)
│   ├── wiki.log             # Log file for wiki.python.org crawl
│   └── wiki.md              # Markdown file in progress for wiki.python.org (depth 4 crawl & analysis)
├── hello.py                 # Python script that loads configurations and runs the crawlers sequentially
├── pyproject.toml           # Project configuration for dependency management
└── uv.lock                  # Lock file for uv package management
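
The generated sitemap (data/sitemap_python.xml) can be inspected without any extra tooling. Below is a minimal sketch using only the standard library, assuming the file follows the standard sitemap.org schema (a urlset of url/loc entries); Safarnama's actual output layout may differ:

# list_sitemap.py -- print every URL recorded in the generated sitemap.
# Assumption: the file uses the standard sitemap.org namespace; adjust
# the namespace or element paths if Safarnama emits a different layout.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("data/sitemap_python.xml")
for loc in tree.getroot().findall("sm:url/sm:loc", NS):
    print(loc.text)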

How It Works

  1. Crawling with Safarnama (Python 3.10+):

    • python.org Crawl:
      The crawler uses config_python.yaml with a depth setting of 2. Results are stored in the SQLite database data/python_org.db and are subsequently processed to generate data/python_org.md. This markdown file contains detailed information—URLs, summaries, and tags—extracted from the database. Users can later use Woodsman to filter and refine this data, tailoring it to their specific needs.

    • wiki.python.org Crawl:
      The crawler for wiki.python.org is configured via config_wiki.yaml and uses a deeper crawl (depth 4). Data from this crawl is stored in data/wiki.db and logged to data/wiki.log. The output is being consolidated into data/wiki.md as the crawl progresses and analysis is refined.

  2. Exploring Data with Woodsman:

    • Woodsman lets you navigate the data stored in the SQLite databases (e.g., python_org.db and wiki.db). You can interactively filter the detailed information these databases hold (URLs, summaries, and tags, the same information captured in markdown files such as data/python_org.md) and export the filtered result in formats like JSON, CSV, or Markdown.
  3. Running the Code:

    • The hello.py script demonstrates how to load the configurations and run the Safarnama crawlers sequentially. It first processes python.org and then continues with wiki.python.org.
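
For reference, here is a minimal sketch of what a script like hello.py could look like. Only the configuration loading (via PyYAML) is concrete; the Safarnama calls are left as comments because its exact API is not documented here, and the Crawler name shown is a hypothetical placeholder:

# hello.py (sketch) -- loads both crawl configurations and would run the
# crawlers one after the other. The Safarnama entry point in the comment
# below is assumed, not confirmed; see the Safarnama repository.
import yaml  # PyYAML

def load_config(path):
    """Read a crawl configuration such as config_python.yaml."""
    with open(path) as f:
        return yaml.safe_load(f)

def main():
    for config_path in ("config_python.yaml", "config_wiki.yaml"):
        cfg = load_config(config_path)
        # Hypothetical call -- replace with Safarnama's real entry point,
        # e.g. something along the lines of:
        #   from safarnama import Crawler
        #   Crawler(**cfg).run()
        print(f"Would crawl using settings from {config_path}: {cfg}")

if __name__ == "__main__":
    main()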

Getting Started

Prerequisites

  • Python 3.10+
  • Install Safarnama and Woodsman via pip, uv (this repository ships a uv.lock), or your preferred package manager.

For example, using pip:

pip install safarnama woodsman

Running the Crawls

  1. Update the Configuration:

    • Ensure config_python.yaml points to https://python.org and uses max_depth: 2.
    • Ensure config_wiki.yaml points to https://wiki.python.org and set max_depth: 4.
  2. Execute the Script:

    Run the hello.py script to start the crawling process:

    python hello.py

    This script will:

    • Crawl python.org and store results in the data folder.
    • Crawl wiki.python.org with a deeper exploration (depth 4) while processing the data as the crawl progresses.
  3. Explore with Woodsman:

    Launch Woodsman to interactively filter and navigate the crawled data:

    woodsman

    Use Woodsman to load your SQLite databases (e.g., python_org.db or wiki.db), apply filters, and export a targeted version of your data.
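
If you prefer scripting to a UI, the crawl databases are ordinary SQLite files, so you can also query and export them directly with Python's standard library rather than through Woodsman. A minimal sketch follows; the table and column names (pages, url, summary, tags) are assumptions about Safarnama's schema, so inspect the real schema first (for example with sqlite3 data/python_org.db ".schema"):

# export_pages.py -- dump crawl results from the SQLite DB to CSV.
# Assumption: a table named "pages" with url/summary/tags columns exists;
# the actual Safarnama schema may use different names.
import csv
import sqlite3

conn = sqlite3.connect("data/python_org.db")
rows = conn.execute("SELECT url, summary, tags FROM pages")  # assumed schema

with open("python_org_export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "summary", "tags"])
    writer.writerows(rows)

conn.close()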


Contributing

Contributions to this project, as well as to Safarnama and Woodsman, are welcome. Please refer to their respective repositories for contribution guidelines.

Feel free to fork this repository, make improvements, and submit a pull request.


License

This project is licensed under the MIT License. See the LICENSE file for details.
The readerlm-v2 model is provided by jinaai under a CC Non-Commercial license.


Credits

  • Safarnama: Developed by Ali Tavallaie and inspired by Nasir Khusraw’s travelogue.
  • Woodsman: Also developed by Ali Tavallaie, drawing inspiration from Kino’s powerful weapon.
  • jinaai: Thanks for providing the readerlm-v2 model for summarizing content and extracting tags.

Enjoy your journey through the digital landscapes of python.org and wiki.python.org, and happy exploring!
