## Deprecated: use https://github.com/noanchovies/scraper-engine instead

This version was cool but not good enough. scraper-engine is faster, plug & play, "more better". The old version is kept here for logging and learning purposes.
A generic base template project for building web scrapers using Python, Selenium, BeautifulSoup, and Typer. Designed to be easily copied and adapted for various scraping targets.
Note: For quickly briefing AI assistants (like Google Gemini) on this template's structure and how to adapt it, refer to the `AI_CONTEXT.txt` file in the project root. It includes a summary and an adaptation checklist for next steps.
## Features

- Selenium WebDriver: Uses Selenium with `webdriver-manager` for automated browser control, capable of handling dynamic, JavaScript-heavy websites. Headless mode is configurable.
- HTML Parsing: Integrates BeautifulSoup for parsing the HTML obtained via Selenium.
- Configurable: Easily configure target URLs, output filenames, wait times, and headless mode via `src/basescraper/config.py`, `.env` files, or command-line arguments.
- CLI Interface: Uses Typer to provide a clean command-line interface for running the scraper.
- Modular Structure: Separates concerns into configuration (`config.py`), core scraping logic (`scraper.py`), and CLI (`cli.py`) within a standard `src` layout.
- Placeholder Implementation: The core data extraction (`extract_data`) and data handling (`handle_data`) functions are provided as clear placeholders (raising `NotImplementedError`) that must be implemented for each specific scraping project.
- Structured Output (Example): Includes an optional pattern for saving data to CSV using Pandas (a `save_to_csv` function, commented out within the `handle_data` placeholder).
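As an illustration of the browser-automation feature above, here is a minimal sketch of the Selenium plus `webdriver-manager` setup pattern (assuming Chrome; `create_driver` is a hypothetical name, not necessarily what `scraper.py` defines):

```python
# Illustrative sketch only; names are assumptions, not the template's actual API.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


def create_driver(headless: bool = True) -> webdriver.Chrome:
    """Start a Chrome driver, downloading a matching chromedriver automatically."""
    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless=new")  # modern headless mode (Chrome 109+)
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)


driver = create_driver()
driver.get("https://example.com")
page_source = driver.page_source  # hand this HTML to BeautifulSoup for parsing
driver.quit()
```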
## Tech Stack

- Language: Python 3.8+
- Browser Automation: Selenium
- Driver Management: webdriver-manager
- HTML Parsing: BeautifulSoup4
- Data Handling (Example): Pandas (for the CSV-saving pattern)
- CLI: Typer, Rich
- Configuration: python-dotenv
- Packaging: setuptools, `pyproject.toml`
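For a sense of what the Typer layer looks like, here is a minimal sketch in the spirit of `src/basescraper/cli.py` (option names follow the examples in Getting Started below; the real module's signatures may differ):

```python
# A minimal Typer CLI sketch; the actual cli.py may define more options.
import typer

app = typer.Typer()


@app.command()
def run(
    url: str = typer.Option(..., "--url", help="Target URL to scrape."),
    output: str = typer.Option("output.csv", "--output", "-o", help="Output file."),
    headless: bool = typer.Option(True, help="Run the browser headless (use --no-headless to disable)."),
):
    """Run the scraper against URL and write results to OUTPUT."""
    typer.echo(f"Scraping {url} -> {output} (headless={headless})")
    # ... call into the scraping logic in scraper.py here ...


if __name__ == "__main__":
    app()
```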
## Getting Started

- Copy Template: Create a new project by copying this entire `base-scraper-py` directory.
- Navigate: `cd` into your new project directory.
- Create Virtual Environment: `python -m venv venv`
- Activate Environment:
  - Windows: `.\venv\Scripts\activate`
  - macOS/Linux: `source venv/bin/activate`
- Install Dependencies: `pip install -r requirements.txt`
- (Optional) Git Init: If desired, delete the copied `.git` folder, run `git init`, create a new remote repository, and link it (`git remote add origin <url>`).
- Implement Logic: Follow the detailed steps in `HOW_TO_USE_BASE.txt` to implement the required `extract_data` and `handle_data` functions within `src/basescraper/scraper.py` for your specific target website (an example sketch is given in the Adaptation section below).
- Configure: Set your target URL and other parameters in `.env` or `src/basescraper/config.py` (see the config sketch after this list).
- Run from CLI:
  - Option A (run as a module): `python -m src.basescraper.cli run [OPTIONS]`
  - Option B (if installed editable via `pip install -e .`): `basescraper run [OPTIONS]`
- CLI Options: Use `--help` to see the available options: `python -m src.basescraper.cli run --help` or `basescraper run --help`. Example: `basescraper run --url "your-target-url.com" -o "my_output.csv" --no-headless`
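For the Configure step above, the python-dotenv pattern typically looks like the following sketch (the variable names `TARGET_URL`, `OUTPUT_FILE`, etc. are assumptions; check `src/basescraper/config.py` for the actual ones):

```python
# Sketch of the config pattern: values from a .env file override these defaults.
import os

from dotenv import load_dotenv

load_dotenv()  # read key=value pairs from a .env file in the project root

TARGET_URL = os.getenv("TARGET_URL", "https://example.com")
OUTPUT_FILE = os.getenv("OUTPUT_FILE", "output.csv")
WAIT_SECONDS = int(os.getenv("WAIT_SECONDS", "10"))
HEADLESS = os.getenv("HEADLESS", "true").lower() in ("1", "true", "yes")
```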
## Adaptation

The core adaptation steps are detailed in `HOW_TO_USE_BASE.txt`. Adapting the template primarily involves implementing two functions in `src/basescraper/scraper.py`:

- `extract_data(page_source)`: Add logic using BeautifulSoup selectors to parse the HTML (`page_source`) from your target site and return a list of dictionaries.
- `handle_data(data, output_target)`: Add logic to process the list of dictionaries returned by `extract_data` (e.g., save to CSV, a database, or an API).
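For illustration, a concrete implementation might look like the sketch below (the CSS selectors and field names are hypothetical and must be replaced with ones matching your target site's markup):

```python
# Example implementation sketch; selectors (div.listing, h2, .price) are made up.
from typing import Dict, List

import pandas as pd
from bs4 import BeautifulSoup


def extract_data(page_source: str) -> List[Dict]:
    """Parse the Selenium page source and return one dict per scraped item."""
    soup = BeautifulSoup(page_source, "html.parser")
    rows = []
    for card in soup.select("div.listing"):  # hypothetical item container
        title = card.select_one("h2")
        price = card.select_one(".price")
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return rows


def handle_data(data: List[Dict], output_target: str) -> None:
    """Persist the scraped rows; here, the CSV-via-Pandas pattern from the template."""
    pd.DataFrame(data).to_csv(output_target, index=False)
```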
## License

MIT License. (Update the LICENSE file and `pyproject.toml` if you use a different license.)
## Contributing

(Add details here if you plan for others to contribute, or how to contact you.)