IMPORTANT NOTE: This is a test project I created with Copilot Workspace. I use it for experimenting with other AI-collaboration platforms, for RAG/indexing it and feeding it to AI agents, and simply for having fun with it.
The Web Scraping Sandbox is a modular and extensible web scraping framework designed to simplify the process of extracting data from websites. It provides a set of tools and utilities to facilitate web scraping tasks, including making HTTP requests, parsing HTML content, and handling various web scraping scenarios.
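As a taste of the kind of task the framework handles, here is a minimal, framework-independent sketch of the HTML-parsing step using only the Python standard library. The `LinkExtractor` and `extract_links` names below are illustrative, not part of this project's API:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag encountered in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    """Parse an HTML string and return all hyperlink targets."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


html = '<html><body><a href="/docs">Docs</a><a href="/about">About</a></body></html>'
print(extract_links(html))  # → ['/docs', '/about']
```

A real scraper layers HTTP fetching, error handling, and richer selectors on top of this parsing core.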
- Python 3.11 or higher
- Docker (optional, for containerized setup)
1. Clone the repository:

   ```bash
   git clone https://github.com/githubnext/web-scraping-sandbox.git
   cd web-scraping-sandbox
   ```

2. Create a virtual environment and activate it:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```
1. Build the Docker image:

   ```bash
   docker build -t web-scraping-sandbox .
   ```

2. Run the Docker container:

   ```bash
   docker run -it --rm web-scraping-sandbox
   ```
1. Navigate to the `web-scraping-sandbox/ui` directory:

   ```bash
   cd web-scraping-sandbox/ui
   ```

2. Install the dependencies:

   ```bash
   npm install
   ```

3. Start the React development server:

   ```bash
   npm start
   ```

4. Open your browser and navigate to `http://localhost:3000` to access the UI.
1. Build the Docker image:

   ```bash
   docker build -t web-scraping-sandbox .
   ```

2. Run the Docker container, publishing both the UI and API ports:

   ```bash
   docker run -it --rm -p 3000:3000 -p 5000:5000 web-scraping-sandbox
   ```

3. Open your browser and navigate to `http://localhost:3000` to access the UI.
To run the unit tests using pytest, execute the following command:

```bash
pytest
```
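Tests follow the standard pytest convention: files named `test_*.py` containing plain `test_*` functions with bare `assert` statements. As a self-contained illustration of that style (the `normalize_title` helper below is hypothetical, standing in for whatever parsing utilities your own tests exercise):

```python
# test_parsing.py -- illustrative pytest-style tests.
# `normalize_title` is a hypothetical helper, not part of this repository.


def normalize_title(raw: str) -> str:
    """Collapse runs of whitespace in a scraped page title and strip the ends."""
    return " ".join(raw.split())


def test_normalize_title_strips_whitespace():
    assert normalize_title("  Example \n Domain  ") == "Example Domain"


def test_normalize_title_empty():
    assert normalize_title("") == ""
```

Running `pytest` from the repository root discovers and executes any such test files automatically.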
Here's an example of how to use the web scraper:

1. Create a Python script (e.g., `example.py`) with the following content:

   ```python
   from src.scraper import Scraper

   url = "https://example.com"
   scraper = Scraper(url)
   data = scraper.scrape()
   print(data)
   ```

2. Run the script:

   ```bash
   python example.py
   ```