A Streamlit‑based application to scrape one or more websites, parse and index their content, and interactively query the data using a local LLM (Llama 3.1 via Ollama).
- Multi‐URL support: input multiple URLs (one per line)
- Cloudflare bypass using headless Chromium (`DrissionPage`)
- HTML parsing & content grouping (`BeautifulSoup`)
- Clean text extraction & heading-based grouping
- Embeddings generation (`sentence-transformers`)
- FAISS L2 indexing of content for fast retrieval
- LLM‑powered querying with subquery decomposition
- JSON export of scraped data
- URL‑wise filtering of grouped content in the UI
- `app.py` – Streamlit UI and control flow
- `scraper.py` – Page fetching, CF bypass, parsing, indexing
- `cloudfare_bypasser.py` – Logic to click through Cloudflare checks
- `llm.py` – Subquery generation & prompt orchestration with `OllamaLLM`
- `requirements.txt` – Python dependencies
- `.gitignore` – Files/folders to ignore
- `website_data.json` – (auto-generated) scraped output
- Python 3.8+
- Google Chrome / Chromium installed
- Chrome Headless Shell binary (`chrome-headless-shell`)
  - Place the `chrome-headless-shell` binary in the project root with its dependencies
- Ollama runtime for Llama 3.1
- Install dependencies: `pip install -r requirements.txt`
- Activate your virtual environment
- Install dependencies (`pip install -r requirements.txt`)
- Populate `.env` as above
- Launch the app: `streamlit run app.py`
- In the browser:
  - Step 1: Paste URLs (one per line) and click Scrape Site
  - Step 2: Enter your query; enable “Analyze and break down complex queries” to auto-split multi-part questions
- Scraping (`scraper.py`):
  - Launches a headless Chromium session
  - Bypasses Cloudflare challenges via `CloudflareBypasser`
  - Waits for JS content, inlines iframes, extracts cleaned HTML
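  A minimal sketch of this fetch step, assuming DrissionPage's `ChromiumPage` API and the bundled `chrome-headless-shell` binary (the path and option names here are illustrative, not the exact `scraper.py` code):

  ```python
  from DrissionPage import ChromiumOptions, ChromiumPage

  def fetch_html(url: str) -> str:
      # Point DrissionPage at the bundled headless Chromium binary
      # (path is an assumption; adjust to where you placed it).
      options = ChromiumOptions()
      options.set_browser_path("./chrome-headless-shell")
      options.headless(True)

      page = ChromiumPage(options)
      page.get(url)      # navigate and wait for the initial load
      html = page.html   # rendered HTML after JS execution
      page.quit()
      return html

  if __name__ == "__main__":
      print(fetch_html("https://example.com")[:500])
  ```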
- Parsing: strips navigation/footer, groups by headings (h1–h5)
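  A rough illustration of the heading-based grouping idea with BeautifulSoup (the actual parsing in `scraper.py` may differ; the tag choices here are assumptions):

  ```python
  from bs4 import BeautifulSoup

  def group_by_headings(html):
      soup = BeautifulSoup(html, "html.parser")

      # Strip navigation/footer chrome before extracting text.
      for tag in soup(["nav", "footer", "script", "style"]):
          tag.decompose()

      groups = {}
      current = "Introduction"  # bucket for text before the first heading
      for el in soup.find_all(["h1", "h2", "h3", "h4", "h5", "p", "li"]):
          if el.name.startswith("h"):
              current = el.get_text(strip=True) or current
          else:
              groups[current] = groups.get(current, "") + el.get_text(" ", strip=True) + " "
      return groups
  ```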
- Indexing:
  - Generates embeddings for titles & content
  - Builds a FAISS L2 index for nearest-neighbor search
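  The embedding-and-index step looks roughly like this (the model name `all-MiniLM-L6-v2` is an assumption; check `scraper.py` for the one actually used):

  ```python
  import faiss
  from sentence_transformers import SentenceTransformer

  sections = ["Pricing plans ...", "API reference ...", "Getting started ..."]

  model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model
  embeddings = model.encode(sections, convert_to_numpy=True).astype("float32")

  index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 (Euclidean) index
  index.add(embeddings)

  # Retrieve the top-k sections closest to a query.
  query = model.encode(["How much does it cost?"], convert_to_numpy=True).astype("float32")
  distances, ids = index.search(query, 2)
  print([sections[i] for i in ids[0]])
  ```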
- Querying (`llm.py`):
  - Optionally decomposes complex queries into subqueries
  - Retrieves top-k relevant sections from FAISS
  - Builds a prompt and invokes Llama 3.1 via `OllamaLLM`
  - Displays combined or subquery responses
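  A simplified sketch of the prompt step, assuming the `langchain-ollama` package's `OllamaLLM` class (the real prompt templates in `llm.py` will differ):

  ```python
  from langchain_ollama import OllamaLLM

  llm = OllamaLLM(model="llama3.1")  # requires a running Ollama server

  def answer(query, sections):
      # Stuff the retrieved sections into the prompt as context.
      context = "\n\n".join(sections)
      prompt = (
          "Answer the question using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
      )
      return llm.invoke(prompt)
  ```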
- Scraped content is saved as `website_data.json` (overwritten each run)
- You can adjust the output filename in `app.py` or `scraper.py`
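To inspect the exported data outside the UI (assuming a standard JSON file, as the extension suggests):

```python
import json

with open("website_data.json", encoding="utf-8") as f:
    data = json.load(f)

# Show the top-level structure to see how scraped sections are grouped.
print(type(data), len(data))
```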