Zero-Cost AI Web Scraper

🚀 Complete self-hosted AI web scraping stack with ZERO external API costs!

This project combines powerful open-source tools to create a comprehensive web scraping and AI extraction system that runs entirely locally. No OpenAI, no Pinecone, no external APIs - just pure local AI power.

🎯 What This Does

  • Web Scraping: Extract content from any website
  • AI Extraction: Structure data using local LLMs
  • Semantic Search: Find similar content across scraped data
  • Embedding Storage: Auto-generate and store vectors
  • MCP Integration: Works with Claude Code and AI assistants

⚡ Quick Start

# 1. Clone this repository
git clone https://github.com/Maheidem/zero-cost-ai-scraper.git
cd zero-cost-ai-scraper

# 2. Copy environment template
cp .env.example .env

# 3. Start everything
docker-compose up -d

# 4. Install Ollama models (takes 15-30 minutes)
./scripts/install-models.sh

# 5. Verify setup
./scripts/health-check.sh

That's it! Your zero-cost AI scraper is ready.
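If you want to verify the services by hand rather than relying on the health-check script, a minimal sketch like the following works, assuming the stack exposes the default ports (Ollama on 11434, Qdrant on 6333, Firecrawl on 3002 as in the examples below):

import requests

# Assumed default ports; adjust if your docker-compose.yml maps different ones.
checks = {
    "Ollama": "http://localhost:11434/api/tags",    # lists installed models
    "Qdrant": "http://localhost:6333/collections",  # lists vector collections
    "Firecrawl": "http://localhost:3002",           # scraping API root
}

for name, url in checks.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.ConnectionError:
        print(f"{name}: not reachable at {url}")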

🛠️ What's Included

Core Services

  • Firecrawl: Web scraping engine (builds from source with patches)
  • Ollama: Local LLM processing (qwen3-coder:30b, nomic-embed-text)
  • Qdrant: Vector database for embeddings
  • SearXNG: Privacy-focused web search
  • MCP Server: 6 tools for AI assistant integration

Key Features

  • ✅ Zero External API Costs
  • ✅ Privacy-First (everything runs locally)
  • ✅ MCP Compatible (works with Claude Code)
  • ✅ Auto-Embeddings (builds searchable knowledge base)
  • ✅ Semantic Search (find similar content)
  • ✅ One-Command Setup

📊 Performance

| Operation            | Time   | Cost  |
|----------------------|--------|-------|
| AI Extraction        | 15-20s | $0.00 |
| Web Scraping         | 2-5s   | $0.00 |
| Embedding Generation | 1-2s   | $0.00 |
| Similarity Search    | <500ms | $0.00 |
| Total Monthly Cost   | -      | $0.00 |

Compare this to cloud solutions that cost $50-200/month!
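Your numbers will depend on hardware (the 30B extraction model is the main factor). A rough way to reproduce the scraping timing yourself, assuming the default Firecrawl port used in the examples below:

import time
import requests

# Any URL works here; https://example.com is just a placeholder.
start = time.perf_counter()
response = requests.post("http://localhost:3002/v1/scrape", json={
    "url": "https://example.com",
    "formats": ["markdown"]
})
print(f"Scrape took {time.perf_counter() - start:.1f}s (HTTP {response.status_code})")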

🔧 Architecture

┌─────────────────┐    ┌──────────────┐    ┌─────────────────┐
│   AI Assistant  │───▶│  MCP Server  │───▶│   Firecrawl     │
│   (Claude Code) │    │  (6 tools)   │    │ (Web Scraping)  │
└─────────────────┘    └──────────────┘    └─────────────────┘
         │                     │                    │
         │             ┌──────────────┐             │
         │             │   SearXNG    │◀────────────┘
         │             │ (Web Search) │
         │             └──────────────┘
         ▼                     │
┌─────────────────┐    ┌──────────────┐
│     Ollama      │    │    Qdrant    │
│ qwen3-coder:30b │    │ (Vector DB)  │
│ nomic-embed-text│    │              │
└─────────────────┘    └──────────────┘

🎮 Usage Examples

Web Scraping with Auto-Embedding

import requests

# Scrape and auto-store embeddings
response = requests.post("http://localhost:3002/v1/scrape", json={
    "url": "https://python.org/about",
    "formats": ["markdown"]
})
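Assuming the response follows Firecrawl's usual shape (a data object containing the requested formats), the scraped markdown can then be read like this:

# Assumed response shape: {"success": ..., "data": {"markdown": "..."}}
result = response.json()
markdown = result.get("data", {}).get("markdown", "")
print(markdown[:500])  # preview the first 500 characters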

AI Extraction (Zero Cost)

# Extract structured data using local AI
response = requests.post("http://localhost:3002/v1/extract", json={
    "urls": ["https://news.ycombinator.com"],
    "prompt": "Extract top stories and their scores",
    "schema": {
        "stories": {
            "type": "array",
            "items": {
                "title": "string",
                "score": "number",
                "url": "string"
            }
        }
    }
})

# Notice: "llmUsage": 0 (local processing!)
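To work with the extracted data, something like the following should do; the exact nesting can differ between Firecrawl builds, so treat the "data" key as an assumption:

# Assumes the structured output is returned under "data", matching the schema above.
result = response.json()
for story in result.get("data", {}).get("stories", []):
    print(story.get("score"), story.get("title"))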

Semantic Search

# Find similar content across all scraped data
response = requests.post("http://localhost:3002/v1/similarity-search", json={
    "query": "machine learning tutorials",
    "limit": 5
})
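Since /v1/similarity-search is added by this project's patches, the response shape may vary; the "results", "url", and "score" fields below are illustrative assumptions:

# Hypothetical result fields; check the actual response for your build.
for hit in response.json().get("results", []):
    print(f'{hit.get("score", 0):.3f}  {hit.get("url", "")}')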

MCP Integration (Claude Code)

// Use any of the 6 MCP tools:
// - firecrawl_scrape
// - firecrawl_extract
// - firecrawl_similarity_search
// - firecrawl_search
// - firecrawl_crawl
// - firecrawl_map

// Example: Extract data from Reddit
mcp__firecrawl-local__firecrawl_extract({
  "urls": ["https://reddit.com/r/programming"],
  "prompt": "Extract trending programming topics",
  "schema": {"topics": ["string"], "engagement": "number"}
})

๐Ÿ“ Project Structure

zero-cost-ai-scraper/
โ”œโ”€โ”€ mcp-server/              # MCP server (our code, MIT license)
โ”œโ”€โ”€ searxng/                 # Search engine configuration
โ”œโ”€โ”€ configs/                 # Service configurations
โ”œโ”€โ”€ scripts/                 # Setup and maintenance scripts
โ”œโ”€โ”€ patches/                 # Firecrawl modification patches
โ”œโ”€โ”€ examples/                # Usage examples
โ”œโ”€โ”€ docker-compose.yml       # Complete stack definition
โ””โ”€โ”€ docs/                    # Comprehensive documentation

🔗 Works With

  • Original Projects (we build upon):

    • Firecrawl (web scraping engine)
    • SearXNG (privacy-focused search)
    • Ollama (local LLM runtime)
    • Qdrant (vector database)

  • AI Assistants:

    • Claude Code (via MCP)
    • Any MCP-compatible tool
    • Direct API integration

💡 Use Cases

  1. Research Automation: Scrape academic papers, documentation, tutorials (see the end-to-end sketch after this list)
  2. Content Aggregation: Build knowledge bases from web sources
  3. Competitive Intelligence: Monitor competitor websites and news
  4. Documentation Extraction: Pull API docs, guides, examples
  5. News Monitoring: Track industry news and developments
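
As an end-to-end sketch of the research-automation use case, the snippet below scrapes a few placeholder sources and then queries them, using only the /v1/scrape and /v1/similarity-search endpoints shown above:

import requests

BASE = "http://localhost:3002/v1"

# Placeholder sources; swap in the papers, docs, or tutorials you care about.
sources = [
    "https://docs.python.org/3/tutorial/",
    "https://example.com/some-paper",
]

# 1. Scrape each source; embeddings are generated and stored automatically.
for url in sources:
    requests.post(f"{BASE}/scrape", json={"url": url, "formats": ["markdown"]})

# 2. Ask a question across everything scraped so far.
hits = requests.post(f"{BASE}/similarity-search", json={
    "query": "how do list comprehensions work",
    "limit": 3
})
print(hits.json())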

🛡️ Privacy & Security

  • No Data Leaves Your Machine: Everything runs locally
  • No User Tracking: SearXNG doesn't store search queries
  • Open Source: All code is auditable
  • Self-Hosted: You control your data

📋 Requirements

  • Docker & Docker Compose: For containerization
  • 8GB+ RAM: For running Ollama models
  • 50GB+ Disk: For model storage
  • Internet: For initial setup and model download

🤝 Contributing

This project builds on amazing open-source work:

  • Firecrawl by Sideguide Technologies Inc. (AGPL v3)
  • SearXNG by SearXNG team (AGPL v3)
  • Our integrations are MIT licensed

See CONTRIBUTING.md for guidelines.

📄 License

  • Our Code: MIT License (MCP server, configurations, scripts)
  • Dependencies: Various (see individual project licenses)
  • Patches: AGPL v3 compatible modifications

🌟 Star This Project

If this helps you save money on AI APIs, please ⭐ star this repository!

📞 Support

Built with ❤️ for the open-source community

Making enterprise-level AI accessible to everyone, at zero cost.
