Zero-Cost AI Web Scraper

🚀 Complete self-hosted AI web scraping stack with ZERO external API costs!

This project combines powerful open-source tools to create a comprehensive web scraping and AI extraction system that runs entirely locally. No OpenAI, no Pinecone, no external APIs - just pure local AI power.

🎯 What This Does

  • Web Scraping: Extract content from any website
  • AI Extraction: Structure data using local LLMs
  • Semantic Search: Find similar content across scraped data
  • Embedding Storage: Auto-generate and store vectors
  • MCP Integration: Works with Claude Code and AI assistants

⚡ Quick Start

# 1. Clone this repository
git clone https://github.com/Maheidem/zero-cost-ai-scraper.git
cd zero-cost-ai-scraper

# 2. Copy environment template
cp .env.example .env

# 3. Start everything
docker-compose up -d

# 4. Install Ollama models (takes 15-30 minutes)
./scripts/install-models.sh

# 5. Verify setup
./scripts/health-check.sh

That's it! Your zero-cost AI scraper is ready.
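If you want to verify the services by hand rather than relying on the health-check script, a minimal sketch like the following works, assuming the stack exposes the default ports (Ollama on 11434, Qdrant on 6333, Firecrawl on 3002 as in the examples below):

import requests

# Assumed default ports; adjust if your docker-compose.yml maps different ones.
checks = {
    "Ollama": "http://localhost:11434/api/tags",    # lists installed models
    "Qdrant": "http://localhost:6333/collections",  # lists vector collections
    "Firecrawl": "http://localhost:3002",           # scraping API root
}

for name, url in checks.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.ConnectionError:
        print(f"{name}: not reachable at {url}")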

🛠️ What's Included

Core Services

  • Firecrawl: Web scraping engine (builds from source with patches)
  • Ollama: Local LLM processing (qwen3-coder:30b, nomic-embed-text)
  • Qdrant: Vector database for embeddings
  • SearXNG: Privacy-focused web search
  • MCP Server: 6 tools for AI assistant integration

Key Features

  • ✅ Zero External API Costs
  • ✅ Privacy-First (everything runs locally)
  • ✅ MCP Compatible (works with Claude Code)
  • ✅ Auto-Embeddings (builds searchable knowledge base)
  • ✅ Semantic Search (find similar content)
  • ✅ One-Command Setup

📊 Performance

| Operation            | Time   | Cost  |
|----------------------|--------|-------|
| AI Extraction        | 15-20s | $0.00 |
| Web Scraping         | 2-5s   | $0.00 |
| Embedding Generation | 1-2s   | $0.00 |
| Similarity Search    | <500ms | $0.00 |
| Total Monthly Cost   | -      | $0.00 |

Compare this to cloud solutions that cost $50-200/month!
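Your numbers will depend on hardware (the 30B extraction model is the main factor). A rough way to reproduce the scraping timing yourself, assuming the default Firecrawl port used in the examples below:

import time
import requests

# Any URL works here; https://example.com is just a placeholder.
start = time.perf_counter()
response = requests.post("http://localhost:3002/v1/scrape", json={
    "url": "https://example.com",
    "formats": ["markdown"]
})
print(f"Scrape took {time.perf_counter() - start:.1f}s (HTTP {response.status_code})")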

🔧 Architecture

┌─────────────────┐    ┌──────────────┐    ┌─────────────────┐
│   AI Assistant  │───▶│  MCP Server  │───▶│   Firecrawl     │
│   (Claude Code) │    │  (6 tools)   │    │ (Web Scraping)  │
└─────────────────┘    └──────────────┘    └─────────────────┘
         │                     │                    │
         │             ┌──────────────┐             │
         │             │   SearXNG    │◀────────────┘
         │             │ (Web Search) │
         │             └──────────────┘
         ▼                     │
┌─────────────────┐    ┌──────────────┐
│     Ollama      │    │    Qdrant    │
│ qwen3-coder:30b │    │ (Vector DB)  │
│ nomic-embed-text│    │              │
└─────────────────┘    └──────────────┘

🎮 Usage Examples

Web Scraping with Auto-Embedding

import requests

# Scrape and auto-store embeddings
response = requests.post("http://localhost:3002/v1/scrape", json={
    "url": "https://python.org/about",
    "formats": ["markdown"]
})
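Assuming the response follows Firecrawl's usual shape (a data object containing the requested formats), the scraped markdown can then be read like this:

# Assumed response shape: {"success": ..., "data": {"markdown": "..."}}
result = response.json()
markdown = result.get("data", {}).get("markdown", "")
print(markdown[:500])  # preview the first 500 characters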

AI Extraction (Zero Cost)

# Extract structured data using local AI
response = requests.post("http://localhost:3002/v1/extract", json={
    "urls": ["https://news.ycombinator.com"],
    "prompt": "Extract top stories and their scores",
    "schema": {
        "stories": {
            "type": "array",
            "items": {
                "title": "string",
                "score": "number",
                "url": "string"
            }
        }
    }
})

# Notice: "llmUsage": 0 (local processing!)
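To work with the extracted data, something like the following should do; the exact nesting can differ between Firecrawl builds, so treat the "data" key as an assumption:

# Assumes the structured output is returned under "data", matching the schema above.
result = response.json()
for story in result.get("data", {}).get("stories", []):
    print(story.get("score"), story.get("title"))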

Semantic Search

# Find similar content across all scraped data
response = requests.post("http://localhost:3002/v1/similarity-search", json={
    "query": "machine learning tutorials",
    "limit": 5
})
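Since /v1/similarity-search is added by this project's patches, the response shape may vary; the "results", "url", and "score" fields below are illustrative assumptions:

# Hypothetical result fields; check the actual response for your build.
for hit in response.json().get("results", []):
    print(f'{hit.get("score", 0):.3f}  {hit.get("url", "")}')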

MCP Integration (Claude Code)

// Use any of the 6 MCP tools:
// - firecrawl_scrape
// - firecrawl_extract
// - firecrawl_similarity_search
// - firecrawl_search
// - firecrawl_crawl
// - firecrawl_map

// Example: Extract data from Reddit
mcp__firecrawl-local__firecrawl_extract({
  "urls": ["https://reddit.com/r/programming"],
  "prompt": "Extract trending programming topics",
  "schema": {"topics": ["string"], "engagement": "number"}
})

๐Ÿ“ Project Structure

zero-cost-ai-scraper/
โ”œโ”€โ”€ mcp-server/              # MCP server (our code, MIT license)
โ”œโ”€โ”€ searxng/                 # Search engine configuration
โ”œโ”€โ”€ configs/                 # Service configurations
โ”œโ”€โ”€ scripts/                 # Setup and maintenance scripts
โ”œโ”€โ”€ patches/                 # Firecrawl modification patches
โ”œโ”€โ”€ examples/                # Usage examples
โ”œโ”€โ”€ docker-compose.yml       # Complete stack definition
โ””โ”€โ”€ docs/                    # Comprehensive documentation

🔗 Works With

  • Original Projects (we build upon):

    • Firecrawl (web scraping engine)
    • SearXNG (privacy-focused search)
    • Ollama (local LLM runtime)
    • Qdrant (vector database)

  • AI Assistants:

    • Claude Code (via MCP)
    • Any MCP-compatible tool
    • Direct API integration

💡 Use Cases

  1. Research Automation: Scrape academic papers, documentation, tutorials (see the end-to-end sketch after this list)
  2. Content Aggregation: Build knowledge bases from web sources
  3. Competitive Intelligence: Monitor competitor websites and news
  4. Documentation Extraction: Pull API docs, guides, examples
  5. News Monitoring: Track industry news and developments
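
As an end-to-end sketch of the research-automation use case, the snippet below scrapes a few placeholder sources and then queries them, using only the /v1/scrape and /v1/similarity-search endpoints shown above:

import requests

BASE = "http://localhost:3002/v1"

# Placeholder sources; swap in the papers, docs, or tutorials you care about.
sources = [
    "https://docs.python.org/3/tutorial/",
    "https://example.com/some-paper",
]

# 1. Scrape each source; embeddings are generated and stored automatically.
for url in sources:
    requests.post(f"{BASE}/scrape", json={"url": url, "formats": ["markdown"]})

# 2. Ask a question across everything scraped so far.
hits = requests.post(f"{BASE}/similarity-search", json={
    "query": "how do list comprehensions work",
    "limit": 3
})
print(hits.json())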

🛡️ Privacy & Security

  • No Data Leaves Your Machine: Everything runs locally
  • No User Tracking: SearXNG doesn't store search queries
  • Open Source: All code is auditable
  • Self-Hosted: You control your data

📋 Requirements

  • Docker & Docker Compose: For containerization
  • 8GB+ RAM: For running Ollama models
  • 50GB+ Disk: For model storage
  • Internet: For initial setup and model download

🤝 Contributing

This project builds on amazing open-source work:

  • Firecrawl by Sideguide Technologies Inc. (AGPL v3)
  • SearXNG by SearXNG team (AGPL v3)
  • Our integrations are MIT licensed

See CONTRIBUTING.md for guidelines.

📄 License

  • Our Code: MIT License (MCP server, configurations, scripts)
  • Dependencies: Various (see individual project licenses)
  • Patches: AGPL v3 compatible modifications

🌟 Star This Project

If this helps you save money on AI APIs, please ⭐ star this repository!

📞 Support

Built with ❤️ for the open-source community

Making enterprise-level AI accessible to everyone, at zero cost.
