Complete self-hosted AI web scraping stack with ZERO external API costs!
This project combines powerful open-source tools to create a comprehensive web scraping and AI extraction system that runs entirely locally. No OpenAI, no Pinecone, no external APIs - just pure local AI power.
- Web Scraping: Extract content from any website
- AI Extraction: Structure data using local LLMs
- Semantic Search: Find similar content across scraped data
- Embedding Storage: Auto-generate and store vectors
- MCP Integration: Works with Claude Code and AI assistants
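Semantic search over stored embeddings generally comes down to ranking vectors by their similarity to a query vector. The sketch below illustrates that idea with cosine similarity in plain Python. The toy 3-dimensional vectors, the document ids, and the `rank_by_similarity` helper are hypothetical and not part of this project's code; in the actual stack the embeddings come from nomic-embed-text and the ranking is handled by Qdrant.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, documents, limit=5):
    """Return up to `limit` (doc_id, score) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in documents.items()]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:limit]]


# Toy embeddings (hypothetical); real ones have hundreds of dimensions.
docs = {
    "ml-tutorial": [0.9, 0.1, 0.0],
    "cooking-blog": [0.0, 0.2, 0.9],
    "python-docs": [0.6, 0.6, 0.1],
}
results = rank_by_similarity([1.0, 0.0, 0.0], docs, limit=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

With these toy vectors, `ml-tutorial` ranks first and `python-docs` second, because their vectors point most nearly in the same direction as the query.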
# 1. Clone this repository
git clone https://github.com/Maheidem/zero-cost-ai-scraper.git
cd zero-cost-ai-scraper
# 2. Copy environment template
cp .env.example .env
# 3. Start everything
docker-compose up -d
# 4. Install Ollama models (takes 15-30 minutes)
./scripts/install-models.sh
# 5. Verify setup
./scripts/health-check.sh
That's it! Your zero-cost AI scraper is ready.
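If you would rather verify the stack from Python than from the shell script, a rough equivalent is to check that each service port accepts TCP connections. This is only a sketch and not a port of `health-check.sh`: of the ports below, only 3002 (Firecrawl) appears in this README. 11434 and 6333 are the upstream defaults for Ollama and Qdrant, and 8080 for SearXNG is an assumption, so confirm the real mappings in `docker-compose.yml`.

```python
import socket

# Hypothetical port map: only 3002 is confirmed by this README.
SERVICES = {
    "firecrawl": 3002,
    "ollama": 11434,   # Ollama upstream default (assumed unchanged)
    "qdrant": 6333,    # Qdrant upstream default (assumed unchanged)
    "searxng": 8080,   # assumption; check docker-compose.yml
}


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services, host="localhost", probe=port_is_open):
    """Map each service name to True/False depending on whether its port is reachable."""
    return {name: probe(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, ok in check_services(SERVICES).items():
        print(f"{name:10s} {'up' if ok else 'DOWN'}")
```

An open port only shows that something is listening. It does not prove the models are loaded, so the project's own health check remains the better signal.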
- Firecrawl: Web scraping engine (builds from source with patches)
- Ollama: Local LLM processing (qwen3-coder:30b, nomic-embed-text)
- Qdrant: Vector database for embeddings
- SearXNG: Privacy-focused web search
- MCP Server: 6 tools for AI assistant integration
- ✅ Zero External API Costs
- ✅ Privacy-First (everything runs locally)
- ✅ MCP Compatible (works with Claude Code)
- ✅ Auto-Embeddings (builds searchable knowledge base)
- ✅ Semantic Search (find similar content)
- ✅ One-Command Setup
| Operation | Time | Cost |
|---|---|---|
| AI Extraction | 15-20s | $0.00 |
| Web Scraping | 2-5s | $0.00 |
| Embedding Generation | 1-2s | $0.00 |
| Similarity Search | <500ms | $0.00 |
| Total Monthly Cost | - | $0.00 |
Compare this to cloud solutions that cost $50-200/month!
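Because local extraction takes far longer than a scrape, it can help to size client timeouts per operation rather than using one global value. The sketch below derives timeouts from the upper bounds in the table above. The `timeout_for` helper and the safety factor of 3 are illustrative assumptions, not project settings.

```python
# Upper bounds in seconds, taken from the performance table.
TYPICAL_UPPER_BOUND_S = {
    "extract": 20.0,
    "scrape": 5.0,
    "embedding": 2.0,
    "similarity-search": 0.5,
}


def timeout_for(operation, safety_factor=3.0):
    """Suggested client timeout: the table's upper bound times a safety margin."""
    return TYPICAL_UPPER_BOUND_S[operation] * safety_factor


for op in TYPICAL_UPPER_BOUND_S:
    print(f"{op}: {timeout_for(op)}s")
```

The result can be passed straight to a client call, for example `requests.post(url, json=payload, timeout=timeout_for("extract"))`.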
┌─────────────────┐    ┌──────────────┐    ┌─────────────────┐
│  AI Assistant   │───▶│  MCP Server  │───▶│    Firecrawl    │
│  (Claude Code)  │    │  (6 tools)   │    │ (Web Scraping)  │
└─────────────────┘    └──────────────┘    └─────────────────┘
         │                    │                     │
         │             ┌──────────────┐             │
         │             │   SearXNG    │◀────────────┘
         │             │ (Web Search) │
         │             └──────────────┘
         ▼                    │
┌─────────────────┐    ┌──────────────┐
│     Ollama      │    │    Qdrant    │
│ qwen3-coder:30b │    │ (Vector DB)  │
│ nomic-embed-text│    │              │
└─────────────────┘    └──────────────┘
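To make the Ollama and Qdrant boxes more concrete, here is a sketch of how a piece of scraped text could be embedded and stored by calling the two services directly. It is not necessarily how the patched Firecrawl does it. The ports are the upstream defaults, the collection name `scraped_pages` is hypothetical, and the collection is assumed to exist already with a vector size matching the embedding model.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama default port (assumed)
QDRANT_URL = "http://localhost:6333"   # Qdrant default port (assumed)
COLLECTION = "scraped_pages"           # hypothetical collection name


def post_json(url, payload, method="POST"):
    """Send a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def embedding_request(text, model="nomic-embed-text"):
    """Request body for Ollama's /api/embeddings endpoint."""
    return {"model": model, "prompt": text}


def upsert_request(point_id, vector, url, text):
    """Request body for Qdrant's upsert-points endpoint, with the source URL as payload."""
    return {"points": [{"id": point_id, "vector": vector,
                        "payload": {"url": url, "text": text}}]}


def embed_and_store(point_id, url, text):
    """Embed `text` with Ollama, then upsert the vector into Qdrant (needs both services running)."""
    embedding = post_json(f"{OLLAMA_URL}/api/embeddings",
                          embedding_request(text))["embedding"]
    return post_json(f"{QDRANT_URL}/collections/{COLLECTION}/points",
                     upsert_request(point_id, embedding, url, text),
                     method="PUT")
```

Calling `embed_and_store(1, "https://python.org/about", "Python is a programming language...")` would store one point. The payload builders are kept separate from the network calls so they can be inspected or tested without the stack running.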
import requests

# Scrape and auto-store embeddings
response = requests.post("http://localhost:3002/v1/scrape", json={
    "url": "https://python.org/about",
    "formats": ["markdown"]
})

# Extract structured data using local AI
response = requests.post("http://localhost:3002/v1/extract", json={
    "urls": ["https://news.ycombinator.com"],
    "prompt": "Extract top stories and their scores",
    "schema": {
        "stories": {
            "type": "array",
            "items": {
                "title": "string",
                "score": "number",
                "url": "string"
            }
        }
    }
})
# Notice: "llmUsage": 0 (local processing!)

# Find similar content across all scraped data
response = requests.post("http://localhost:3002/v1/similarity-search", json={
    "query": "machine learning tutorials",
    "limit": 5
})

// Use any of the 6 MCP tools:
// - firecrawl_scrape
// - firecrawl_extract
// - firecrawl_similarity_search
// - firecrawl_search
// - firecrawl_crawl
// - firecrawl_map
// Example: Extract data from Reddit
mcp__firecrawl-local__firecrawl_extract({
  "urls": ["https://reddit.com/r/programming"],
  "prompt": "Extract trending programming topics",
  "schema": {"topics": ["string"], "engagement": "number"}
})

zero-cost-ai-scraper/
├── mcp-server/          # MCP server (our code, MIT license)
├── searxng/             # Search engine configuration
├── configs/             # Service configurations
├── scripts/             # Setup and maintenance scripts
├── patches/             # Firecrawl modification patches
├── examples/            # Usage examples
├── docker-compose.yml   # Complete stack definition
└── docs/                # Comprehensive documentation
Original Projects (we build upon):
AI Assistants:
- Claude Code (via MCP)
- Any MCP-compatible tool
- Direct API integration
- Research Automation: Scrape academic papers, documentation, tutorials
- Content Aggregation: Build knowledge bases from web sources
- Competitive Intelligence: Monitor competitor websites and news
- Documentation Extraction: Pull API docs, guides, examples
- News Monitoring: Track industry news and developments
- No Data Leaves Your Machine: Everything runs locally
- No User Tracking: SearXNG doesn't store search queries
- Open Source: All code is auditable
- Self-Hosted: You control your data
- Docker & Docker Compose: For containerization
- 8GB+ RAM: For running Ollama models
- 50GB+ Disk: For model storage
- Internet: For initial setup and model download
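Before starting the stack, the requirements above can be checked with a few lines of Python. This is a sketch under stated assumptions: the thresholds are copied from the list, the helper names are hypothetical, and the RAM check relies on `os.sysconf`, which is unavailable on Windows (the RAM check is skipped there).

```python
import os
import shutil

MIN_RAM_GB = 8    # from the requirements list
MIN_DISK_GB = 50  # from the requirements list


def find_problems(has_docker, ram_gb, free_disk_gb):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if not has_docker:
        problems.append("docker not found on PATH")
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"only {ram_gb:.1f} GB RAM, need {MIN_RAM_GB}+")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"only {free_disk_gb:.1f} GB free disk, need {MIN_DISK_GB}+")
    return problems


def total_ram_gb():
    """Total physical RAM in GiB, or None if it cannot be determined on this platform."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 1024**3
    issues = find_problems(shutil.which("docker") is not None, total_ram_gb(), free_gb)
    print("\n".join(issues) if issues else "All requirements look satisfied.")
```

The check measures free space on the current directory's filesystem. Docker may store images and model volumes elsewhere, so treat the disk result as a rough indication.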
This project builds on amazing open-source work:
- Firecrawl by Sideguide Technologies Inc. (AGPL v3)
- SearXNG by SearXNG team (AGPL v3)
- Our integrations are MIT licensed
See CONTRIBUTING.md for guidelines.
- Our Code: MIT License (MCP server, configurations, scripts)
- Dependencies: Various (see individual project licenses)
- Patches: AGPL v3 compatible modifications
If this helps you save money on AI APIs, please ⭐ star this repository!
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Documentation: See docs/ directory
Built with ❤️ for the open-source community
Making enterprise-level AI accessible to everyone, at zero cost.