Autonomous AI agent infrastructure on bare metal — local 122B inference, hybrid memory, voice pipeline, and 30+ cron jobs running 24/7 for $0.
Hard to kill, impossible to ignore.
| Metric | Value |
|---|---|
| 🖥️ Total VRAM | 224 GB across 3 GPUs (96 + 96 + 32) |
| 🧠 Main Model | Qwen 3.5 122B-A10B MoE — 131K context, zero API cost |
| 🔄 Autonomous Jobs | 30+ cron tasks running 24/7 on local inference |
| 💾 Memory Vectors | 96,000+ embeddings in Qdrant |
| 🕸️ Knowledge Graph | 240,000+ nodes in FalkorDB |
| 🔍 Search Pipeline | Vector + BM25 sparse + graph + cross-encoder reranker |
| 🗣️ Voice Pipeline | Whisper STT → LLM → Qwen3 TTS (real-time, local) |
| 🔀 Proxy Versions | 11 iterations evolved over 6 months |
| 💰 Monthly Inference | $0 for local models |
```
╔══════════════════════════════════════════════════════════════════════════════╗
║                              RASPUTIN STACK                                  ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                                                              ║
║  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐            ║
║  │ Telegram │ │ Discord  │ │  Voice   │ │ Browser  │ │Dashboard │            ║
║  │   Bot    │ │   Bot    │ │  WebRTC  │ │  Relay   │ │  (Web)   │            ║
║  └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘            ║
║       │            │            │            │            │                  ║
║       └──────────────┴──────┬───────┴──────────────┴──────────────┘          ║
║                             │                                                ║
║                ┌────────────▼────────────┐                                   ║
║                │     OpenClaw Gateway    │                                   ║
║                │  Sessions · Sub-Agents  │                                   ║
║                │  Crons · Tools · Safety │                                   ║
║                └────────────┬────────────┘                                   ║
║                             │                                                ║
║          ┌──────────────────▼──────────────────┐                             ║
║          │         cartu-proxy (v11)           │                             ║
║          │  Session Affinity · Quality Gate    │                             ║
║          │  Cost Logging · Rate Limiting       │                             ║
║          └──┬──────────┬───────────┬──────┬────┘                             ║
║             │          │           │      │                                  ║
║    ┌────────▼──┐ ┌─────▼────┐ ┌───▼───┐ ┌▼──────────┐                        ║
║    │ Local GPU │ │ Zen/Free │ │ OAuth │ │ Direct API│                        ║
║    │ Qwen 3.5  │ │ Opus 4   │ │Claude │ │ Gemini/etc│                        ║
║    │ 122B MoE  │ │ (Free)   │ │       │ │           │                        ║
║    └───────────┘ └──────────┘ └───────┘ └───────────┘                        ║
║                                                                              ║
║  ┌─────────────────── MEMORY LAYER ───────────────────────┐                  ║
║  │                                                        │                  ║
║  │  ┌───────────┐ ┌───────────┐ ┌──────┐ ┌─────────┐      │                  ║
║  │  │  Qdrant   │ │ FalkorDB  │ │ BM25 │ │Reranker │      │                  ║
║  │  │ 96K+ vecs │ │ 240K nodes│ │Sparse│ │ bge-v2  │      │                  ║
║  │  └───────────┘ └───────────┘ └──────┘ └─────────┘      │                  ║
║  └────────────────────────────────────────────────────────┘                  ║
║                                                                              ║
║  ┌────────────────── VOICE PIPELINE ──────────────────────┐                  ║
║  │  Whisper STT ──► LLM Reasoning ──► Qwen3 TTS ──► Audio │                  ║
║  └────────────────────────────────────────────────────────┘                  ║
║                                                                              ║
║  ┌───────────────── AUTONOMOUS LAYER ─────────────────────┐                  ║
║  │  30+ Cron Jobs: Fact Extraction · Memory Enrichment    │                  ║
║  │  Research Scanning · Anomaly Detection · Episode       │                  ║
║  │  Detection · Health Monitoring · Brain Cleanup         │                  ║
║  └────────────────────────────────────────────────────────┘                  ║
║                                                                              ║
╠══════════════════════════════════════════════════════════════════════════════╣
║                                 HARDWARE                                     ║
║                                                                              ║
║  GPU 0: RTX PRO 6000 Blackwell (96GB) ── Qwen 3.5 122B MoE                   ║
║  GPU 1: RTX PRO 6000 Blackwell (96GB) ── Coder 30B · Embeddings · Rerank     ║
║  GPU 2: RTX 5090 (32GB) ── TTS · Auxiliary Inference                         ║
║  CPU:   Xeon w9-3495X (56C/112T) · 251GB DDR5 · Arch Linux                   ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝
```
| Directory | Description | Files |
|---|---|---|
| proxy/ | LLM routing proxy — 11 versions, multi-provider failover, streaming, cost tracking | 15 |
| dashboard/ | Full web dashboard — sessions, playground, council, memory heatmap, cost forecasting | 224 |
| ui/ | React 19 frontend — 272+ components, shadcn/ui, Monaco editor, 7-language i18n | 263 |
| backend/ | Express API — JWT RBAC, PostgreSQL, 30+ routes, WebSocket streaming | 132 |
| voice/ | Voice pipeline — Qwen3 TTS server, Whisper STT, WebRTC, voice cloning | 127 |
| memory/ | Hybrid memory — Qdrant vectors, BM25 sparse, FalkorDB graph, reranker | 8 |
| tools/ | Agent tools — AI Council, browser automation, RAG, memory ops, benchmarks | 16 |
| crons/ | Autonomous cron jobs — fact extraction, enrichment, research, anomaly detection | — |
| cli/ | CLI interface — chat, search, session management, consensus, verification | 24 |
| browser/ | Chrome extension — content injection, Manifest V3, message routing | 4 |
| council/ | Multi-model debate — structured consensus, swarm protocol | 3 |
| selfplay/ | Self-play pipeline — task generation, solving, evaluation | 4 |
| research/ | Research tools — AI model scanner, YouTube monitoring, multi-engine search | 4 |
| monitoring/ | Infrastructure monitoring — anomaly detection, health checks, forecasting | 4 |
| method/ | Compaction research — academic paper, benchmark suite, triad framework | 26 |
| doctor/ | Diagnostics — system health checks, alerting | 3 |
| desktop/ | Electron wrapper — desktop application | 6 |
| config/ | Configuration templates and examples | — |
| docs/ | Documentation — architecture, memory design, deployment, API reference | 9 |
Multi-provider routing proxy with 5-tier failover chain:
Local Qwen 122B ($0) → Zen/Free Opus → OAuth Claude → Per-Token APIs → Direct Anthropic
- Session affinity and quality gating
- Adaptive thinking budget management
- SSE streaming with tool-calling normalization
- Per-provider rate limiting and cost logging
- 11 versions documenting the full evolution
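The failover chain above can be sketched in a few lines. This is an illustrative reduction, not the actual `proxy_v11.py` logic: the provider names mirror the tiers listed, but the call interface and error handling are assumptions.

```python
# Minimal sketch of a priority-ordered failover chain: try each provider in
# turn, fall through on any failure. Names mirror the tiers above; the
# interface is a simplifying assumption, not the real proxy's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Provider:
    name: str
    cost_per_mtok: float          # USD per million tokens (0 for local/free tiers)
    call: Callable[[str], str]    # prompt -> completion, raises on failure

def route(prompt: str, chain: list[Provider]) -> tuple[str, str]:
    """Try each provider in priority order; return (provider_name, reply)."""
    errors = []
    for p in chain:
        try:
            return p.name, p.call(prompt)
        except Exception as e:    # timeout, rate limit, quality-gate reject...
            errors.append(f"{p.name}: {e}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

# Toy chain: the local tier errors out, so the request falls to the free tier.
def flaky(prompt: str) -> str: raise TimeoutError("GPU busy")
def ok(prompt: str) -> str: return f"echo: {prompt}"

chain = [
    Provider("local-qwen-122b", 0.0, flaky),
    Provider("zen-free-opus", 0.0, ok),
    Provider("oauth-claude", 0.0, ok),
]
name, reply = route("hello", chain)   # served by zen-free-opus
```

The key property is that ordering the chain by cost makes the cheapest healthy tier win automatically; the real proxy layers session affinity and quality gating on top of this skeleton.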
Four-layer retrieval pipeline:
- Dense vectors — nomic-embed-text via Ollama → Qdrant (96K+ vectors)
- Sparse vectors — BM25 for keyword precision
- Knowledge graph — FalkorDB (240K+ nodes) for relationship traversal
- Cross-encoder reranker — bge-reranker-v2-m3 for final ranking
Multi-angle query expansion generates 5+ search queries per request. Sub-500ms end-to-end.
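To show the shape of the hybrid step, here is a model-free merge of the retrieval channels using reciprocal rank fusion. The real pipeline ranks the fused candidates with bge-reranker-v2-m3; RRF stands in here only so the sketch stays self-contained, and the document IDs are made up.

```python
# Reciprocal rank fusion (RRF) over the ranked lists returned by each channel:
# score(doc) = sum over lists of 1 / (k + rank). A stand-in for the real
# cross-encoder rerank stage, with invented document IDs.
from collections import defaultdict

def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists into one, best-first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["m42", "m07", "m19"]   # Qdrant dense hits
sparse = ["m07", "m42", "m88"]   # BM25 keyword hits
graph  = ["m19", "m07"]          # FalkorDB neighborhood hits

fused = rrf_merge([dense, sparse, graph])   # m07 wins: it appears in all three
```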
Audio In → Whisper STT → LLM Reasoning → Qwen3 TTS → Audio Out
- OpenAI-compatible TTS API with multiple backends (PyTorch, OpenVINO, vLLM)
- Voice cloning and design
- WebRTC for real-time communication
- Streaming audio generation
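Because the TTS server speaks an OpenAI-compatible API, any client can target it by swapping the base URL. The sketch below only builds the request; the port, model ID, and voice name are placeholders, not the project's actual configuration.

```python
# Build a request for the OpenAI-style speech endpoint (/v1/audio/speech).
# Port, model id, and voice are illustrative assumptions.
import json

def build_tts_request(text: str, voice: str = "default",
                      base_url: str = "http://localhost:8880") -> tuple[str, bytes]:
    """Return (url, json_body) for an OpenAI-compatible TTS call."""
    url = f"{base_url}/v1/audio/speech"
    body = json.dumps({
        "model": "qwen3-tts",        # placeholder model id
        "input": text,
        "voice": voice,
        "response_format": "wav",    # raw audio out, ready for WebRTC framing
    }).encode()
    return url, body

url, body = build_tts_request("Hard to kill, impossible to ignore.")
# POST `body` to `url` with any HTTP client (Content-Type: application/json)
# once the server is running; the response body is the audio stream.
```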
30+ scheduled jobs running on $0 local inference:
- Fact extraction — Mines conversations for persistent facts
- Memory enrichment — Cross-references and links related memories
- Episode detection — Identifies narrative arcs across sessions
- Research scanning — Monitors AI frontier developments
- Anomaly detection — Flags metric deviations from day-of-week baselines
- Brain cleanup — Deduplicates and maintains memory health
- Health monitoring — Infrastructure and service health checks
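The anomaly-detection cron's core check, comparing a metric against its day-of-week baseline, can be sketched as a z-score test. The threshold and sample data here are illustrative assumptions, not the job's actual tuning.

```python
# Flag a metric as anomalous when it deviates from the same-weekday baseline
# by more than z_max standard deviations. Threshold is an assumption.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, z_max: float = 3.0) -> bool:
    """history = same-weekday samples (e.g. the last several Mondays)."""
    if len(history) < 2:
        return False                  # not enough data to form a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_max

mondays = [120.0, 118.0, 122.0, 119.0, 121.0]   # e.g. requests/hour on Mondays
normal  = is_anomalous(mondays, 123.0)           # False: within Monday's range
spike   = is_anomalous(mondays, 400.0)           # True: ~3x the usual traffic
```

Keying the baseline to the weekday avoids false alarms from weekly rhythms, e.g. quiet weekends versus busy Mondays.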
Full-featured web UI for monitoring and control:
- Real-time session viewer with WebSocket streaming
- Multi-model playground with side-by-side comparison
- AI Council — multi-model debate engine
- Memory heatmap visualization
- Cost tracking and forecasting
- Loop detection and anomaly alerting
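One simple way to implement the loop-detection alert above is to count near-identical messages in a sliding window. The normalization, window size, and repeat threshold here are illustrative assumptions, not the dashboard's actual heuristic.

```python
# Flag a session as looping when a normalized message repeats within the
# recent window. Window/threshold values are illustrative assumptions.
from collections import Counter

def detect_loop(messages: list[str], window: int = 10, repeats: int = 3) -> bool:
    """True if any message (case/whitespace-normalized) repeats `repeats`+ times."""
    recent = [" ".join(m.lower().split()) for m in messages[-window:]]
    return any(n >= repeats for n in Counter(recent).values())

session = ["checking logs...", "Retrying fetch", "retrying  fetch",
           "Retrying fetch", "done?"]
looping = detect_loop(session)   # True: "retrying fetch" appears 3x normalized
```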
| Layer | Technology |
|---|---|
| Frontend | Next.js 14, React 19, TypeScript, shadcn/ui, Tailwind CSS, Framer Motion |
| Backend | Node.js, Express, PostgreSQL, JWT RBAC, WebSocket |
| Inference | Ollama, llama.cpp — Qwen 3.5 122B MoE, Qwen3 Coder 30B |
| Memory | Qdrant (dense + sparse), FalkorDB (graph), BM25, bge-reranker-v2-m3 |
| Voice | Qwen3-TTS, faster-whisper, WebRTC, Pipecat |
| Proxy | Python / aiohttp, SSE streaming, multi-provider routing |
| Search | Hybrid: dense vectors + BM25 sparse + graph traversal + reranker |
| Infra | PM2, Docker, systemd, Arch Linux, NVMe SSD |
| Component | Spec | Role |
|---|---|---|
| GPU 0 | NVIDIA RTX PRO 6000 Blackwell (96 GB) | Qwen 3.5 122B-A10B MoE inference |
| GPU 1 | NVIDIA RTX PRO 6000 Blackwell (96 GB) | Qwen Coder 30B, embeddings, reranker |
| GPU 2 | NVIDIA RTX 5090 (32 GB) | TTS, auxiliary inference |
| CPU | Intel Xeon w9-3495X (56 cores / 112 threads) | Services, embeddings, orchestration |
| RAM | 251 GB DDR5 | — |
| OS | Arch Linux | — |
| Total VRAM | 224 GB | — |
```bash
# Start all services
pm2 start ecosystem.config.js

# Individual components
cd proxy && python proxy_v11.py                  # LLM routing proxy
cd backend && node src/index.js                  # API server
cd dashboard && node server.js                   # Web dashboard
cd voice/qwen3-tts-server && python -m api.main  # TTS server
```

- Local-first — 122B MoE model on bare metal, $0/month inference cost
- Multi-provider failover — Flat-rate → free tier → per-token, automatic
- Hybrid memory — Dense + sparse + graph + reranker, every query
- Autonomous by default — 30+ crons handle maintenance, research, and enrichment without human input
- Observable — Full cost logging, session audit trails, anomaly detection
MIT — See LICENSE