```
     ██╗ █████╗ ██████╗ ██╗   ██╗██╗███████╗
     ██║██╔══██╗██╔══██╗██║   ██║██║██╔════╝
     ██║███████║██████╔╝██║   ██║██║███████╗
██   ██║██╔══██║██╔══██╗╚██╗ ██╔╝██║╚════██║
╚█████╔╝██║  ██║██║  ██║ ╚████╔╝ ██║███████║
 ╚════╝ ╚═╝  ╚═╝╚═╝  ╚═╝  ╚═══╝  ╚═╝╚══════╝
```
Just A Rather Very Intelligent System
A local-first, multi-agent AI desktop assistant with persistent memory,
runtime self-synthesis, neural voice synthesis, biometric authentication, and a British butler attitude.
JARVIS is a personal AI assistant built to run on your own machine, sound like the one from the films, and genuinely remember who you are. It is not a chatbot wrapper. It is a complete runtime: a multi-agent orchestration system sitting on top of an async event bus, with four tiers of persistent memory, runtime self-synthesis for missing capabilities, neural TTS, biometric authentication, gesture control, and a React UI served through pywebview.
When you say something, the Orchestrator decomposes your intent into a structured task plan and dispatches work to six specialist agents concurrently, with dependency resolution. Complex requests are pre-processed through a reasoning layer. A background proactive agent monitors your calendar and open knowledge threads, surfacing things without being asked. And a four-tier memory system means JARVIS knows your preferences, recalls past conversations semantically, and tracks the people and projects you mention across every session.
```
you:    "open spotify, set an alarm for 7am, and what's the weather in London"

jarvis: [Orchestrator plans 3 parallel tasks]
        [MediaAgent    → spotify_play]
        [PersonalAgent → set_alarm → persisted to SQLite]
        [InfoAgent     → check_weather → OpenWeather API]

        "Spotify is open. Alarm set for 7 AM; I'll make sure you're conscious.
         London: overcast, 11°C. You'll want a jacket.
         Though I suspect you'll ignore that."
```
- Architecture
- Agent Layer
- Self-Synthesis
- Memory System
- Features
- Installation
- Configuration
- Running JARVIS
- CLI Interface
- Project Structure
- Extending JARVIS
- Voice & Personality
- Biometric Setup
- Testing
- License
## Architecture

Every component communicates through a single central async event bus. Nothing calls anything directly. The UI bridge, agents, skills, services, and memory system are fully decoupled: they publish and subscribe to named events.
```
┌────────────────────────────────────────────────────────────────┐
│                         React Frontend                         │
│             Dashboard Mode  │  Pill Mode (overlay)             │
└──────────────────────────┬─────────────────────────────────────┘
                           │  PyWebView JS Bridge
┌──────────────────────────┴─────────────────────────────────────┐
│                        Async Event Bus                         │
│                    (central nervous system)                    │
└──┬──────────┬──────────────────┬───────────────────────────────┘
   │          │                  │
   │   ┌──────┴───────────┐      │
   │   │   Orchestrator   │      │
   │   │  ──────────────  │      │
   │   │   LLM planning   │      │
   │   │   Task graph     │      │
   │   │   Synthesis      │      │
   │   └──────┬───────────┘      │
   │          │ TaskQueue        │
   │   ┌──────┴──────────────────┴───────────────────┐
   │   │             Specialist Agents               │
   │   │    InfoAgent · SystemAgent · MediaAgent     │
   │   │  CommsAgent · BrowserAgent · PersonalAgent  │
   │   └───────────────────┬─────────────────────────┘
   │                       │ bus.emit(skill events)
   │   ┌───────────────────┴─────────────────────────┐
   │   │                 Skill Layer                 │
   │   │ Weather · News · Email · WhatsApp · Calendar│
   │   │ Spotify · Browser · System · Volume · Alarm…│
   │   └─────────────────────────────────────────────┘
   │
┌──┴─────────────────────────────────────────────────────────────┐
│                            Services                            │
│      TTS (Kokoro) · STT · Wake Word · Biometrics · Gesture     │
└────────────────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│                         Memory System                          │
│     Working · Episodic (ChromaDB) · Semantic · Entity Graph    │
│          + ProactiveAgent (background, 5-min cycle)            │
└────────────────────────────────────────────────────────────────┘
```
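The publish/subscribe contract at the heart of this design can be sketched in a few lines. This is a hypothetical simplification for illustration; the real `core/event_bus.py` is richer, but the contract is the same: components subscribe to named events and emit payloads, never calling each other directly.

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Minimal async pub/sub: components never call each other directly."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._subscribers[event_name].append(handler)

    async def emit(self, event_name, payload=None):
        # Run every handler registered for this event concurrently.
        handlers = self._subscribers.get(event_name, [])
        await asyncio.gather(*(h(payload) for h in handlers))

# Usage: a skill subscribes, an agent emits.
async def main():
    bus = EventBus()
    spoken = []

    async def tts_handler(text):
        spoken.append(text)

    bus.subscribe("tts_speak", tts_handler)
    await bus.emit("tts_speak", "Good morning, sir.")
    return spoken

print(asyncio.run(main()))  # → ['Good morning, sir.']
```

Because subscribers are plain coroutines, swapping the UI bridge or a skill in or out never touches the caller's code.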
## Agent Layer

Phase 4 introduced a complete multi-agent system that replaces the flat `process_user_input → LLMSkill` path for all non-trivial requests.
### Orchestrator

The entry point for all user input. It uses the LLM to produce a structured JSON task plan: a dependency graph of work units, each assigned to a specialist agent. Independent tasks run in parallel via the TaskQueue. When multiple results need combining, a synthesis pass merges them into one natural response enriched with memory context.
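For illustration only, a plan for the compound command in the demo above might look like the following. The field names (`id`, `agent`, `action`, `depends_on`) are hypothetical; the actual schema is defined by the Orchestrator's prompt.

```python
# A hypothetical task plan for:
# "open spotify, set an alarm for 7am, and what's the weather in London"
plan = {
    "tasks": [
        {"id": "t1", "agent": "MediaAgent",    "action": "spotify_play",
         "params": {},                          "depends_on": []},
        {"id": "t2", "agent": "PersonalAgent", "action": "set_alarm",
         "params": {"time": "07:00"},           "depends_on": []},
        {"id": "t3", "agent": "InfoAgent",     "action": "check_weather",
         "params": {"city": "London"},          "depends_on": []},
    ],
}

# No task depends on another, so the TaskQueue can dispatch all three
# concurrently; a synthesis pass then merges the results into one reply.
ready = [t["id"] for t in plan["tasks"] if not t["depends_on"]]
print(ready)  # → ['t1', 't2', 't3']
```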
### Specialist Agents

| Agent | Handles |
|---|---|
| InfoAgent | Weather, news headlines, time, general knowledge |
| SystemAgent | App launch/close, volume, brightness, screenshots, file search, jokes |
| MediaAgent | Spotify play/pause/skip/search, YouTube, video downloads |
| CommsAgent | Email send/read, WhatsApp, Google Calendar, Google Meet |
| BrowserAgent | URL navigation, web search, tab management |
| PersonalAgent | Alarms, reminders, biometric login/register, gesture control |
Each agent extends `BaseAgent` (`interfaces/agent.py`), has its own system prompt and optional LLM access, and returns a typed `{status, result, speech}` dict. Agents emit events onto the bus; they never call skills directly.
### Reasoning Layer

For complex inputs (multi-step requests, ambiguous queries, anything over ~12 words or containing compound signals like "and then", "schedule", "help me") a pre-response reasoning pass runs first. It produces an internal scratchpad that informs the final response without ever being spoken, improving quality precisely where it matters most.
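A gate of this shape can be sketched as follows. The thresholds and signal list mirror the description above for illustration; they are not the actual `core/reasoning.py` implementation.

```python
COMPOUND_SIGNALS = ("and then", "schedule", "help me")  # illustrative subset

def needs_reasoning(text: str) -> bool:
    """Hypothetical gate: route long or compound inputs through the
    pre-response reasoning pass; let simple ones skip it."""
    lowered = text.lower()
    if len(lowered.split()) > 12:          # long request → reason first
        return True
    return any(signal in lowered for signal in COMPOUND_SIGNALS)

print(needs_reasoning("what's the weather"))                     # → False
print(needs_reasoning("check my calendar and then email Alex"))  # → True
```

The point of a cheap heuristic here is latency: simple queries skip a full extra LLM round-trip.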
### TaskQueue

Executes task plans with dependency resolution and parallelism. Independent tasks run concurrently via `asyncio.gather`; tasks that depend on others wait, and failed dependencies propagate cleanly without deadlock.
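A minimal sketch of that execution model, assuming a `tasks` mapping of id → (dependencies, payload). This is illustrative; the real `core/task_queue.py` will differ in detail.

```python
import asyncio

async def run_plan(tasks, execute):
    """Run independent tasks concurrently; dependent tasks wait on their
    dependencies and fail fast if a dependency failed."""
    results = {}
    done = {tid: asyncio.Event() for tid in tasks}

    async def run_one(tid):
        deps, payload = tasks[tid]
        for dep in deps:                       # wait for dependencies
            await done[dep].wait()
        if any(results[d].get("status") == "error" for d in deps):
            # Propagate failure without running the dependent task.
            results[tid] = {"status": "error", "reason": "dependency failed"}
        else:
            results[tid] = await execute(payload)
        done[tid].set()                        # unblock dependants

    await asyncio.gather(*(run_one(tid) for tid in tasks))
    return results

# Usage: t1 and t2 run in parallel; t3 waits for t1.
async def fake_execute(payload):
    return {"status": "ok", "result": payload}

tasks = {"t1": ([], "weather"), "t2": ([], "alarm"), "t3": (["t1"], "summary")}
print(asyncio.run(run_plan(tasks, fake_execute))["t3"]["status"])  # → ok
```

`asyncio.Event` per task is what prevents deadlock: every task always sets its event, even on failure, so waiters are never stranded.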
### ProactiveAgent

A background coroutine on a 5-minute timer. It checks for upcoming calendar events and surfaces open entity threads ("You mentioned Alex was fixing the gesture bug. Any update?") without being asked.
## Self-Synthesis

Phase 4.5 adds a dedicated SynthesisAgent that can research, generate, validate, and hot-load a new skill when JARVIS identifies a genuine capability gap. All generated code is constrained to `jarvis-generated-code/`, validated by an AST scanner plus an isolated subprocess sandbox, and only activated after an explicit user confirmation step.
The generated-skill runtime is supported by:

- `core/code_sandbox.py` for static and runtime safety validation
- `core/skill_loader.py` for the hardcoded generated directory, hot-loading, and SQLite registry persistence
- `agents/synthesis_agent.py` for the research → feasibility → codegen → validation → confirmation pipeline
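The static half of that validation can be illustrated with Python's `ast` module. The denylist below is hypothetical; the actual rules in `core/code_sandbox.py` may differ.

```python
import ast

BANNED_CALLS = {"eval", "exec", "__import__"}       # illustrative denylist
BANNED_MODULES = {"subprocess", "socket", "ctypes"}

def static_scan(source: str) -> list[str]:
    """Sketch of a static pass: walk the AST of generated code and flag
    dangerous imports and calls before anything is loaded or run."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            names = ([a.name for a in node.names] if isinstance(node, ast.Import)
                     else [node.module or ""])
            violations += [n for n in names if n.split(".")[0] in BANNED_MODULES]
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BANNED_CALLS:
                violations.append(node.func.id)
    return violations

print(static_scan("import subprocess\neval('1+1')"))  # → ['subprocess', 'eval']
```

A static scan alone is bypassable, which is why the pipeline pairs it with an isolated subprocess sandbox and a final user confirmation.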
## Memory System

Four tiers. All persistent. All integrated into every LLM call.
```
┌────────────────────────────────────────────────────────────┐
│  WORKING MEMORY          (current session)                 │
│  Up to 40 annotated exchanges. Tracks intent, emotional    │
│  tone, active entities, and current task/goal state.       │
├────────────────────────────────────────────────────────────┤
│  EPISODIC MEMORY         (past sessions · ChromaDB)        │
│  Semantic vector embeddings of session summaries via       │
│  sentence-transformers. Retrieved by relevance to the      │
│  current query, not recency. Knows what you talked         │
│  about last week.                                          │
├────────────────────────────────────────────────────────────┤
│  SEMANTIC MEMORY         (user profile · SQLite)           │
│  Persistent facts: preferences, habits, work context,      │
│  personal details. "prefers brief answers", "works in      │
│  AI", "based in Kuwait", "hates mornings".                 │
├────────────────────────────────────────────────────────────┤
│  ENTITY GRAPH            (knowledge graph · SQLite)        │
│  Named entities tracked across all conversations:          │
│  people, projects, places, organisations, concepts.        │
│  Stores facts, relationships, aliases, open questions.     │
│  "Alex works on Jarvis project. Fixed gesture bug Tues."   │
└────────────────────────────────────────────────────────────┘
```
Before every response: user profile + semantically relevant past episodes + entity context + session state are injected into the system prompt. After every exchange: entity extraction runs in the background. On session close: the full conversation is summarised by the LLM and saved to ChromaDB for future retrieval.
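That per-response injection step can be sketched as simple prompt assembly. The function and section labels below are hypothetical, not the actual `MemoryManager` API; the point is that each tier contributes one labelled section of the system prompt.

```python
def build_context(profile, episodes, entities, session_state):
    """Hypothetical sketch: fold the four memory tiers into one
    system-prompt fragment before every LLM call."""
    sections = [
        "## User profile\n" + "\n".join(f"- {fact}" for fact in profile),
        "## Relevant past sessions\n" + "\n".join(f"- {e}" for e in episodes),
        "## Known entities\n" + "\n".join(f"- {e}" for e in entities),
        "## Session state\n" + session_state,
    ]
    return "\n\n".join(sections)

context = build_context(
    profile=["prefers brief answers", "based in Kuwait"],
    episodes=["Discussed the gesture bug with Alex last week"],
    entities=["Alex: works on the Jarvis project"],
    session_state="current goal: fix gesture recognition",
)
print("prefers brief answers" in context)  # → True
```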
## Features

### Intelligence

| Capability | Detail |
|---|---|
| Multi-agent orchestration | LLM-planned task graphs, parallel execution, dependency resolution |
| Multi-provider LLM | Groq (fast, cloud) → local LLaMA 3.2 3B Q4_K_M (offline fallback) |
| Reasoning layer | Pre-response scratchpad for complex/multi-step inputs |
| Compound commands | Three simultaneous intents execute in parallel |
| Self-synthesis agent | Researches, validates, and hot-loads new BaseSkill modules into jarvis-generated-code/ |
| Capability manifest | LLM knows exactly what it can and cannot do; offers alternatives |
| Context summarisation | Auto-summarises history when approaching context window limit |
### Voice

| Capability | Detail |
|---|---|
| Kokoro ONNX TTS | `bm_george`: British male, fully offline, ~300ms CPU latency |
| pyttsx3 fallback | Auto-activates if Kokoro models are absent |
| Wake word | Always-on openWakeWord ("hey jarvis") + speech_recognition fallback |
| STT | Google Speech Recognition, ambient noise adaptation, 1s pause threshold |
| Voice switching | Change TTS voice at runtime by voice command |
### Biometrics & Gesture

| Capability | Detail |
|---|---|
| Face auth | face_recognition (dlib) for identity verification |
| LBPH fallback | OpenCV LBPH when face_recognition is unavailable |
| Gesture control | MediaPipe hand pose → pinch, fist, V-gesture map to system actions |
| Registration | One-time `python scripts/register_face.py` → 50 samples, auto-trains |
### Skills & Integrations

| Capability | Detail |
|---|---|
| Browser | Full Selenium: open URLs, search, new/close/next/prev tab |
| Spotify | Real Spotify Web API β play, pause, skip, search, volume |
| YouTube | Search and auto-play first result via Selenium |
| Video download | yt-dlp with ~/Downloads target |
| Email | Send via Gmail SMTP, read unread via IMAP |
| WhatsApp | pywhatkit + contacts.csv fuzzy name lookup |
| Google Calendar | Read events, create events, schedule Google Meet |
| System | Launch/close apps, volume (pycaw), brightness (WMI), screenshot, file search |
| Alarms | SQLite-persisted, survive restart, fire via TTS |
| Proactive | Calendar reminders, open entity threads surfaced automatically |
### UI

| Capability | Detail |
|---|---|
| Dashboard Mode | Full interface: chat, avatar panel, live system stats |
| Pill Mode | 280×60px overlay, always on top, mode-switchable |
| Live system stats | Real CPU / RAM / network via psutil, pushed every 3 seconds |
| ResonanceCore | Animated visual driven by AI state: idle / thinking / speaking |
| Framer Motion | Animated transitions throughout |
## Installation

### Prerequisites

- Python 3.11+
- Node.js 18+ (only if rebuilding the frontend; pre-built `dist/` is included)
- Chrome (for browser control, YouTube, WhatsApp features)
- Microphone (for voice input)
### Setup

```bash
git clone https://github.com/your-username/jarvis-main.git
cd jarvis-main
pip install -r requirements.txt
```

Required for the entity memory NER pre-pass:

```bash
python -m spacy download en_core_web_sm
```

Local LLM (~2GB, used as offline fallback):

```bash
python scripts/download_model.py
```

Kokoro TTS voice (~90MB):

```bash
python scripts/download_kokoro.py
```

```bash
cp config/.env.example config/.env
# Edit config/.env and add your API keys
```

```bash
python scripts/migrate_v2_to_v3.py
```

```bash
python scripts/register_face.py
```

Look at the camera for ~30 seconds. The script automatically selects face_recognition if installed, LBPH otherwise.

Only needed if rebuilding the frontend:

```bash
cd frontend && npm install && npm run build && cd ..
```

## Configuration

All secrets and feature flags live in `config/.env`. Everything is optional; features degrade gracefully.
```ini
# ── LLM ──────────────────────────────────────────────────
LLM_MODE=groq                  # groq | local
LLM_API_KEY=gsk_your_groq_key  # console.groq.com
LLM_API_MODEL=llama-3.3-70b-versatile
LLM_MAX_TOKENS=300
LLM_TEMPERATURE=0.05

# Optional synthesis quality lane
ANTHROPIC_API_KEY=
ANTHROPIC_MODEL=claude-opus-4-6

# ── Weather ──────────────────────────────────────────────
OPENWEATHER_API_KEY=           # openweathermap.org
DEFAULT_CITY=London

# ── Email ────────────────────────────────────────────────
EMAIL_ADDRESS=your@gmail.com
EMAIL_PASSWORD=                # Gmail App Password, not login password
IMAP_HOST=imap.gmail.com

# ── News ─────────────────────────────────────────────────
NEWS_API_KEY=                  # newsapi.org

# ── Spotify ──────────────────────────────────────────────
SPOTIFY_CLIENT_ID=
SPOTIFY_CLIENT_SECRET=
SPOTIFY_REDIRECT_URI=http://localhost:8888/callback
```

Gmail App Password: Google Account → Security → 2-Step Verification → App Passwords → generate one for "Mail". Use that value, not your actual password.

Spotify: create an app at developer.spotify.com/dashboard and set the redirect URI to `http://localhost:8888/callback`. First launch opens a browser for OAuth; the token is cached in `.cache` afterwards.
## Running JARVIS

Desktop app (React UI via pywebview):

```bash
python webview_main.py
```

Terminal:

```bash
python jarvis_cli.py              # full
python jarvis_cli.py --no-tts     # text only, no audio
python jarvis_cli.py --debug     # live event tracing
python jarvis_cli.py --no-memory  # disable memory this session
```

The CLI reuses the entire backend: same agents, same skills, same memory, same event bus. Every bug found here is a real backend bug.
## CLI Interface

| Command | Description |
|---|---|
| `/debug` | Toggle live event tracing: see every bus event as it fires |
| `/memory` | Dump memory state: user profile, entities, recent exchanges |
| `/skills` | All registered skills and their event subscriptions |
| `/events` | Raw event log from this session |
| `/clear` | Clear the terminal |
| `/exit` | Graceful shutdown; archives the session to episodic memory first |
| `/help` | Command reference |
## Project Structure

```
jarvis-main/
│
├── agents/                      # Multi-agent layer (Phase 4/4.5)
│   ├── orchestrator.py          # Plans, dispatches, synthesises
│   ├── info_agent.py            # Weather, news, time, knowledge
│   ├── system_agent.py          # Apps, volume, brightness, screenshot
│   ├── media_agent.py           # Spotify, YouTube, downloads
│   ├── comms_agent.py           # Email, WhatsApp, Calendar
│   ├── browser_agent.py         # URL, search, tab management
│   ├── personal_agent.py        # Alarms, reminders, biometrics, gestures
│   └── synthesis_agent.py       # Runtime research/codegen/validation pipeline
│
├── core/
│   ├── capability_manifest.py   # JARVIS capability list, injected into prompts
│   ├── database.py              # SQLite: alarms, history, all memory tables
│   ├── engine.py                # Engine lifecycle
│   ├── event_bus.py             # Central async event bus
│   ├── llm_client.py            # Groq → local LLaMA fallback, singleton
│   ├── logger.py                # Structured logging
│   ├── proactive_agent.py       # Background: calendar checks, open threads
│   ├── reasoning.py             # Pre-response reasoning scratchpad
│   ├── code_sandbox.py          # Generated skill AST + subprocess validation
│   ├── skill_loader.py          # Generated skill hot-loading and registry
│   ├── task_queue.py            # Parallel task execution + dependency resolution
│   └── memory/
│       ├── manager.py           # MemoryManager singleton
│       ├── working.py           # Session exchanges with annotations
│       ├── episodic.py          # ChromaDB semantic vector store
│       ├── semantic.py          # SQLite user fact store
│       ├── procedural.py        # Named workflow store
│       ├── entity_store.py      # Entity knowledge graph
│       └── entity_extractor.py  # LLM entity extraction pipeline
│
├── services/                    # Hardware I/O only
│   ├── tts.py                   # Kokoro ONNX + pyttsx3 fallback
│   ├── stt.py                   # Speech recognition
│   ├── biometrics.py            # face_recognition / LBPH
│   ├── gesture.py               # MediaPipe gesture → events
│   └── wake_word.py             # openWakeWord / SR fallback
│
├── skills/                      # Event-driven capability modules
│   ├── llm_skill.py             # Fallback brain (AGENTS_ENABLED=False)
│   ├── weather_skill.py
│   ├── communication.py         # Email SMTP + IMAP
│   ├── whatsapp_skill.py
│   ├── news_skill.py
│   ├── media_control.py         # YouTube via Selenium
│   ├── media_downloader.py      # yt-dlp
│   ├── browser_control.py       # Selenium tab control
│   ├── system.py                # subprocess app launch/close
│   ├── system_control.py        # Volume, brightness, screenshot
│   ├── quick_launch.py          # Known apps/URLs
│   ├── productivity.py          # SQLite-persisted alarms
│   ├── spotify_skill.py         # Spotify Web API
│   ├── calendar_skill.py        # Google Calendar + Meet
│   └── web_automation.py        # Google search
│
├── interfaces/
│   ├── agent.py                 # BaseAgent
│   ├── skill.py                 # BaseSkill
│   └── adapter.py               # BaseAdapter
│
├── frontend/src/
│   ├── components/              # AvatarPanel, ChatArea, EntityPanel, DashboardMode...
│   ├── hooks/                   # useJarvisBridge, useJarvisState, useChatHistory
│   └── styles/                  # tokens.css, animations.css, components.css
│
├── scripts/
│   ├── setup.py                 # Interactive first-run wizard
│   ├── download_model.py
│   ├── download_kokoro.py
│   ├── migrate_v2_to_v3.py      # Legacy DB migration helper
│   ├── register_face.py
│   ├── test_all.py              # Focused v3 validation suite
│   └── test_entities.py
│
├── models/
│   ├── Llama-3.2-3B-Instruct-Q4_K_M.gguf
│   ├── kokoro/                  # kokoro-v0_19.onnx + voices.bin
│   └── biometrics/              # face_model.yml + face_encodings.npy
│
├── memory_store/chroma/         # ChromaDB episodic store (auto-created)
├── jarvis-generated-code/       # Runtime-generated BaseSkill modules
├── config/.env                  # Secrets, gitignored
├── config/settings.yaml
├── webview_main.py              # Main entry point
├── jarvis_cli.py                # Terminal entry point
├── app_config.py                # Feature flags
└── jarvis.db                    # SQLite (auto-created, gitignored)
```
## Extending JARVIS

### Adding a Skill

```python
# skills/my_skill.py
from interfaces.skill import BaseSkill
from core.event_bus import Event

class MySkill(BaseSkill):
    def register(self):
        self.bus.subscribe("my_trigger", self.handle)

    async def handle(self, event: Event):
        await self.bus.emit("tts_speak", "Done.")
```

Register it in `webview_main.py` and `jarvis_cli.py`.
### Adding an Agent Action

Add the action string to the Orchestrator's `ORCHESTRATOR_PROMPT`, handle it in the relevant agent's `handle()` method, and emit the appropriate skill event.
### Adding an Agent

```python
# agents/my_agent.py
from interfaces.agent import BaseAgent

class MyAgent(BaseAgent):
    NAME = "MyAgent"

    async def handle(self, task: dict) -> dict:
        action = task.get("action", "")
        if action == "my_action":
            await self.bus.emit("my_event", task.get("params", {}))
            return self._ok(speech="Done.")
        return self._err(f"Unknown: {action}")
```

Register it in `agents/__init__.py`, add it to `_agent_registry` in `webview_main.py`, and add it to the Orchestrator's available agents list.
## Voice & Personality

Kokoro ONNX with `bm_george`: a real British male voice that runs entirely offline at ~300ms CPU latency. Automatically falls back to pyttsx3 if the models are absent. Upgrade path to ElevenLabs "Brian" (`nPczCjzI2devNBz1zQrb`) for production-quality output.
British butler. Impeccably polite on the surface, quietly judging everything underneath. Dry wit, zero sycophancy, genuine care. Never uses "Certainly!", "Absolutely!", or "Great question!". British spellings throughout. See `SYSTEM_PROMPT` in `skills/llm_skill.py` for the full definition.
The `CAPABILITY_MANIFEST` in `core/capability_manifest.py` ensures the LLM always knows exactly what it can and cannot do, and offers sensible alternatives rather than failing silently or hallucinating capabilities.
## Biometric Setup

**face_recognition** (recommended, more accurate):

```bash
pip install face_recognition   # requires cmake + dlib (~10 min build)
python scripts/register_face.py
# → saves models/biometrics/face_encodings.npy
```

**OpenCV LBPH** (fallback, zero extra dependencies):

```bash
python scripts/register_face.py
# Auto-detects absence of face_recognition → LBPH path
# → saves models/biometrics/face_model.yml
```

Trigger: say "login", type `login` in the CLI, or press the login button in the UI.
## Testing

```bash
# Focused v3 validation suite
python scripts/test_all.py

# Entity memory targeted tests
python scripts/test_entities.py

# Quick Groq connectivity check
python scripts/test_groq_conn.py
```

`scripts/test_all.py` exercises schema bootstrap, entity contradiction handling, reasoning heuristics, task-queue parallelism, synthesis safety, migration idempotency, and the Phase 5 frontend wiring. It exits non-zero on failure.
## License

Distributed under the POV Personal Use License v1. See LICENSE.
Free for personal use, study, and non-commercial projects with attribution. Commercial use requires written permission from the author.
"Sometimes you've gotta run before you can walk."
Built with Python · React · asyncio · Groq · Kokoro · ChromaDB · and entirely too much ambition.