Private, self-hosted Email RAG: turn your own mail archive into a queryable knowledge base that runs on your hardware, on open models, with nothing required to leave your network. A faithful, private record of what you've written and received — an AI memory you actually own, one half of a context stack you own. (The other half is parley, for calls and meetings.)
What it buys you. On a real ~32k-email corporate mailbox, stacking the retrieval techniques takes recall@5 from 46% (plain dense) to 93%, and the two biggest levers aren't a fancier model: a per-email contextual summary (+13) and reconstructing the whole thread instead of hunting one message (+29). As a yardstick, the email-tuned hybrid is benchmarked against NVIDIA's general-purpose retrieval stack: it wins on email, while NVIDIA's stack wins on broad legal e-discovery (TREC). Same systems, opposite winners: the result is task-dependent. Numbers are author-reported on a private mailbox (cross-checked on public Enron-QA, same ordering); the 93% is thread-level recall (see caveats), and the public `make demo` reproduces the method. Full write-up and reproducible scripts are in the case study below.
The first time I pointed cloud AI at my inbox it felt like a superpower — until I thought about what it actually required: handing my entire email history to someone else's servers to make it searchable. For real correspondence — contracts, receipts, the record of who agreed to what — that's a non-starter.
So I built the opposite. mailrag runs on your own hardware, on open models, with nothing required to leave your network — no mailbox upload, no vendor to trust with the whole archive.
Then the real point clicked. These aren't just emails, they're context. A faithful, private record of what was said and written is exactly what an AI agent needs to be useful about your work — kept private and self-owned, so you get total recall without renting your memory to anyone. mailrag is one private context source, for email; parley is another, for calls and meetings — different domain, different machinery (on-device audio + diarization). They don't talk to each other; my agents know about both and reach for whatever fits. The point was never a single app — it's a private, open stack of context I own.
mailrag turns a mailbox into a queryable knowledge base, built on LlamaIndex:
- Pluggable loaders: public Enron corpus (HuggingFace), local `.eml` archives, or Azure Blob Storage, behind one `EmailLoader` interface.
- Email-aware preprocessing: reply-chain stripping, calendar-invite collapsing, noise/newsletter filtering, exact-text chunk dedup.
- Thread-aware answers (the flagship): match a single small unit, then answer from its entire conversation. It roughly doubles answer coverage (terse replies 33% → ~80%), it's the biggest single retrieval win, and it needs no LLM.
- Hybrid retrieval: bge-m3 dense + sparse vectors (RRF-fused) in Qdrant (also supports local persistence and Pinecone). Gets both the concept and the rare exact token: acronyms, IDs, reference numbers.
- Local-LLM `summarize`: optional per-email summary + noise judgement from a local LLM, content-addressed and cached, so re-runs are free.
- A measured methodology: a 360-query retrieval eval that prices each technique, controls for confounds, reports significance, and in several cases overturned the intuitive choice. The core techniques were later confirmed against a public, human-judged benchmark.
- Source-agnostic API: `load_emails(source="enron"|"mail_archive_x"|"azure_blob")`.
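To make the "one interface, many sources" idea concrete, here is a minimal sketch of how a pluggable loader layer could be wired. It is an illustration, not the project's code: the `NormalizedEmail` fields, the `load()` method name, the `register_loader` helper, and the `InMemoryLoader` toy source are all assumptions; only the names `EmailLoader`, `NormalizedEmail`, and `load_emails(source=...)` come from this README.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List


@dataclass(frozen=True)
class NormalizedEmail:
    # Hypothetical fields; the real model in src/data/ may differ.
    message_id: str
    thread_id: str
    sender: str
    subject: str
    body: str


class EmailLoader(ABC):
    """Every source implements the same method and yields the same record type."""

    @abstractmethod
    def load(self) -> Iterator[NormalizedEmail]:
        """Yield one NormalizedEmail per message in the source."""


class InMemoryLoader(EmailLoader):
    """Toy source standing in for enron / mail_archive_x / azure_blob."""

    def __init__(self, rows: List[dict]) -> None:
        self._rows = rows

    def load(self) -> Iterator[NormalizedEmail]:
        for row in self._rows:
            yield NormalizedEmail(**row)


# Hypothetical registry: source name -> loader factory.
_LOADERS: Dict[str, Callable[..., EmailLoader]] = {}


def register_loader(name: str, factory: Callable[..., EmailLoader]) -> None:
    _LOADERS[name] = factory


def load_emails(source: str, **kwargs) -> List[NormalizedEmail]:
    """Dispatch on the source name; callers never touch a concrete loader."""
    if source not in _LOADERS:
        raise ValueError(f"unknown source {source!r}; known: {sorted(_LOADERS)}")
    return list(_LOADERS[source](**kwargs).load())
```

Because everything downstream sees only `NormalizedEmail`, adding a source in a design like this means writing one loader class and registering it; preprocessing, chunking, and retrieval stay unchanged.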
```bash
git clone https://github.com/fmasi/mailrag.git
cd mailrag
pip install -r requirements.txt   # includes FlagEmbedding (bge-m3); first run downloads ~2 GB of weights
cp .env.example .env              # add an LLM key/endpoint (used for summaries + answers)
make demo                         # starts Qdrant, builds the contextual index, runs thread-aware queries
```

`make demo` brings up Qdrant (Docker), builds a thread-aware contextual index over 100 Enron emails (per-email preceding-context summaries embedded with bge-m3 hybrid vectors), then answers example questions by retrieving and assembling whole threads. This is the §13 stack; a small amount of LLM usage goes to the Pass-2 summaries and the answers.
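The Pass-2 summaries are described as content-addressed and cached, which is why re-running the pipeline does not spend LLM budget twice. A minimal sketch of that idea follows, assuming a one-JSON-file-per-key layout on disk; the function names, the `prompt_version` parameter, and the cache layout are hypothetical and not taken from the mailrag source.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def content_key(body: str, prompt_version: str) -> str:
    """Key depends only on the email text and the prompt/rubric version."""
    digest = hashlib.sha256()
    digest.update(prompt_version.encode("utf-8"))
    digest.update(b"\x00")  # separator so version and body cannot blur together
    digest.update(body.encode("utf-8"))
    return digest.hexdigest()


def cached_summary(
    body: str,
    summarize: Callable[[str], dict],
    cache_dir: Path,
    prompt_version: str = "v1",
) -> dict:
    """Return the cached result if this exact content was seen; else call the LLM once."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{content_key(body, prompt_version)}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = summarize(body)  # the only place LLM budget is spent
    path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Keying on content rather than on a message ID means identical bodies share one cache entry, and bumping the hypothetical `prompt_version` invalidates old summaries when the rubric changes.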
┌─────────────────────────────┐
sources │ EmailLoader (ABC) │
┌─────────┐ ├──────────┬─────────┬─────────┤
│ Enron │──────────│ enron │ mail_ │ azure │
│ .eml │ │ │ archive │ blob │
│ Azure │ └────┬─────┴────┬────┴────┬────┘
└─────────┘ │ NormalizedEmail │
▼ ▼
Pass-1: regex noise filter — tag bulk/newsletters (no LLM)
▼
Pass-2: local LLM — summarize + judge noise, cached (optional)
▼
drop noise · dedup · reply-chain strip
└ drop stage is tunable: Pass-1 = save LLM budget · Pass-2 = best quality
▼
chunk (SentenceSplitter, bge-m3 tokenizer)
└ optional: prepend each email's summary → contextual retrieval
▼
embed (bge-m3 dense + sparse)
▼
┌───────────────────┼────────────────────┐
▼ ▼ ▼
local persist Qdrant (hybrid) Pinecone
▼
query engine (hybrid RRF · thread-aware expansion · optional rerank)
Pass-1 only tags by default, so nothing is lost before the LLM sees it; the confident drop happens at Pass-2. Where you drop is a deliberate budget-vs-quality knob — drop at Pass-1 to skip the LLM cost, or at Pass-2 for the cleaner result.
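The tag-then-decide behaviour can be sketched as follows. This is an illustration under stated assumptions: the two regexes are made-up examples rather than the project's rule set, and the `Email` class, the tag names, and the `drop_stage` parameter are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Optional, Set

# Illustrative rules only; the project's real rule set is not shown in this README.
BULK_SENDER = re.compile(r"^(no-?reply|newsletter|notifications?)@", re.IGNORECASE)
AUTO_SUBJECT = re.compile(r"^(accepted|declined|out of office|automatic reply)\b", re.IGNORECASE)


@dataclass
class Email:
    sender: str
    subject: str
    tags: Set[str] = field(default_factory=set)


def pass1_tag(email: Email) -> Email:
    """Tag only; never drops, so the LLM pass can still see every email."""
    if BULK_SENDER.search(email.sender):
        email.tags.add("pass1:bulk-sender")
    if AUTO_SUBJECT.search(email.subject):
        email.tags.add("pass1:auto-subject")
    return email


def keep(email: Email, drop_stage: str, pass2_is_noise: Optional[bool] = None) -> bool:
    """drop_stage='pass1' trusts the regex tags; 'pass2' waits for the LLM verdict."""
    if drop_stage == "pass1":
        return not any(tag.startswith("pass1:") for tag in email.tags)
    if drop_stage == "pass2":
        if pass2_is_noise is None:
            raise ValueError("drop_stage='pass2' needs the LLM verdict")
        return not pass2_is_noise
    raise ValueError(f"unknown drop_stage {drop_stage!r}")
```

With `drop_stage="pass1"` a tagged email never reaches the LLM, which saves budget; with `drop_stage="pass2"` the regex tag is only a hint and the LLM verdict decides, which is the higher-quality setting described above.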
Between the two there's an optional, no-LLM triage: `mailrag explore` clusters the corpus embeddings at thread level and ranks the densest "noise pockets" (bulk and automated mail) by Pass-1-tag enrichment, sender concentration, and tightness. It spends no LLM budget (it reuses the already-embedded vectors when a collection exists, else embeds once) and writes a JSON artifact (thread → `.eml` paths) so you can see where the noise concentrates before deciding how much Pass-2 to run.
Most numbers below come from running `mailrag` on a real ~32,000-email corporate mailbox (all references anonymized); the portability check at the end uses a second, personal archive. They're here so the repo doubles as a worked example: why each step exists, what it saves, and what it costs.
The corpus is filtered in stages before anything gets embedded:
| stage | what it does | effect on this corpus |
|---|---|---|
| Scope | keep only the work-account folders | 70,016 exported → 31,969 selected |
| Pass-1 (regex) | cheap sender/subject rules flag obvious bulk (newsletters, social, automated senders) before any expensive work; tag-only by default | flags 10.4% (3,332) |
| Pass-2 (local LLM) | summarize + judge each email's content | flags 37.9% (12,123) as noise |
| Calendar-collapse + chunk-dedup | one-line calendar summaries; drop byte-identical chunks | 22,613 → 21,590 chunks (−1,023) |
| Net | all stages combined | 31,969 emails → 19,859 kept → 21,590 embedded chunks |
How much of this actually needed an LLM? We measured it. Regex rules derived from the corpus (high-noise sender domains, calendar/out-of-office subject patterns) catch ~65% of the LLM's noise at high precision, but miss ~35% (≈4,200 emails). The miss is structural: the work domain itself is 29% noise — 24k emails interleaving real correspondence with compliance reminders, calendar churn, AMAs, and internal newsletters — and you can't write a sender rule for your own domain. That ~35% is the LLM's unique contribution. Two qualifiers keep it honest:
- Rule discovery didn't need a full pass. The dominant noise senders (LinkedIn, Zoom, SharePoint, …) fall straight out of a sender-frequency table; a small sample finds the rules without the 32k run.
- The 48 h → under-10-min embedding win was the inference method (FlagEmbedding on Apple-Silicon MPS) plus volume reduction, not the LLM.
So the local-LLM pass earns its keep twice: the ~35% mixed-domain noise cheap rules can't reach, and the per-email summaries that power the retrieval gains below (contextual retrieval, reranking) and make results human-readable. The lesson: use cheap regex for the obvious bulk, reserve the LLM for the interleaved noise and the summaries only it can write.
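The sender-frequency table mentioned in the first qualifier is cheap to build. A minimal sketch, assuming you already have a list of raw sender addresses from a sample of the mailbox; the function names are hypothetical and the example addresses are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def sender_domain(address: str) -> str:
    """Extract the lower-cased domain from 'user@host' or 'Name <user@host>'."""
    return address.rsplit("@", 1)[-1].strip(" >").lower()


def top_sender_domains(senders: Iterable[str], n: int = 10) -> List[Tuple[str, int, float]]:
    """Return (domain, count, share of sample) for the n most frequent domains."""
    counts = Counter(sender_domain(s) for s in senders)
    total = sum(counts.values())
    return [(domain, count, count / total) for domain, count in counts.most_common(n)]
```

Domains that dominate the table and send no real correspondence are candidates for a Pass-1 rule. The work domain will also rank high, which is exactly the case a sender rule cannot resolve and the LLM pass exists for.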
And the rubric itself doesn't port across corpora. Run the same pipeline over a ~25,000-email personal archive and the point makes itself: the corporate noise rubric flagged 87.6% of personal mail as noise — it would have deleted real receipts, bank statements, and correspondence — while a rubric calibrated for the personal corpus flagged 61.5%. A cheap ~200-email calibration caught the gap before the ~6 h run, and a spot-check of the dropped pile confirmed it, all on a local model with no cloud spend. Full write-up: EXPERIMENTS §14.
| technique | what it adds | trade-off (observed) |
|---|---|---|
| Dense (semantic) only | matches meaning & paraphrase | misses rare exact tokens (acronyms, IDs); returns redundant near-duplicate chunks |
| + learned sparse + RRF fusion (bge-m3) | exact-token / acronym precision, fused with semantics | needs a sparse-capable embedder + fusion; more storage |
| + LLM noise removal | precision — catches the ~⅓ of noise regex can't, and clears junk out of the top results (measured below) | one-time LLM cost (see above) |
| + contextual retrieval (prepend each email's summary before embedding; the C′ / `work-rag-ctx-*` collection) | short/terse emails match by gist; the best ranked arm and the end-to-end winner | one extra embedded collection to build/maintain |
| + cross-encoder reranker | small precision lift on pointed queries (+2.5 R@5) | demotes the answer on thread-spanning queries (and hurt outright under the earlier LLM-judged eval, §9); off by default |
| + thread reconstruction (pull the full conversation of each top hit) | recall@5 62% → 93% — match a small unit, answer from its whole thread | larger context per query (tunable: expand top-N threads) |
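The contextual-retrieval row is the simplest of these to picture: the text that gets embedded is the email's summary followed by the chunk, so a terse chunk still carries the gist of its email. A minimal sketch, in which the `Summary:` prefix, the separator, and the function name are assumptions rather than mailrag's actual format:

```python
from typing import List


def contextualize_chunks(summary: str, chunks: List[str], sep: str = "\n\n") -> List[str]:
    """Prepend the email-level summary to every chunk before it is embedded."""
    summary = summary.strip()
    if not summary:
        # No summary available (Pass-2 is optional): embed the chunks unchanged.
        return list(chunks)
    return [f"Summary: {summary}{sep}{chunk}" for chunk in chunks]
```

A reply such as "Approved, go ahead." has almost nothing to match on by itself; prefixed with a summary that names what was approved, it becomes retrievable by the question a user would actually ask.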
How the eval was run. The eval set is 360 synthetic queries (144 terse / 144 content /
72 spanning), each generated from a known email so the recall ladder is scored against hard
gold labels — no LLM judge in the loop. A separate answer-quality lens does use a local LLM
judge, calibrated against a stronger reference model (Cohen's κ = 0.52 on the 0–3 scale,
0.80 binary at the relevance threshold actually used; Spearman 0.74). The core techniques
were cross-checked on the TREC Legal Track's real human judgments and on public Enron-QA,
which agreed on ordering. Significance tests and confound controls are in
EXPERIMENTS.md §9–§13:
- Thread reconstruction is the biggest single win, and it needs no LLM. Matching a small unit and returning its whole conversation lifts recall@5 from 62% (contextual hybrid) to 93%; the ladder below books the step as +29 because it starts from the 64% reranked arm. It trades "find the needle" for "find the right thread," which the conversation then answers.
- Thread-aware summaries help where they're designed to — terse replies. (Note: "thread-aware" names two things — the retrieval expansion above, and this summary-conditioning step; see the terminology box.) Conditioning each email's embedded summary on its preceding thread context lifts terse-reply retrieval from covered@3 75% → 81% (p = 0.035). The corpus-wide effect is real but modest (+3pp), and we report it as such.
- A confound caught and reported. An early +6pp headline turned out to be half a quantization artifact; re-running the control at matched quant split it into +3pp (quant) and +3pp (method). Holding the summarizer fixed is the difference between a result and a mirage.
- Cleanup pays in precision, not recall. Leaving the noise a regex can't catch barely dents gold recall (the DB still finds the answer), but then 21% of queries surface noise in their top-3 (~11% of slots) — junk the LLM removes for free in the pass that also writes the summary.
- Reranking helps pointed questions but hurts thread-spanning ones. A cross-encoder reranker adds only +2.5 recall@5 overall and demotes the answer on multi-email questions (no single message looks like the whole answer) — and it hurt outright under the earlier LLM-judged answer-quality eval. Query-side HyDE never beat the raw query on this entity-rich corpus. Both stay in-tree, off by default, for corpora where they'd pay off.
- The ceiling is retrieval, not the model. With the answer in context, even a 4 B model answered ~88% correctly; the lost points are queries where retrieval never surfaced the thread. Model size was second-order.
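Thread reconstruction, the first finding above, amounts to a small-to-big expansion: rank small units, then hand the answering model the whole conversation each top hit belongs to. A minimal sketch, assuming every message carries a thread identifier and an ISO 8601 timestamp in a single consistent format; the `Message` class and `expand_to_threads` are hypothetical names, not the project's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Message:
    message_id: str
    thread_id: str
    sent_at: str  # ISO 8601 in one consistent format, so string order is time order
    body: str


def expand_to_threads(
    hits: List[Message], corpus: List[Message], top_n_threads: int = 3
) -> List[List[Message]]:
    """Map ranked hits to their full threads, keeping rank order and dropping repeats."""
    by_thread: Dict[str, List[Message]] = defaultdict(list)
    for message in corpus:
        by_thread[message.thread_id].append(message)

    thread_ids: List[str] = []
    for hit in hits:  # hits arrive best-first
        if hit.thread_id not in thread_ids:
            thread_ids.append(hit.thread_id)

    # Each selected thread is returned oldest-first so it reads as a conversation.
    return [
        sorted(by_thread[thread_id], key=lambda m: m.sent_at)
        for thread_id in thread_ids[:top_n_threads]
    ]
```

`top_n_threads` is the knob behind the "larger context per query" trade-off: expanding fewer threads keeps prompts short, expanding more raises the chance that the answering thread is in context.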
The compound effect: the canonical recall@5 ladder. Each technique added one at a time, scored on the 360 queries against hard gold labels (no LLM judge), reproducible via `scripts/eval/bench_avc.py` + `bench_thread_reconstruction.py`:
| step | recall@5 | gain |
|---|---|---|
| plain dense | 46% | — |
| + learned sparse | 49% | +3 |
| + contextual summary | 62% | +13 |
| + reranking | 64% | +2 |
| + thread reconstruction ★ | 93% | +29 |
★ The last step switches from "find the exact email" to "find its thread" — a legitimately easier, more useful target (thread-recall). The two biggest levers (thread reconstruction +29, contextual summary +13) are both about understanding the conversation, not a fancier embedding model. Same ordering on public Enron-QA; the NVIDIA dense+rerank yardstick trails on email (57% R@5 vs the hybrid's 62%) but wins on TREC legal e-discovery — task-fit, not brand. The value isn't any single trick; it's the disciplined stack and the rigor to prove every layer.
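The difference between the two targets in the ladder can be made precise. A minimal sketch of message-level and thread-level recall@k, assuming one gold email per query (as in the synthetic eval set) and a message-to-thread lookup; it illustrates the metric definitions and is not the code in `scripts/eval/`.

```python
from typing import Dict, List, Sequence


def recall_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int = 5) -> float:
    """Fraction of queries whose gold id appears among the top-k ranked ids."""
    if not gold or len(ranked) != len(gold):
        raise ValueError("need one ranked list per gold label, and at least one query")
    hits = sum(1 for ranking, target in zip(ranked, gold) if target in ranking[:k])
    return hits / len(gold)


def thread_recall_at_k(
    ranked: Sequence[Sequence[str]],
    gold: Sequence[str],
    thread_of: Dict[str, str],
    k: int = 5,
) -> float:
    """Same metric after mapping every message id to its thread id."""
    ranked_threads: List[List[str]] = [[thread_of[m] for m in ranking] for ranking in ranked]
    gold_threads = [thread_of[g] for g in gold]
    return recall_at_k(ranked_threads, gold_threads, k)
```

A query that surfaces a sibling message from the right conversation counts as a miss under the first metric and a hit under the second, which is why the thread-level number is an easier target and should be read as such.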
Worked example. Searching for the salon partner programme by its acronym ("SPP")
mixes a semantic concept (partnership onboarding) with a rare exact token (SPP).
Dense-only finds the concept but ranks the literal acronym low; sparse-only finds the token
but misses paraphrases; hybrid + RRF gets both. Multi-query expansion (searching several
phrasings and fusing with RRF) further bridges acronym ↔ expansion ("SPP" ↔ "Salon Partner
Programme"), at the cost of extra queries per search.
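The fusion step in this example is Reciprocal Rank Fusion. A minimal pure-Python sketch follows, using the conventional smoothing constant k = 60; it shows the scoring rule only and is not the project's fusion path. The document IDs are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Score each doc by the sum of 1 / (k + rank) over every ranking it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented IDs: dense ranks the concept first, sparse ranks the literal "SPP" first.
dense = ["onboarding-guide", "spp-faq", "partner-terms"]
sparse = ["spp-faq", "spp-invoice", "onboarding-guide"]
fused = rrf_fuse([dense, sparse])
```

Here `spp-faq` ends up first because it ranks well in both lists, ahead of `onboarding-guide`, which tops only the dense list. Multi-query expansion reuses the same function: run each phrasing as its own search and pass all the resulting rankings in together.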
| Path | Responsibility |
|---|---|
| `src/config/` | Configuration + LlamaIndex Settings |
| `src/data/` | `NormalizedEmail` model, multi-source `load_emails` API |
| `src/data/loaders/` | Pluggable source loaders (`enron`, `mail_archive_x`, `azure_blob`) |
| `src/ingest/` | Embedding (bge-m3), sparse vectors, hybrid Qdrant upsert |
| `src/indexing/` | Index creation/management |
| `src/storage/` | Persistence (local / Pinecone / Qdrant) |
| `src/query/` | Retrieval + RAG query engine |
| `src/llm/` | Optional LLM "Pass-2" summarization + cache |
| `scripts/` | Build / index / maintenance utilities |
| `tests/` | Test suite (pytest) |
| `docs/` | Architecture, quickstart, preprocessing guides |
Full map and reading order: `docs/INDEX.md`. The reader journey is this page → quickstart → setup → deep dives:
- You are here (`README.md`): overview, quickstart, and the case study.
- `docs/GUIDE.md`: the friendly walkthrough, covering the cleanup funnel diagram, how to pick a persona (budget vs quality), and what the `wizard` looks like when you run it.
- `docs/QUICKSTART.md`: 5-minute setup and copy-paste usage patterns.
- `docs/SETUP.md`: full setup, the local `.eml` pipeline, and how to run the tests.
- Deep dives:
  - `docs/VERBS.md`: the verb ladder (cost of each step) and the persona recipes; the source of truth for the CLI.
  - `docs/ARCHITECTURE.md`: design decisions & extension points.
  - `docs/EMAIL_PREPROCESSING.md`: reply-chain stripping & chunk tuning.
  - `docs/RETRIEVAL_GUIDE.md`: the retrieval stack end-to-end, including hybrid fusion, contextual retrieval, reranking, and thread-aware retrieval (small→big expansion).
  - `docs/EXPERIMENTS.md`: the measured findings behind the case study, including cleanup economics, regex-vs-LLM, the labeled-eval ladder (§9–§13), and the corpus-portability result (§14). Start with its terminology box for the `C`/`C′` labels and the two senses of "thread-aware".
Reference: config/community_blocklist.template.yaml — portable starter noise rules (~1/3 of corporate-mail noise, corpus-independent).
mailrag is built to be one node in a private context stack — so the next steps make it easier for agents to reach, and keep its memory current:
- MCP server (#32): expose `search`/`ask` and attachment fetch over the Model Context Protocol, so any agent can query your mail without touching the internals.
- Live ingestion: move from one-time imports to incremental ingest of incoming mail, so the index stays current as a living context source, not a static snapshot. (The `EmailLoader` interface is already source-agnostic to make this clean.)
- Guided TUI (#36): a full-screen terminal UI for the cleanup pipeline (pick a persona, watch the funnel, approve the calibrate gate), replacing today's prompt-by-prompt flow.
I build private, self-hosted context tools for AI agents — software that gives an agent (and me) total recall over my own work without renting my memory to a vendor. mailrag covers email; parley covers calls and meetings.
I care about retrieval quality you can actually measure, email and information-retrieval systems, and engineering claims backed by numbers and honest caveats. If that's useful to you, or you're hiring, I'd like to hear from you.
- LinkedIn — https://www.linkedin.com/in/fmasi/
- GitHub — https://github.com/fmasi
Apache 2.0 — see also NOTICE. Copyright © 2026 Frederic Masi.
If you build on this work (code or method), please preserve the attribution in NOTICE.