Trace collection and cache-hit analysis workspace for memory-augmented agent workloads.
This repository compares prompt-cache behavior across:
- `mem0`: graph memory
- `graphiti`: temporal graph memory
- `tau2-bench`: conversational workloads (airline, retail, telecom)
- `LoCoMo`: long-term memory benchmark (multi-session dialogues)
The core question is:
- "When prompt prefixes are unstable, how much can substring/block caching recover?"
Primary focus:
- `mem0` (graph) vs `graphiti` is the main comparison.
- `tau2` domains are supporting baselines, not the main comparison target.
- `LoCoMo` fills a gap that independent-item datasets cannot cover (see below).
End-to-end pipeline:
- Run agent workloads and intercept every LLM call.
- Normalize calls into a shared JSONL trace format.
- Run `lmcache-agent-trace/prefix_analysis.py`.
- Compare `prefix` vs `substring` hit rates and their gap.
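The prefix-vs-substring distinction can be illustrated with a toy character-level sketch. This is not the algorithm in `prefix_analysis.py` (which tokenizes and scores against all previous steps); it is a simplified stand-in that compares each prompt only to the previous one, using stdlib `difflib` for block matching:

```python
import difflib

def prefix_hit(prev: str, cur: str) -> float:
    """Fraction of `cur` covered by the shared leading prefix with `prev`."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n / len(cur) if cur else 0.0

def substring_hit(prev: str, cur: str, min_len: int = 8) -> float:
    """Fraction of `cur` covered by matching blocks found anywhere in `prev`."""
    blocks = difflib.SequenceMatcher(None, prev, cur).get_matching_blocks()
    covered = sum(b.size for b in blocks if b.size >= min_len)
    return covered / len(cur) if cur else 0.0

prev = "SYSTEM: extract entities.\nCONTEXT: Alice met Bob.\nQ: who?"
cur  = "SYSTEM: extract entities.\nCONTEXT: Bob met Carol.\nQ: who?"
# Prefix caching only reuses text up to the first divergence; substring
# matching also recovers the repeated trailing block after the mutation.
print(f"prefix={prefix_hit(prev, cur):.2f} substring={substring_hit(prev, cur):.2f}")
```

The gap between the two numbers is exactly the "recoverable by block caching but lost to prefix caching" quantity this repository measures.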
Matrix extension:
- Run `dataset x baseline` experiments with `openai_base`, `mem0`, and `graphiti`.
- Datasets include `corpus50`, `tau2_*`, `taubench_legacy`, and `locomo`.
- In matrix mode, `tau2_*` and `taubench_legacy` are replayed as prompt-text datasets (not full tau2 environment simulation loops).
- `locomo` runs per-conversation (graph resets between conversations, accumulates within).
Data flow:
collector -> trace jsonl -> prefix_analysis -> matches jsonl + plot -> comparison chart
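The "trace jsonl" stage writes one JSON object per LLM call. A minimal sketch of such a normalizer follows; the field names (`step`, `system`, `input`, `output`, `ts`) are illustrative assumptions, the real schema lives in `src/trace_collector/common.py`:

```python
import json
import tempfile
import time
from pathlib import Path

def write_trace(path: Path, records: list[dict]) -> None:
    """Append normalized LLM-call records as one JSON object per line (JSONL).

    Field names here are illustrative; see src/trace_collector/common.py
    for the actual trace schema used by the collectors.
    """
    with path.open("a", encoding="utf-8") as f:
        for step, rec in enumerate(records):
            row = {
                "step": step,                          # call index within the session
                "system": rec.get("system", "unknown"),
                "input": rec["input"],                 # full prompt text sent to the LLM
                "output": rec["output"],               # completion / tool-call payload
                "ts": rec.get("ts", time.time()),
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Demo: write two records to a temporary session file.
path = Path(tempfile.mkdtemp()) / "demo_session.jsonl"
write_trace(path, [
    {"system": "mem0", "input": "extract entities: Alice met Bob", "output": "..."},
    {"system": "mem0", "input": "extract entities: Bob met Carol", "output": "..."},
])
```

Because every collector emits this same shape, `prefix_analysis.py` can stay collector-agnostic.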
| Layer | Role | Key files |
|---|---|---|
| Collector | Executes workload and captures LLM calls | src/trace_collector/*_collector.py |
| Normalizer | Writes unified trace schema | src/trace_collector/common.py |
| Analyzer | Runs LMCache analysis script | src/trace_collector/analyze.py |
| Aggregator | Builds cross-system chart | src/trace_collector/compare_chart.py |
| Analysis engine | Prefix/substring hit computation | lmcache-agent-trace/prefix_analysis.py |
Interception strategies by system:
| System | How calls are intercepted | Output trace |
|---|---|---|
| mem0 | OpenAI `response_callback` | `data/traces/mem0_graph/mem0_graph_session.jsonl` |
| graphiti | `OpenAIGenericClient` subclass override | `data/traces/graphiti_graph/graphiti_graph_session.jsonl` |
| tau2 | `litellm.completion` monkeypatch | `data/traces/tau2_<domain>/tau2_<domain>_session.jsonl` |
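The monkeypatch strategy used for tau2 boils down to swapping the library's completion entry point for a logging wrapper. A self-contained sketch of the pattern, using a stand-in namespace instead of the real `litellm` module:

```python
import types

# Stand-in for the patched library; the real collector targets litellm.completion.
fake_llm = types.SimpleNamespace(
    completion=lambda model, messages: {"choices": [{"text": "stub"}]}
)

captured: list[dict] = []

def install_trace_hook(lib) -> None:
    """Replace lib.completion with a wrapper that records every call."""
    original = lib.completion

    def traced(model, messages):
        result = original(model, messages)
        captured.append({"model": model, "messages": messages, "result": result})
        return result

    lib.completion = traced

install_trace_hook(fake_llm)
fake_llm.completion("gpt-4o-mini", [{"role": "user", "content": "hi"}])
```

The same wrap-and-record shape underlies the other two strategies; they differ only in where the hook attaches (callback vs subclass vs module attribute).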
If you want to analyze the core framework behavior, start here:
- `mem0`- and `graphiti`-only collection
- Per-system prefix/substring result comparison
- Gap interpretation at the architecture level
Fast path commands:

```shell
uv run python -m src.trace_collector.run_all --system mem0
uv run python -m src.trace_collector.run_all --system graphiti
uv run python -m src.trace_collector.analyze --system mem0
uv run python -m src.trace_collector.analyze --system graphiti
```

Matrix mode commands:
```shell
# 1) Collect matrix traces (all datasets x all baselines)
uv run python -m src.trace_collector.run_matrix --dataset all --baseline all --with-breakdown

# 2) Analyze matrix traces
uv run python -m src.trace_collector.analyze_matrix --dataset all --baseline all

# 3) Build markdown report
uv run python -m src.trace_collector.matrix_report -o docs/matrix_breakdown.md
```

Core outputs:
- `data/traces/mem0_graph/mem0_graph_session.jsonl`
- `data/traces/graphiti_graph/graphiti_graph_session.jsonl`
- `data/traces/mem0_result/mem0_matches.jsonl`
- `data/traces/graphiti_result/graphiti_matches.jsonl`
Interpretation guide:
- `mem0`: high prefix + small gap -> stable scaffold/template reuse
- `graphiti`: lower prefix + larger gap -> dynamic context injection with reusable moved blocks
All datasets feed into the same pipeline (collector → trace → prefix_analysis), but they stress different dimensions of prompt-cache behavior.
Independent-item datasets (no cross-item continuity):
| Dataset | Items | Avg size | What it measures |
|---|---|---|---|
| `corpus50` | 50 factual statements | 141 chars | System-prompt reuse under short, independent entity extraction |
| `tau2_airline` | 50 CS tasks | ~1.6k tokens | Domain-specific tool-calling prompt patterns |
| `tau2_retail` | 50 CS tasks | ~2.3k tokens | Same as airline, different domain vocabulary |
| `tau2_telecom` | 50 CS tasks | ~850 tokens | Same, telecom domain |
| `taubench_legacy` | 471 raw LLM calls | 663 chars | Legacy baseline replay (no live agent loop) |
These datasets share a common limitation: items are independent, so graph memory accumulation effects are weak. The graph grows, but each item doesn't reference or build upon a previous item's context.
Sequential multi-session dataset (cross-session continuity):
| Dataset | Items | Avg size | What it measures |
|---|---|---|---|
| `locomo` | 272 sessions (10 conversations × ~27 sessions) | 2,893 chars | Long-term memory accumulation across months of dialogue |
LoCoMo fills a gap that independent-item datasets cannot cover:
- Memory accumulation over time: within each conversation, sessions span months, and graph memory grows across sessions, so later sessions can reference facts from earlier ones. This is the real use case for graph memory systems.
- Prompt size inflation: for graphiti, the `<PREVIOUS_MESSAGES>` episode history grows with each session, so prefix cache hit rates should degrade over time within a conversation, a pattern invisible in independent-item datasets.
- Control group comparison: `openai_agents` (no memory system) should show no cross-session effect, confirming that observed patterns are memory-driven.
The per-conversation collection strategy ensures graph state resets between different people's conversations while accumulating within the same conversation:
```
conv_0 (Caroline & Melanie, 19 sessions, May→Oct 2023)
  → mem0/graphiti graph builds up across 19 sessions
  → graph resets
conv_1 (Jon & Gina, 19 sessions, Jan→Jul 2023)
  → fresh graph, builds up across 19 sessions
  → ...
```
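The reset-between / accumulate-within policy is a simple two-level loop. In this sketch, `reset_graph` and `ingest_session` are hypothetical stand-ins for the real collector calls:

```python
def run_locomo(conversations, reset_graph, ingest_session):
    """Replay each conversation's sessions in order, resetting graph state
    between conversations so memory accumulates only within one dialogue."""
    for conv in conversations:
        reset_graph()                 # fresh graph per conversation
        for session in conv["sessions"]:
            ingest_session(session)   # memory accumulates within the conversation

# Demo with recording stubs in place of the real graph backend.
resets, ingested = [], []
run_locomo(
    [{"sessions": ["s1", "s2"]}, {"sessions": ["s3"]}],
    reset_graph=lambda: resets.append(1),
    ingest_session=ingested.append,
)
```

Resetting between conversations is what makes the cross-session trends attributable to memory accumulation rather than to leakage across unrelated dialogues.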
Expected observations:
- graphiti: prefix hit rate drops session-over-session; substring gap widens
- mem0: entity/relation graph enriches; substring reuse increases with graph density
- openai_agents: flat across sessions (no memory = no accumulation)
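The graphiti expectation can be sanity-checked with a toy model: if a summary block near the top of the prompt mutates every session (early divergence) while the history below keeps growing (longer prompts), the prefix hit rate must fall session over session. The prompt template here is invented for illustration, not graphiti's actual format:

```python
def prompt_at(session: int) -> str:
    # Mutating block near the top breaks the prefix early; growing history
    # below lengthens the prompt. Both effects push the prefix rate down.
    system = "SYSTEM: temporal graph agent.\n"
    summary = f"GRAPH_SUMMARY: {session} episodes ingested.\n"
    history = "".join(f"<PREVIOUS_MESSAGES> episode {i} ...\n" for i in range(session))
    return system + summary + history

def prefix_frac(prev: str, cur: str) -> float:
    """Fraction of `cur` shared with `prev` as a leading prefix."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n / len(cur)

# Consecutive-session prefix hit rates: shared prefix is constant
# (everything before the mutated summary), denominator keeps growing.
rates = [prefix_frac(prompt_at(i - 1), prompt_at(i)) for i in range(1, 6)]
print(rates)
```

If the real traces show the same monotone decline for graphiti but a flat line for `openai_agents`, the memory-driven interpretation above holds.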
src/trace_collector/
common.py # env resolution, test corpus, TraceLogger
datasets.py # dataset loaders for matrix experiments
neo4j_metrics.py # workload breakdown logger + Neo4j query instrumentation
run_all.py # collector orchestrator
run_matrix.py # dataset x baseline trace orchestrator
mem0_collector.py # mem0 trace collection
graphiti_collector.py # graphiti trace collection
openai_base_collector.py # direct OpenAI baseline collection
tau2_collector.py # tau2 trace collection
analyze.py # wrapper around prefix_analysis.py
analyze_matrix.py # matrix trace analyzer
matrix_report.py # matrix markdown report generator
compare_chart.py # cross-system comparison chart
paper_figures.py # publication LaTeX tables + PDF figures
data/traces/
*/*.jsonl # raw traces
*/*_breakdown.jsonl # workload breakdown events (prompt, cypher, snapshots)
*_result/*.jsonl # substring match logs
*_result/*.png # per-system hit-rate plots
comparison_chart.png # combined chart
data/locomo/
locomo10.json # LoCoMo benchmark (10 conversations, snap-research/locomo)
docs/paper/
table*.tex # LaTeX tables (auto-generated by paper_figures.py)
fig*.pdf # PDF figures (auto-generated by paper_figures.py)
lmcache-agent-trace/
prefix_analysis.py # core algorithm (tokenize + prefix/substring scoring)
Breakdown db_snapshot stages:
- `before_collection`: full DB snapshot before a run
- `after_step`: lightweight per-item snapshot (`step` field included)
- `after_collection`: full DB snapshot after a run
- Python >= 3.10
- uv
- Docker + Docker Compose (for DozerDB/Neo4j)
- LLM API key and endpoint
- Install dependencies:

```shell
uv sync --dev
```

- Prepare Neo4j directories/plugins and start the DB:

```shell
./setup.sh
docker-compose up -d
```

- Create isolated databases:

```
CREATE DATABASE mem0store;
CREATE DATABASE graphitistore;
```

- Configure `.env`:

```
OPENAI_API_KEY=...
# Optional compatibility settings
GPU_API_KEY=...
GPU_ENDPOINT=https://api.openai.com/v1
GPU_MODEL=gpt-4o-mini
# Preferred runtime overrides
LLM_API_KEY=...
LLM_API_BASE=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
```

Environment variable resolution (first set value wins):

- `LLM_API_KEY` -> `GPU_API_KEY` -> `OPENAI_API_KEY`
- `LLM_API_BASE` -> `GPU_ENDPOINT` -> `OPENAI_BASE_URL` -> OpenAI default
- `LLM_MODEL` -> `GPU_MODEL`
- Verify endpoint capability (chat, tool calling, JSON mode):

```shell
uv run python -m src.trace_collector.test_endpoint
```

- Collect traces:

```shell
uv run python -m src.trace_collector.run_all --system all
```

- Analyze hit rates:

```shell
uv run python -m src.trace_collector.analyze --system all
```

- Build cross-system chart:

```shell
uv run python -m src.trace_collector.compare_chart
```

For core analysis only (recommended first pass):

```shell
uv run python -m src.trace_collector.run_all --system mem0
uv run python -m src.trace_collector.run_all --system graphiti
uv run python -m src.trace_collector.analyze --system mem0
uv run python -m src.trace_collector.analyze --system graphiti
```

For dataset x baseline matrix:

```shell
uv run python -m src.trace_collector.run_matrix --dataset all --baseline all --with-breakdown
uv run python -m src.trace_collector.analyze_matrix --dataset all --baseline all
uv run python -m src.trace_collector.matrix_report -o docs/matrix_breakdown.md
```

Use this order to break down results like a MemGPT-style report.
- Topline by system
  - Open `data/traces/comparison_chart.png`.
  - Compare `prefix`, `substring`, and `gap = substring - prefix`.
- Inspect per-system raw traces
  - Check prompt construction behavior in raw JSONL:
    - `input`: what changes turn-to-turn
    - `output`: tool calls / structured responses
- Inspect substring match logs
  - Read `*_matches.jsonl` for `InputLen` and `Matches[]` (`MatchStart`, `MatchEnd`, `PrevStep`, `PrevMatchStart`, `PrevMatchEnd`).
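Those fields are enough to turn one match record into a per-step coverage number. A hedged sketch, assuming match intervals within a record are non-overlapping (verify against the real logs before relying on it):

```python
import json

def coverage(match_record: dict) -> float:
    """Fraction of the input covered by substring matches for one step."""
    spans = match_record.get("Matches", [])
    covered = sum(m["MatchEnd"] - m["MatchStart"] for m in spans)
    return covered / match_record["InputLen"] if match_record["InputLen"] else 0.0

line = ('{"InputLen": 100, "Matches": [{"MatchStart": 0, "MatchEnd": 40,'
        ' "PrevStep": 3, "PrevMatchStart": 0, "PrevMatchEnd": 40}]}')
print(coverage(json.loads(line)))  # 0.4
```

Plotting this value per step is one quick way to see where substring reuse concentrates; `PrevStep` additionally tells you how far back the reused block came from.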
- Classify each system pattern
- High prefix, low gap: stable prompt prefixes.
- Low prefix, high substring: dynamic insertion/reordering with reusable blocks.
- Low both: little cross-call reuse.
For mem0 vs graphiti, prioritize these questions:
- Where does prefix break earliest?
- Which repeated blocks are recovered only by substring?
- Is the gap driven by context mutation, retrieval position shift, or tool output pattern?
- Write conclusions at two levels
- Architecture level: why that system creates this shape.
- Operations level: expected impact on prefill latency and compute reuse.
If your goal is framework-level understanding, read in this order:
1. `src/trace_collector/run_all.py`
2. `src/trace_collector/mem0_collector.py`
3. `src/trace_collector/graphiti_collector.py`
4. `src/trace_collector/tau2_collector.py`
5. `src/trace_collector/common.py`
6. `src/trace_collector/analyze.py`
7. `lmcache-agent-trace/prefix_analysis.py`
8. `src/trace_collector/compare_chart.py`
Run project tests only:

```shell
uv run pytest -q
```

pytest ignores reference/generated directories (codebase, vendor, data, lmcache-agent-trace) and collects from tests/.
- Treat runtime analysis outputs as generated artifacts:
  - `data/traces/*_result/`
  - `data/traces/*/*.jsonl.bak*`
- Keep intentional fixtures only.
- Prefer committing scripts/config that regenerate outputs over large result blobs.