Trace collection and cache-hit analysis workspace for memory-augmented agent workloads.
This repository compares prompt-cache behavior across:
- `mem0`: graph memory
- `graphiti`: temporal graph memory
- `tau2-bench`: conversational workloads (airline, retail, telecom)
- `LoCoMo`: long-term memory benchmark (multi-session dialogues)
The core question is:
- "When prompt prefixes are unstable, how much can substring/block caching recover?"
Primary focus:
- `mem0` (graph) vs `graphiti` is the main comparison.
- `tau2` domains are supporting baselines, not the main comparison target.
- `LoCoMo` fills a gap that independent-item datasets cannot cover (see below).
End-to-end pipeline:
- Run agent workloads and intercept every LLM call.
- Normalize calls into a shared JSONL trace format.
- Run `lmcache-agent-trace/prefix_analysis.py`.
- Compare `prefix` vs `substring` hit rates and their gap.
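The prefix-vs-substring distinction can be illustrated with a toy character-level sketch. This is not the algorithm in `prefix_analysis.py` (which tokenizes and scores against all previous steps); it is a simplified stand-in that compares each prompt only to the previous one, using stdlib `difflib` for block matching:

```python
import difflib

def prefix_hit(prev: str, cur: str) -> float:
    """Fraction of `cur` covered by the shared leading prefix with `prev`."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n / len(cur) if cur else 0.0

def substring_hit(prev: str, cur: str, min_len: int = 8) -> float:
    """Fraction of `cur` covered by matching blocks found anywhere in `prev`."""
    blocks = difflib.SequenceMatcher(None, prev, cur).get_matching_blocks()
    covered = sum(b.size for b in blocks if b.size >= min_len)
    return covered / len(cur) if cur else 0.0

prev = "SYSTEM: extract entities.\nCONTEXT: Alice met Bob.\nQ: who?"
cur  = "SYSTEM: extract entities.\nCONTEXT: Bob met Carol.\nQ: who?"
# Prefix caching only reuses text up to the first divergence; substring
# matching also recovers the repeated trailing block after the mutation.
print(f"prefix={prefix_hit(prev, cur):.2f} substring={substring_hit(prev, cur):.2f}")
```

The gap between the two numbers is exactly the "recoverable by block caching but lost to prefix caching" quantity this repository measures.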
Matrix extension:
- Run `dataset x baseline` experiments with `openai_base`, `mem0`, and `graphiti`.
- Datasets include `corpus50`, `tau2_*`, `taubench_legacy`, and `locomo`.
- In matrix mode, `tau2_*` and `taubench_legacy` are replayed as prompt-text datasets (not full tau2 environment simulation loops).
- `locomo` runs per-conversation (graph resets between conversations, accumulates within).
Data flow:
collector -> trace jsonl -> prefix_analysis -> matches jsonl + plot -> comparison chart
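The "trace jsonl" stage writes one JSON object per LLM call. A minimal sketch of such a normalizer follows; the field names (`step`, `system`, `input`, `output`, `ts`) are illustrative assumptions, the real schema lives in `src/trace_collector/common.py`:

```python
import json
import tempfile
import time
from pathlib import Path

def write_trace(path: Path, records: list[dict]) -> None:
    """Append normalized LLM-call records as one JSON object per line (JSONL).

    Field names here are illustrative; see src/trace_collector/common.py
    for the actual trace schema used by the collectors.
    """
    with path.open("a", encoding="utf-8") as f:
        for step, rec in enumerate(records):
            row = {
                "step": step,                          # call index within the session
                "system": rec.get("system", "unknown"),
                "input": rec["input"],                 # full prompt text sent to the LLM
                "output": rec["output"],               # completion / tool-call payload
                "ts": rec.get("ts", time.time()),
            }
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Demo: write two records to a temporary session file.
path = Path(tempfile.mkdtemp()) / "demo_session.jsonl"
write_trace(path, [
    {"system": "mem0", "input": "extract entities: Alice met Bob", "output": "..."},
    {"system": "mem0", "input": "extract entities: Bob met Carol", "output": "..."},
])
```

Because every collector emits this same shape, `prefix_analysis.py` can stay collector-agnostic.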
| Layer | Role | Key files |
|---|---|---|
| Collector | Executes workload and captures LLM calls | src/trace_collector/*_collector.py |
| Normalizer | Writes unified trace schema | src/trace_collector/common.py |
| Analyzer | Runs LMCache analysis script | src/trace_collector/analyze.py |
| Aggregator | Builds cross-system chart | src/trace_collector/compare_chart.py |
| Analysis engine | Prefix/substring hit computation | lmcache-agent-trace/prefix_analysis.py |
Interception strategies by system:
| System | How calls are intercepted | Output trace |
|---|---|---|
| mem0 | OpenAI `response_callback` | `data/traces/mem0_graph/mem0_graph_session.jsonl` |
| graphiti | `OpenAIGenericClient` subclass override | `data/traces/graphiti_graph/graphiti_graph_session.jsonl` |
| tau2 | `litellm.completion` monkeypatch | `data/traces/tau2_<domain>/tau2_<domain>_session.jsonl` |
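The monkeypatch strategy used for tau2 boils down to swapping the library's completion entry point for a logging wrapper. A self-contained sketch of the pattern, using a stand-in namespace instead of the real `litellm` module:

```python
import types

# Stand-in for the patched library; the real collector targets litellm.completion.
fake_llm = types.SimpleNamespace(
    completion=lambda model, messages: {"choices": [{"text": "stub"}]}
)

captured: list[dict] = []

def install_trace_hook(lib) -> None:
    """Replace lib.completion with a wrapper that records every call."""
    original = lib.completion

    def traced(model, messages):
        result = original(model, messages)
        captured.append({"model": model, "messages": messages, "result": result})
        return result

    lib.completion = traced

install_trace_hook(fake_llm)
fake_llm.completion("gpt-4o-mini", [{"role": "user", "content": "hi"}])
```

The same wrap-and-record shape underlies the other two strategies; they differ only in where the hook attaches (callback vs subclass vs module attribute).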
If you want to analyze the core framework behavior, start here:
- `mem0`- and `graphiti`-only collection
- Per-system prefix/substring result comparison
- Gap interpretation at the architecture level
Fast path commands:

```shell
uv run python -m src.trace_collector.run_all --system mem0
uv run python -m src.trace_collector.run_all --system graphiti
uv run python -m src.trace_collector.analyze --system mem0
uv run python -m src.trace_collector.analyze --system graphiti
```

Matrix mode commands:
```shell
# 1) Collect matrix traces (all datasets x all baselines)
uv run python -m src.trace_collector.run_matrix --dataset all --baseline all --with-breakdown

# 2) Analyze matrix traces
uv run python -m src.trace_collector.analyze_matrix --dataset all --baseline all

# 3) Build markdown report
uv run python -m src.trace_collector.matrix_report -o docs/matrix_breakdown.md
```

Core outputs:
- `data/traces/mem0_graph/mem0_graph_session.jsonl`
- `data/traces/graphiti_graph/graphiti_graph_session.jsonl`
- `data/traces/mem0_result/mem0_matches.jsonl`
- `data/traces/graphiti_result/graphiti_matches.jsonl`
Interpretation guide:
- `mem0`: high prefix + small gap -> stable scaffold/template reuse
- `graphiti`: lower prefix + larger gap -> dynamic context injection with reusable moved blocks
All datasets feed into the same pipeline (collector → trace → prefix_analysis), but they stress different dimensions of prompt-cache behavior.
Independent-item datasets (no cross-item continuity):
| Dataset | Items | Avg size | What it measures |
|---|---|---|---|
| `corpus50` | 50 factual statements | 141 chars | System-prompt reuse under short, independent entity extraction |
| `tau2_airline` | 50 CS tasks | ~1.6k tokens | Domain-specific tool-calling prompt patterns |
| `tau2_retail` | 50 CS tasks | ~2.3k tokens | Same as airline, different domain vocabulary |
| `tau2_telecom` | 50 CS tasks | ~850 tokens | Same, telecom domain |
| `taubench_legacy` | 471 raw LLM calls | 663 chars | Legacy baseline replay (no live agent loop) |
These datasets share a common limitation: items are independent, so graph memory accumulation effects are weak. The graph grows, but each item doesn't reference or build upon a previous item's context.
Sequential multi-session dataset (cross-session continuity):
| Dataset | Items | Avg size | What it measures |
|---|---|---|---|
| `locomo` | 272 sessions (10 conversations × ~27 sessions) | 2,893 chars | Long-term memory accumulation across months of dialogue |
LoCoMo fills a gap that independent-item datasets cannot cover:
- Memory accumulation over time: within each conversation, sessions span months, and graph memory grows across sessions, so later sessions can reference facts from earlier ones. This is the real use case for graph memory systems.
- Prompt size inflation: for graphiti, the `<PREVIOUS_MESSAGES>` episode history grows with each session, so prefix cache hit rates should degrade over time within a conversation, a pattern invisible in independent-item datasets.
- Control group comparison: `openai_agents` (no memory system) should show no cross-session effect, confirming that observed patterns are memory-driven.
The per-conversation collection strategy ensures graph state resets between different people's conversations while accumulating within the same conversation:
```
conv_0 (Caroline & Melanie, 19 sessions, May→Oct 2023)
  → mem0/graphiti graph builds up across 19 sessions
  → graph resets
conv_1 (Jon & Gina, 19 sessions, Jan→Jul 2023)
  → fresh graph, builds up across 19 sessions
  → ...
```
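The reset-between / accumulate-within policy is a simple two-level loop. In this sketch, `reset_graph` and `ingest_session` are hypothetical stand-ins for the real collector calls:

```python
def run_locomo(conversations, reset_graph, ingest_session):
    """Replay each conversation's sessions in order, resetting graph state
    between conversations so memory accumulates only within one dialogue."""
    for conv in conversations:
        reset_graph()                 # fresh graph per conversation
        for session in conv["sessions"]:
            ingest_session(session)   # memory accumulates within the conversation

# Demo with recording stubs in place of the real graph backend.
resets, ingested = [], []
run_locomo(
    [{"sessions": ["s1", "s2"]}, {"sessions": ["s3"]}],
    reset_graph=lambda: resets.append(1),
    ingest_session=ingested.append,
)
```

Resetting between conversations is what makes the cross-session trends attributable to memory accumulation rather than to leakage across unrelated dialogues.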
Expected observations:
- graphiti: prefix hit rate drops session-over-session; substring gap widens
- mem0: entity/relation graph enriches; substring reuse increases with graph density
- openai_agents: flat across sessions (no memory = no accumulation)
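The graphiti expectation can be sanity-checked with a toy model: if a summary block near the top of the prompt mutates every session (early divergence) while the history below keeps growing (longer prompts), the prefix hit rate must fall session over session. The prompt template here is invented for illustration, not graphiti's actual format:

```python
def prompt_at(session: int) -> str:
    # Mutating block near the top breaks the prefix early; growing history
    # below lengthens the prompt. Both effects push the prefix rate down.
    system = "SYSTEM: temporal graph agent.\n"
    summary = f"GRAPH_SUMMARY: {session} episodes ingested.\n"
    history = "".join(f"<PREVIOUS_MESSAGES> episode {i} ...\n" for i in range(session))
    return system + summary + history

def prefix_frac(prev: str, cur: str) -> float:
    """Fraction of `cur` shared with `prev` as a leading prefix."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n / len(cur)

# Consecutive-session prefix hit rates: shared prefix is constant
# (everything before the mutated summary), denominator keeps growing.
rates = [prefix_frac(prompt_at(i - 1), prompt_at(i)) for i in range(1, 6)]
print(rates)
```

If the real traces show the same monotone decline for graphiti but a flat line for `openai_agents`, the memory-driven interpretation above holds.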
src/trace_collector/
common.py # env resolution, test corpus, TraceLogger
datasets.py # dataset loaders for matrix experiments
neo4j_metrics.py # workload breakdown logger + Neo4j query instrumentation
run_all.py # collector orchestrator
run_matrix.py # dataset x baseline trace orchestrator
mem0_collector.py # mem0 trace collection
graphiti_collector.py # graphiti trace collection
openai_base_collector.py # direct OpenAI baseline collection
tau2_collector.py # tau2 trace collection
analyze.py # wrapper around prefix_analysis.py
analyze_matrix.py # matrix trace analyzer
matrix_report.py # matrix markdown report generator
compare_chart.py # cross-system comparison chart
paper_figures.py # publication LaTeX tables + PDF figures
data/traces/
*/*.jsonl # raw traces
*/*_breakdown.jsonl # workload breakdown events (prompt, cypher, snapshots)
*_result/*.jsonl # substring match logs
*_result/*.png # per-system hit-rate plots
comparison_chart.png # combined chart
data/locomo/
locomo10.json # LoCoMo benchmark (10 conversations, snap-research/locomo)
docs/paper/
table*.tex # LaTeX tables (auto-generated by paper_figures.py)
fig*.pdf # PDF figures (auto-generated by paper_figures.py)
lmcache-agent-trace/
prefix_analysis.py # core algorithm (tokenize + prefix/substring scoring)
Breakdown db_snapshot stages:
- `before_collection`: full DB snapshot before a run
- `after_step`: lightweight per-item snapshot (`step` field included)
- `after_collection`: full DB snapshot after a run
- Python >= 3.10
- uv
- Docker + Docker Compose (for DozerDB/Neo4j)
- LLM API key and endpoint
- Install dependencies:

```shell
uv sync --dev
```

- Prepare Neo4j directories/plugins and start the DB:

```shell
./setup.sh
docker-compose up -d
```

- Create isolated databases:

```
CREATE DATABASE mem0store;
CREATE DATABASE graphitistore;
```

- Configure `.env`:

```
OPENAI_API_KEY=...
# Optional compatibility settings
GPU_API_KEY=...
GPU_ENDPOINT=https://api.openai.com/v1
GPU_MODEL=gpt-4o-mini
# Preferred runtime overrides
LLM_API_KEY=...
LLM_API_BASE=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password
```

Environment variable resolution (first set value wins):

- `LLM_API_KEY` -> `GPU_API_KEY` -> `OPENAI_API_KEY`
- `LLM_API_BASE` -> `GPU_ENDPOINT` -> `OPENAI_BASE_URL` -> OpenAI default
- `LLM_MODEL` -> `GPU_MODEL`
- Verify endpoint capability (chat, tool calling, JSON mode):

```shell
uv run python -m src.trace_collector.test_endpoint
```

- Collect traces:

```shell
uv run python -m src.trace_collector.run_all --system all
```

- Analyze hit rates:

```shell
uv run python -m src.trace_collector.analyze --system all
```

- Build cross-system chart:

```shell
uv run python -m src.trace_collector.compare_chart
```

For core analysis only (recommended first pass):

```shell
uv run python -m src.trace_collector.run_all --system mem0
uv run python -m src.trace_collector.run_all --system graphiti
uv run python -m src.trace_collector.analyze --system mem0
uv run python -m src.trace_collector.analyze --system graphiti
```

For dataset x baseline matrix:

```shell
uv run python -m src.trace_collector.run_matrix --dataset all --baseline all --with-breakdown
uv run python -m src.trace_collector.analyze_matrix --dataset all --baseline all
uv run python -m src.trace_collector.matrix_report -o docs/matrix_breakdown.md
```

Use this order to break down results like a MemGPT-style report.
- Topline by system
  - Open `data/traces/comparison_chart.png`.
  - Compare `prefix`, `substring`, and `gap = substring - prefix`.
- Inspect per-system raw traces
  - Check prompt construction behavior in raw JSONL:
    - `input`: what changes turn-to-turn
    - `output`: tool calls / structured responses
- Inspect substring match logs
  - Read `*_matches.jsonl` for `InputLen` and `Matches[]` (`MatchStart`, `MatchEnd`, `PrevStep`, `PrevMatchStart`, `PrevMatchEnd`).
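Those fields are enough to turn one match record into a per-step coverage number. A hedged sketch, assuming match intervals within a record are non-overlapping (verify against the real logs before relying on it):

```python
import json

def coverage(match_record: dict) -> float:
    """Fraction of the input covered by substring matches for one step."""
    spans = match_record.get("Matches", [])
    covered = sum(m["MatchEnd"] - m["MatchStart"] for m in spans)
    return covered / match_record["InputLen"] if match_record["InputLen"] else 0.0

line = ('{"InputLen": 100, "Matches": [{"MatchStart": 0, "MatchEnd": 40,'
        ' "PrevStep": 3, "PrevMatchStart": 0, "PrevMatchEnd": 40}]}')
print(coverage(json.loads(line)))  # 0.4
```

Plotting this value per step is one quick way to see where substring reuse concentrates; `PrevStep` additionally tells you how far back the reused block came from.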
- Classify each system pattern
- High prefix, low gap: stable prompt prefixes.
- Low prefix, high substring: dynamic insertion/reordering with reusable blocks.
- Low both: little cross-call reuse.
For mem0 vs graphiti, prioritize these questions:
- Where does prefix break earliest?
- Which repeated blocks are recovered only by substring?
- Is the gap driven by context mutation, retrieval position shift, or tool output pattern?
- Write conclusions at two levels
- Architecture level: why that system creates this shape.
- Operations level: expected impact on prefill latency and compute reuse.
If your goal is framework-level understanding, read in this order:
1. `src/trace_collector/run_all.py`
2. `src/trace_collector/mem0_collector.py`
3. `src/trace_collector/graphiti_collector.py`
4. `src/trace_collector/tau2_collector.py`
5. `src/trace_collector/common.py`
6. `src/trace_collector/analyze.py`
7. `lmcache-agent-trace/prefix_analysis.py`
8. `src/trace_collector/compare_chart.py`
Run project tests only:

```shell
uv run pytest -q
```

pytest ignores reference/generated directories (codebase, vendor, data, lmcache-agent-trace) and collects from tests/.
- Treat runtime analysis outputs as generated artifacts:
  - `data/traces/*_result/`
  - `data/traces/*/*.jsonl.bak*`
- Keep intentional fixtures only.
- Prefer committing scripts/config that regenerate outputs over large result blobs.