tteon/kvcache-lab

lmcache-contributor

Trace collection and cache-hit analysis workspace for memory-augmented agent workloads.

This repository compares prompt-cache behavior across:

  • mem0 graph memory
  • graphiti temporal graph memory
  • tau2-bench conversational workloads (airline, retail, telecom)
  • LoCoMo long-term memory benchmark (multi-session dialogues)

The core question is:

  • "When prompt prefixes are unstable, how much can substring/block caching recover?"

Primary focus:

  • mem0 (graph) vs graphiti
  • tau2 domains are supporting baselines, not the main comparison target.
  • LoCoMo fills a gap that independent-item datasets cannot cover (see below).

What This Framework Does

End-to-end pipeline:

  1. Run agent workloads and intercept every LLM call.
  2. Normalize calls into a shared JSONL trace format.
  3. Run lmcache-agent-trace/prefix_analysis.py.
  4. Compare prefix vs substring hit rates and their gap.
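
The two metrics in step 4 can be illustrated with a small sketch. This is not the actual algorithm in lmcache-agent-trace/prefix_analysis.py; the token lists, block size, and scoring below are illustrative assumptions. Prefix hit only credits shared leading tokens, while substring/block hit also credits reused blocks that moved position.

```python
def prefix_hit(prev, cur):
    # fraction of the current prompt covered by the shared leading run
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n / len(cur)

def substring_hit(prev, cur, block=4):
    # fraction of the current prompt covered by `block`-token chunks
    # that appeared anywhere in the previous prompt
    prev_blocks = {tuple(prev[i:i + block]) for i in range(len(prev) - block + 1)}
    covered, i = 0, 0
    while i <= len(cur) - block:
        if tuple(cur[i:i + block]) in prev_blocks:
            covered += block
            i += block
        else:
            i += 1
    return covered / len(cur)
```

When a reused system prompt is pushed down by newly injected context, prefix_hit collapses to zero while substring_hit stays high; that difference is the "gap" the pipeline measures.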

Matrix extension:

  • Run dataset x baseline experiments with openai_base, mem0, graphiti.
  • Datasets include corpus50, tau2_*, taubench_legacy, and locomo.
  • In matrix mode, tau2_* and taubench_legacy are replayed as prompt-text datasets (not full tau2 environment simulation loops).
  • locomo runs per-conversation (graph resets between conversations, accumulates within).

Data flow:

collector -> trace jsonl -> prefix_analysis -> matches jsonl + plot -> comparison chart
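
A hedged sketch of what one normalized trace record might look like. The field names here are assumptions for illustration; the actual schema is defined in src/trace_collector/common.py.

```python
import json

# one illustrative trace record; exact field names may differ from common.py
record = {
    "step": 0,
    "system": "mem0",
    "input": [
        {"role": "system", "content": "Extract entities and relations."},
        {"role": "user", "content": "Caroline moved to Boston."},
    ],
    "output": {"tool_calls": []},
}

# traces are JSONL: one JSON object per line
line = json.dumps(record)
```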

Framework Components

| Layer | Role | Key files |
| --- | --- | --- |
| Collector | Executes workload and captures LLM calls | src/trace_collector/*_collector.py |
| Normalizer | Writes unified trace schema | src/trace_collector/common.py |
| Analyzer | Runs LMCache analysis script | src/trace_collector/analyze.py |
| Aggregator | Builds cross-system chart | src/trace_collector/compare_chart.py |
| Analysis engine | Prefix/substring hit computation | lmcache-agent-trace/prefix_analysis.py |

Interception strategies by system:

| System | How calls are intercepted | Output trace |
| --- | --- | --- |
| mem0 | OpenAI response_callback | data/traces/mem0_graph/mem0_graph_session.jsonl |
| graphiti | OpenAIGenericClient subclass override | data/traces/graphiti_graph/graphiti_graph_session.jsonl |
| tau2 | litellm.completion monkeypatch | data/traces/tau2_<domain>/tau2_<domain>_session.jsonl |
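
The tau2 interception pattern (monkeypatching the LLM call site) can be sketched generically. litellm itself is not imported here; `completion` below is a stand-in, and the logged field names are illustrative assumptions rather than the collector's real schema.

```python
import functools

def make_traced(fn, log):
    # wrap an LLM-call function so every call is appended to `log`
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        log.append({"input": kwargs.get("messages"), "output": result})
        return result
    return wrapper

# stand-in for litellm.completion; a real collector would patch in place:
#   litellm.completion = make_traced(litellm.completion, trace_log)
def completion(model=None, messages=None, **kwargs):
    return {"choices": [{"message": {"content": "ok"}}]}

trace_log = []
completion = make_traced(completion, trace_log)
completion(model="gpt-4o-mini", messages=[{"role": "user", "content": "hi"}])
```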

Main Comparison: mem0 (graph) vs graphiti

If you want to analyze the core framework behavior, start here:

  1. Collect traces for mem0 and graphiti only.
  2. Compare per-system prefix/substring results.
  3. Interpret the gap at the architecture level.

Fast path commands:

uv run python -m src.trace_collector.run_all --system mem0
uv run python -m src.trace_collector.run_all --system graphiti

uv run python -m src.trace_collector.analyze --system mem0
uv run python -m src.trace_collector.analyze --system graphiti

Matrix mode commands:

# 1) Collect matrix traces (all datasets x all baselines)
uv run python -m src.trace_collector.run_matrix --dataset all --baseline all --with-breakdown

# 2) Analyze matrix traces
uv run python -m src.trace_collector.analyze_matrix --dataset all --baseline all

# 3) Build markdown report
uv run python -m src.trace_collector.matrix_report -o docs/matrix_breakdown.md

Core outputs:

  • data/traces/mem0_graph/mem0_graph_session.jsonl
  • data/traces/graphiti_graph/graphiti_graph_session.jsonl
  • data/traces/mem0_result/mem0_matches.jsonl
  • data/traces/graphiti_result/graphiti_matches.jsonl

Interpretation guide:

  • mem0: high prefix + small gap -> stable scaffold/template reuse
  • graphiti: lower prefix + larger gap -> dynamic context injection with reusable blocks that shift position
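
The gap referenced above is just substring minus prefix. The numbers below are illustrative placeholders, not measured results:

```python
def cache_gap(prefix_rate, substring_rate):
    # gap = reuse recoverable only by substring/block matching
    return substring_rate - prefix_rate

# illustrative numbers, not measurements from this repository
mem0_gap = cache_gap(0.82, 0.88)      # small gap: stable templates
graphiti_gap = cache_gap(0.35, 0.74)  # large gap: reusable blocks that moved
```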

Dataset Design: What Each Workload Tests

All datasets feed into the same pipeline (collector → trace → prefix_analysis), but they stress different dimensions of prompt-cache behavior.

Independent-item datasets (no cross-item continuity):

| Dataset | Items | Avg size | What it measures |
| --- | --- | --- | --- |
| corpus50 | 50 factual statements | 141 chars | System-prompt reuse under short, independent entity extraction |
| tau2_airline | 50 CS tasks | ~1.6k tokens | Domain-specific tool-calling prompt patterns |
| tau2_retail | 50 CS tasks | ~2.3k tokens | Same as airline, different domain vocabulary |
| tau2_telecom | 50 CS tasks | ~850 tokens | Same, telecom domain |
| taubench_legacy | 471 raw LLM calls | 663 chars | Legacy baseline replay (no live agent loop) |

These datasets share a common limitation: items are independent, so graph memory accumulation effects are weak. The graph grows, but each item doesn't reference or build upon a previous item's context.

Sequential multi-session dataset (cross-session continuity):

| Dataset | Items | Avg size | What it measures |
| --- | --- | --- | --- |
| locomo | 272 sessions (10 conversations × ~27 sessions) | 2,893 chars | Long-term memory accumulation across months of dialogue |

LoCoMo fills a gap that independent-item datasets cannot cover:

  1. Memory accumulation over time: Within each conversation, sessions span months. Graph memory grows across sessions — later sessions can reference facts from earlier ones. This is the real use case for graph memory systems.

  2. Prompt size inflation: For graphiti, the <PREVIOUS_MESSAGES> episode history grows with each session. This means prefix cache hit rates should degrade over time within a conversation — a pattern invisible in independent-item datasets.

  3. Control group comparison: openai_agents (no memory system) should show no cross-session effect, confirming that observed patterns are memory-driven.

The per-conversation collection strategy ensures graph state resets between different people's conversations while accumulating within the same conversation:

conv_0 (Caroline & Melanie, 19 sessions, May→Oct 2023)
  → mem0/graphiti graph builds up across 19 sessions
  → graph resets
conv_1 (Jon & Gina, 19 sessions, Jan→Jul 2023)
  → fresh graph, builds up across 19 sessions
  → ...
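
The reset/accumulate strategy above can be sketched as follows, with a plain set standing in for the mem0/graphiti graph store (all names and data shapes here are illustrative):

```python
def run_locomo(conversations):
    # per-conversation strategy: fresh graph per conversation,
    # memory accumulates across that conversation's sessions
    results = []
    for conv in conversations:
        graph = set()  # stand-in for a mem0/graphiti graph store
        for session in conv["sessions"]:
            graph.update(session["facts"])  # grows within the conversation
            results.append((conv["id"], len(graph)))
        # graph goes out of scope here: state resets between conversations
    return results

convs = [
    {"id": "conv_0", "sessions": [{"facts": {"a"}}, {"facts": {"a", "b"}}]},
    {"id": "conv_1", "sessions": [{"facts": {"c"}}]},
]
sizes = run_locomo(convs)
```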

Expected observations:

  • graphiti: prefix hit rate drops session-over-session; substring gap widens
  • mem0: entity/relation graph enriches; substring reuse increases with graph density
  • openai_agents: flat across sessions (no memory = no accumulation)

Repository Layout

src/trace_collector/
  common.py            # env resolution, test corpus, TraceLogger
  datasets.py          # dataset loaders for matrix experiments
  neo4j_metrics.py     # workload breakdown logger + Neo4j query instrumentation
  run_all.py           # collector orchestrator
  run_matrix.py        # dataset x baseline trace orchestrator
  mem0_collector.py    # mem0 trace collection
  graphiti_collector.py # graphiti trace collection
  openai_base_collector.py # direct OpenAI baseline collection
  tau2_collector.py    # tau2 trace collection
  analyze.py           # wrapper around prefix_analysis.py
  analyze_matrix.py    # matrix trace analyzer
  matrix_report.py     # matrix markdown report generator
  compare_chart.py     # cross-system comparison chart
  paper_figures.py     # publication LaTeX tables + PDF figures

data/traces/
  */*.jsonl            # raw traces
  */*_breakdown.jsonl  # workload breakdown events (prompt, cypher, snapshots)
  *_result/*.jsonl     # substring match logs
  *_result/*.png       # per-system hit-rate plots
  comparison_chart.png # combined chart

data/locomo/
  locomo10.json        # LoCoMo benchmark (10 conversations, snap-research/locomo)

docs/paper/
  table*.tex           # LaTeX tables (auto-generated by paper_figures.py)
  fig*.pdf             # PDF figures (auto-generated by paper_figures.py)

lmcache-agent-trace/
  prefix_analysis.py   # core algorithm (tokenize + prefix/substring scoring)

Breakdown db_snapshot stages:

  • before_collection: full DB snapshot before a run
  • after_step: lightweight per-item snapshot (step field included)
  • after_collection: full DB snapshot after a run
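
An illustrative shape for these breakdown events as JSONL. Only the stage values and the step field come from this README; the other field names are assumptions:

```python
import json

# illustrative db_snapshot events; field names beyond `stage`/`step` are guesses
events = [
    {"type": "db_snapshot", "stage": "before_collection"},
    {"type": "db_snapshot", "stage": "after_step", "step": 3},
    {"type": "db_snapshot", "stage": "after_collection"},
]
lines = [json.dumps(e) for e in events]  # one event per JSONL line
```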

Requirements

  • Python >=3.10
  • uv
  • Docker + Docker Compose (for DozerDB/Neo4j)
  • LLM API key and endpoint

Setup

  1. Install dependencies:
uv sync --dev
  2. Prepare Neo4j directories/plugins and start the DB:
./setup.sh
docker-compose up -d
  3. Create isolated databases:
CREATE DATABASE mem0store;
CREATE DATABASE graphitistore;
  4. Configure .env:
OPENAI_API_KEY=...

# Optional compatibility settings
GPU_API_KEY=...
GPU_ENDPOINT=https://api.openai.com/v1
GPU_MODEL=gpt-4o-mini

# Preferred runtime overrides
LLM_API_KEY=...
LLM_API_BASE=https://api.openai.com/v1
LLM_MODEL=gpt-4o-mini

NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=password

Environment variable resolution:

  • LLM_API_KEY -> GPU_API_KEY -> OPENAI_API_KEY
  • LLM_API_BASE -> GPU_ENDPOINT -> OPENAI_BASE_URL -> OpenAI default
  • LLM_MODEL -> GPU_MODEL
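
A minimal sketch of this fallback chain. The actual resolution logic lives in src/trace_collector/common.py; `resolve_env` is a hypothetical helper, not the repository's API:

```python
import os

def resolve_env(*names, default=None):
    # return the first variable in the fallback chain that is set and non-empty
    for name in names:
        value = os.environ.get(name)
        if value:
            return value
    return default

# mirrors the chains above; the default base URL is the standard OpenAI endpoint
api_key = resolve_env("LLM_API_KEY", "GPU_API_KEY", "OPENAI_API_KEY")
api_base = resolve_env("LLM_API_BASE", "GPU_ENDPOINT", "OPENAI_BASE_URL",
                       default="https://api.openai.com/v1")
model = resolve_env("LLM_MODEL", "GPU_MODEL")
```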

Run Workflow

  1. Verify endpoint capability (chat, tool calling, JSON mode):
uv run python -m src.trace_collector.test_endpoint
  2. Collect traces:
uv run python -m src.trace_collector.run_all --system all
  3. Analyze hit rates:
uv run python -m src.trace_collector.analyze --system all
  4. Build cross-system chart:
uv run python -m src.trace_collector.compare_chart

For core analysis only (recommended first pass):

uv run python -m src.trace_collector.run_all --system mem0
uv run python -m src.trace_collector.run_all --system graphiti
uv run python -m src.trace_collector.analyze --system mem0
uv run python -m src.trace_collector.analyze --system graphiti

For dataset x baseline matrix:

uv run python -m src.trace_collector.run_matrix --dataset all --baseline all --with-breakdown
uv run python -m src.trace_collector.analyze_matrix --dataset all --baseline all
uv run python -m src.trace_collector.matrix_report -o docs/matrix_breakdown.md

How To Analyze The Whole Framework

Use this order to break down results like a MemGPT-style report.

  1. Topline by system
  • Open data/traces/comparison_chart.png.
  • Compare prefix, substring, and gap = substring - prefix.
  2. Inspect per-system raw traces
  • Check prompt construction behavior in the raw JSONL:
    • input: what changes turn-to-turn
    • output: tool calls / structured responses
  3. Inspect substring match logs
  • Read *_matches.jsonl for:
    • InputLen
    • Matches[] (MatchStart, MatchEnd, PrevStep, PrevMatchStart, PrevMatchEnd)
  4. Classify each system's pattern
  • High prefix, low gap: stable prompt prefixes.
  • Low prefix, high substring: dynamic insertion/reordering with reusable blocks.
  • Low both: little cross-call reuse.

For mem0 vs graphiti, prioritize these questions:

  • Where does the prefix break earliest?
  • Which repeated blocks are recovered only by substring matching?
  • Is the gap driven by context mutation, retrieval position shift, or tool output patterns?

  5. Write conclusions at two levels
  • Architecture level: why that system produces this shape.
  • Operations level: expected impact on prefill latency and compute reuse.
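
The classification patterns above can be captured in a small hedged helper. The thresholds here are arbitrary assumptions for illustration, not calibrated values from this framework:

```python
def classify(prefix, substring, hi=0.6, lo=0.3):
    # hi/lo thresholds are illustrative, not calibrated
    gap = substring - prefix
    if prefix >= hi and gap <= lo:
        return "stable prompt prefixes"
    if prefix < hi and substring >= hi:
        return "dynamic insertion/reordering with reusable blocks"
    return "little cross-call reuse"
```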

Code Reading Order (Recommended)

If your goal is framework-level understanding, read in this order:

  1. src/trace_collector/run_all.py
  2. src/trace_collector/mem0_collector.py
  3. src/trace_collector/graphiti_collector.py
  4. src/trace_collector/tau2_collector.py
  5. src/trace_collector/common.py
  6. src/trace_collector/analyze.py
  7. lmcache-agent-trace/prefix_analysis.py
  8. src/trace_collector/compare_chart.py

Testing

Run project tests only:

uv run pytest -q

pytest ignores reference/generated directories (codebase, vendor, data, lmcache-agent-trace) and collects from tests/.

Artifact Policy

  • Treat runtime analysis outputs as generated artifacts:
    • data/traces/*_result/
    • data/traces/*/*.jsonl.bak*
  • Keep intentional fixtures only.
  • Prefer committing scripts/config that regenerate outputs over large result blobs.
