Reserve KV cache slots for concurrent decode in V2 #15462

Open
Kevin-Li-2025 wants to merge 1 commit into NVIDIA:main from Kevin-Li-2025:kevin/fix-kv-cache-windowed-min-slots

Kevin-Li-2025 commented Jun 17, 2026

Description

Fixes #15401.

KVCacheManagerV2 can under-reserve windowed pool slots when capacity planning only sees long-history requests. For small sliding windows, the stale range can leave a windowed pool with a min-slot floor of 1, which can deadlock scheduling once concurrent decode requests exceed that single slot.

This adds a generic concurrent-decode constraint to the V2 cache config: max_batch_size requests, each holding one token block (capacity tokens_per_block) with history_length = tokens_per_block - 1. Each decode request needs one slot in every pool group, so the constraint raises the min-slot floor to max_batch_size without changing scheduler behavior.

The config also sets the existing StorageManager fallback typical-step explicitly, so adding the constraint does not accidentally switch ratio selection to constraint-only sizing.
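
As a rough illustration of the constraint described above, here is a minimal sketch. The dataclasses are simplified stand-ins, not the real TensorRT-LLM types: only the field names mentioned in this thread (capacity, history_length, kv_caches) are modeled. The function names build_concurrent_decode_constraint and toy_min_slots are hypothetical, and the slot-counting rule in toy_min_slots is an assumption made for illustration, not the StorageManager's actual sizing logic.

```python
from dataclasses import dataclass
from typing import Tuple


# Simplified stand-ins for the real descriptor types (assumption: the real
# classes carry more fields and may be constructed differently).
@dataclass(frozen=True)
class KVCacheDesc:
    capacity: int
    history_length: int


@dataclass(frozen=True)
class BatchDesc:
    kv_caches: Tuple[KVCacheDesc, ...]


def build_concurrent_decode_constraint(max_batch_size: int,
                                       tokens_per_block: int) -> BatchDesc:
    """Hypothetical helper: max_batch_size requests, one token block each."""
    if max_batch_size < 1 or tokens_per_block < 1:
        raise ValueError("max_batch_size and tokens_per_block must be positive")
    one_block = KVCacheDesc(capacity=tokens_per_block,
                            history_length=tokens_per_block - 1)
    return BatchDesc(kv_caches=(one_block,) * max_batch_size)


def toy_min_slots(constraint: BatchDesc, tokens_per_block: int) -> int:
    """Toy model (assumption): each request pins ceil(capacity / block) slots."""
    return sum(-(-kv.capacity // tokens_per_block) for kv in constraint.kv_caches)


# Fallback typical step, using the values quoted in the review walkthrough.
TYPICAL_STEP = BatchDesc(
    kv_caches=(KVCacheDesc(capacity=2049, history_length=2048),))

constraint = build_concurrent_decode_constraint(max_batch_size=8,
                                                tokens_per_block=32)
print(len(constraint.kv_caches), toy_min_slots(constraint, tokens_per_block=32))
```

Under this toy model, every request in the constraint occupies exactly one block, so the floor equals max_batch_size regardless of the window size.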

Tests

  • python3 -m py_compile tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py tests/unittest/_torch/executor/test_per_layer_head_dim.py
  • Static regression check confirming the new constraint and preserved typical-step are present
  • git diff --check

I attempted the targeted pytest, but local collection is blocked by a missing nvtx dependency in this checkout.

Summary by CodeRabbit

  • New Features

    • Enhanced KV cache management with concurrent decoding constraints to optimize cache allocation for simultaneous requests.
  • Tests

    • Added test coverage for cache configuration in concurrent decoding scenarios.

Kevin-Li-2025 requested a review from a team as a code owner on June 17, 2026 18:22

coderabbitai Bot commented Jun 17, 2026


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 293ba2c6-81fd-4e39-86e3-1dbd6929c301

📥 Commits

Reviewing files that changed from the base of the PR and between 42a3e55 and 5755532.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py
  • tests/unittest/_torch/executor/test_per_layer_head_dim.py

📝 Walkthrough

KVCacheManagerV2._build_cache_config now passes a constraints field and a typical_step to KVCacheManagerConfigPy. The constraints are produced by a new _build_concurrent_decode_constraint static method that returns a BatchDesc of max_batch_size KVCacheDesc entries, each sized by tokens_per_block. A unit test verifies the resulting constraint and typical_step shapes.

Changes

Concurrent decode constraint for windowed KV cache pools

| Layer / File(s) | Summary |
|---|---|
| Constraint helper and cache config wiring (tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py) | Adds KVCacheDesc import, introduces the _build_concurrent_decode_constraint static method returning a BatchDesc of max_batch_size entries sized by tokens_per_block, and wires constraints plus a typical_step fallback BatchDesc(KVCacheDesc(capacity=2049, history_length=2048)) into KVCacheManagerConfigPy construction. |
| Unit test for constraint shape (tests/unittest/_torch/executor/test_per_layer_head_dim.py) | Imports GpuCacheTierConfig and adds test_build_cache_config_reserves_concurrent_decode_slots, which instantiates KVCacheManagerV2 without calling __init__, invokes _build_cache_config, and asserts that the constraint's kv_caches length equals max_batch_size, checks per-entry capacity/history_length, and checks typical_step sizing. |
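
The test approach summarized above (building the object without running __init__, then calling one method) can be sketched with a toy class. Everything here is hypothetical: ToyCacheManager, its attribute names, and the shape of the returned config are illustrative assumptions and do not reflect the real KVCacheManagerV2 internals. Only object.__new__ is standard Python.

```python
from types import SimpleNamespace


class ToyCacheManager:
    """Hypothetical stand-in for a manager whose __init__ is too heavy for a unit test."""

    def __init__(self):
        raise RuntimeError("real construction would need a GPU and a model config")

    def _build_cache_config(self):
        # Assumed attribute names; one single-block entry per batch slot.
        entries = [
            SimpleNamespace(capacity=self.tokens_per_block,
                            history_length=self.tokens_per_block - 1)
            for _ in range(self.max_batch_size)
        ]
        return SimpleNamespace(constraints=[SimpleNamespace(kv_caches=entries)])


def make_manager_without_init(max_batch_size, tokens_per_block):
    # object.__new__ allocates the instance but skips __init__ entirely,
    # so only the attributes the method under test reads must be set.
    manager = object.__new__(ToyCacheManager)
    manager.max_batch_size = max_batch_size
    manager.tokens_per_block = tokens_per_block
    return manager


def check_reserves_concurrent_decode_slots():
    manager = make_manager_without_init(max_batch_size=4, tokens_per_block=16)
    config = manager._build_cache_config()
    kv_caches = config.constraints[0].kv_caches
    assert len(kv_caches) == 4
    assert all(kv.capacity == 16 and kv.history_length == 15 for kv in kv_caches)
    return len(kv_caches)


check_reserves_concurrent_decode_slots()
```

The trade-off of this pattern is that the test only stays valid while the method under test reads the same attributes, so it tends to be used for narrow shape checks like the one described here.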

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • niukuo
  • zeroepoch
  • yizhang-nv
  • tburt-nv
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title "Reserve KV cache slots for concurrent decode in V2" directly summarizes the main change: adding concurrent-decode constraint logic to KVCacheManagerV2. |
| Description check | ✅ Passed | The PR description clearly explains the issue (windowed pool min_slots underflow), the solution (generic concurrent-decode constraint), and the testing approach, and it aligns with the template's requirements for explanation and test coverage. |
| Linked Issues check | ✅ Passed | The code changes implement the exact solution proposed in issue #15401: adding a concurrent-decode constraint with max_batch_size requests at tokens_per_block capacity and tokens_per_block-1 history to reserve minimum slots. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to addressing issue #15401: modifying KVCacheManagerV2._build_cache_config to add constraints and set typical_step, plus adding a regression test. |


Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
Kevin-Li-2025 force-pushed the kevin/fix-kv-cache-windowed-min-slots branch from 5755532 to 36634a6 on June 19, 2026 02:08

Development

Successfully merging this pull request may close these issues.

[Bug]: KVCacheManagerV2: windowed pool min_slots underflow causes scheduler deadlock at high concurrency
