Reserve KV cache slots for concurrent decode in V2 by Kevin-Li-2025 · Pull Request #15462 · NVIDIA/TensorRT-LLM

Kevin-Li-2025 · 2026-06-17T18:22:35Z

Description

KVCacheManagerV2 can under-reserve windowed pool slots when capacity planning only sees long-history requests. For small sliding windows, the stale range can leave a windowed pool with a min-slot floor of 1, which can deadlock scheduling once concurrent decode requests exceed that single slot.
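The failure mode can be illustrated with a toy model. This is only a sketch: it assumes, as the description states, that every decode request needs one slot in every pool, and the function and pool names (`admissible_decode_requests`, `full_attention`, `sliding_window`) are hypothetical and not taken from TensorRT-LLM.

```python
def admissible_decode_requests(pool_slots: dict, num_decode_requests: int) -> int:
    """Toy model: each decode request needs one slot in every pool,
    so the smallest pool bounds how many requests can run at once."""
    return min(num_decode_requests, min(pool_slots.values()))


max_batch_size = 8

# A windowed pool left with a min-slot floor of 1 (hypothetical numbers).
under_reserved = {"full_attention": 64, "sliding_window": 1}

# The same pools after flooring every pool at max_batch_size.
floored = {name: max(slots, max_batch_size) for name, slots in under_reserved.items()}

print(admissible_decode_requests(under_reserved, max_batch_size))
print(admissible_decode_requests(floored, max_batch_size))
```

In this model the under-reserved layout admits a single request no matter how large the other pools are, while the floored layout admits the whole batch.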

This adds a generic concurrent-decode constraint to the V2 cache config: max_batch_size requests, each with a capacity of one token block (tokens_per_block) and history_length=tokens_per_block - 1. Each decode request needs one slot in every pool group, so the constraint floors the min slots at max_batch_size without changing scheduler behavior.
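A minimal sketch of how such a constraint could be built. The `KVCacheDesc` and `BatchDesc` classes below are simplified stand-ins written for this example, not the real TensorRT-LLM types; only the field names `capacity`, `history_length`, and `kv_caches` come from the review summary of this PR, and the real constructor signatures may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class KVCacheDesc:
    """Stand-in for the real descriptor (assumed shape)."""
    capacity: int
    history_length: int


@dataclass(frozen=True)
class BatchDesc:
    """Stand-in for the real batch descriptor (assumed shape)."""
    kv_caches: Tuple[KVCacheDesc, ...]


def build_concurrent_decode_constraint(max_batch_size: int,
                                       tokens_per_block: int) -> BatchDesc:
    # One entry per concurrent decode request: a single token block of
    # capacity, with all but the last token counted as history.
    per_request = KVCacheDesc(capacity=tokens_per_block,
                              history_length=tokens_per_block - 1)
    return BatchDesc(kv_caches=tuple(per_request for _ in range(max_batch_size)))


constraint = build_concurrent_decode_constraint(max_batch_size=4, tokens_per_block=32)
print(len(constraint.kv_caches), constraint.kv_caches[0])
```

Because the constraint lists max_batch_size requests, any sizing pass that honors it has to provide at least that many slots per pool group.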

The config also sets the existing StorageManager fallback typical-step explicitly, so adding the constraint does not accidentally switch ratio selection to constraint-only sizing.
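The reasoning here can be sketched as a toy selection function. This models only the behavior the description implies (a fallback typical step when nothing is configured, and constraint-only sizing when constraints exist without a typical step); the actual StorageManager logic is not shown on this page and may differ, and `sizing_inputs` is a hypothetical name.

```python
from typing import List, Optional

# Placeholder labels standing in for batch descriptors.
FALLBACK_TYPICAL_STEP = "fallback-typical-step"
DECODE_CONSTRAINT = "concurrent-decode-constraint"


def sizing_inputs(constraints: List[str], typical_step: Optional[str]) -> List[str]:
    """Toy model: which batches drive pool-ratio selection."""
    if typical_step is not None:
        return [typical_step]           # explicit typical step wins
    if constraints:
        return list(constraints)        # constraint-only sizing
    return [FALLBACK_TYPICAL_STEP]      # nothing configured: use the fallback


before = sizing_inputs([], None)
constraint_only = sizing_inputs([DECODE_CONSTRAINT], None)
with_explicit_step = sizing_inputs([DECODE_CONSTRAINT], FALLBACK_TYPICAL_STEP)
print(before, constraint_only, with_explicit_step)
```

Under this model, passing the fallback value explicitly keeps ratio selection identical to the pre-change behavior even though a constraint is now present.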

Tests

- python3 -m py_compile tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py tests/unittest/_torch/executor/test_per_layer_head_dim.py
- Static regression check confirming the new constraint and preserved typical-step are present
- git diff --check

I attempted the targeted pytest, but local collection is blocked by a missing nvtx dependency in this checkout.

Summary by CodeRabbit

New Features
- Enhanced KV cache management with concurrent decoding constraints to optimize cache allocation for simultaneous requests.
Tests
- Added test coverage for cache configuration in concurrent decoding scenarios.

coderabbitai · 2026-06-17T18:26:15Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 293ba2c6-81fd-4e39-86e3-1dbd6929c301

📥 Commits

Reviewing files that changed from the base of the PR and between 42a3e55 and 5755532.

📒 Files selected for processing (2)

tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py
tests/unittest/_torch/executor/test_per_layer_head_dim.py

📝 Walkthrough

KVCacheManagerV2._build_cache_config now passes a constraints field and a typical_step to KVCacheManagerConfigPy. The constraints are produced by a new _build_concurrent_decode_constraint static method that returns a BatchDesc of max_batch_size KVCacheDesc entries, each sized by tokens_per_block. A unit test verifies the resulting constraint and typical_step shapes.

Changes

Concurrent decode constraint for windowed KV cache pools

Layer / File(s)	Summary
Constraint helper and cache config wiring `tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py`	Adds `KVCacheDesc` import, introduces the `_build_concurrent_decode_constraint` static method returning a `BatchDesc` of `max_batch_size` entries sized by `tokens_per_block`, and wires `constraints` plus a `typical_step` fallback `BatchDesc(KVCacheDesc(capacity=2049, history_length=2048))` into `KVCacheManagerConfigPy` construction.
Unit test for constraint shape `tests/unittest/_torch/executor/test_per_layer_head_dim.py`	Imports `GpuCacheTierConfig` and adds `test_build_cache_config_reserves_concurrent_decode_slots`, which instantiates `KVCacheManagerV2` without calling `__init__`, invokes `_build_cache_config`, and asserts constraint `kv_caches` length equals `max_batch_size`, per-entry `capacity`/`history_length`, and `typical_step` sizing.
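The testing pattern described above (creating the object without running `__init__`, then calling the config builder directly) can be sketched as follows. `ToyManager` and the stand-in descriptor classes are hypothetical and exist only for this example; the real test targets `KVCacheManagerV2` and `KVCacheManagerConfigPy`, whose exact attributes are not shown on this page. The 2049/2048 values are the typical-step sizing quoted in the summary.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Tuple


@dataclass(frozen=True)
class KVCacheDesc:  # stand-in, assumed shape
    capacity: int
    history_length: int


@dataclass(frozen=True)
class BatchDesc:  # stand-in, assumed shape
    kv_caches: Tuple[KVCacheDesc, ...]


class ToyManager:
    def __init__(self):
        # Stands in for heavy setup that a unit test wants to avoid.
        raise RuntimeError("expensive initialization")

    def _build_cache_config(self):
        tpb = self.tokens_per_block
        constraint = BatchDesc(tuple(
            KVCacheDesc(capacity=tpb, history_length=tpb - 1)
            for _ in range(self.max_batch_size)))
        typical_step = BatchDesc((KVCacheDesc(capacity=2049, history_length=2048),))
        return SimpleNamespace(constraints=[constraint], typical_step=typical_step)


def test_build_cache_config_reserves_concurrent_decode_slots():
    manager = ToyManager.__new__(ToyManager)  # allocate without calling __init__
    manager.max_batch_size = 4
    manager.tokens_per_block = 32

    config = manager._build_cache_config()

    (constraint,) = config.constraints
    assert len(constraint.kv_caches) == manager.max_batch_size
    for desc in constraint.kv_caches:
        assert desc.capacity == 32
        assert desc.history_length == 31
    assert config.typical_step.kv_caches[0].capacity == 2049
    assert config.typical_step.kv_caches[0].history_length == 2048


test_build_cache_config_reserves_concurrent_decode_slots()
```

Bypassing `__init__` keeps the test free of GPU and model setup, at the cost of having to set every attribute the builder reads by hand.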

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

niukuo
zeroepoch
yizhang-nv
tburt-nv

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title "Reserve KV cache slots for concurrent decode in V2" directly summarizes the main change: adding concurrent-decode constraint logic to KVCacheManagerV2.
Description check	✅ Passed	The PR description clearly explains the issue (windowed pool min_slots underflow), the solution (generic concurrent-decode constraint), testing approach, and aligns with the template's requirements for explanation and test coverage.
Linked Issues check	✅ Passed	The code changes implement the exact solution proposed in issue `#15401`: adding a concurrent-decode constraint with max_batch_size requests at tokens_per_block capacity and tokens_per_block-1 history to reserve minimum slots.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to addressing issue `#15401`: modifying KVCacheManagerV2._build_cache_config to add constraints and setting typical_step, plus adding a regression test.

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>

Kevin-Li-2025 requested a review from a team as a code owner June 17, 2026 18:22

github-actions Bot assigned Kevin-Li-2025 Jun 17, 2026

Reserve KV cache slots for concurrent decode

36634a6

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>

Kevin-Li-2025 force-pushed the kevin/fix-kv-cache-windowed-min-slots branch from 5755532 to 36634a6 Compare June 19, 2026 02:08

Kevin-Li-2025 wants to merge 1 commit into NVIDIA:main from Kevin-Li-2025:kevin/fix-kv-cache-windowed-min-slots
