Reserve KV cache slots for concurrent decode in V2 #15462
Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
5755532 to 36634a6
Description
Fixes #15401.
`KVCacheManagerV2` can under-reserve windowed pool slots when capacity planning only sees long-history requests. For small sliding windows, the stale range can leave a windowed pool with a min-slot floor of 1, which can deadlock scheduling once concurrent decode requests exceed that single slot.

This adds a generic concurrent-decode constraint to the V2 cache config: `max_batch_size` requests at one token block with `history_length=tokens_per_block - 1`. Each decode request needs one slot in every pool group, so this floors the min slots at `max_batch_size` without changing scheduler behavior.

The config also sets the existing StorageManager fallback typical-step explicitly, so adding the constraint does not accidentally switch ratio selection to constraint-only sizing.
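The floor described above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the actual `KVCacheManagerV2` code: `DecodeConstraint`, `concurrent_decode_constraint`, `blocks_per_request`, and `min_slots` are hypothetical names, and the sketch assumes that a request with `history_length=tokens_per_block - 1` plus one newly decoded token occupies exactly one block per pool group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeConstraint:
    """Hypothetical stand-in for a capacity-planning constraint (not the real V2 type)."""
    num_requests: int
    history_length: int
    new_tokens: int = 1  # assumption: one token generated per decode step


def concurrent_decode_constraint(max_batch_size: int, tokens_per_block: int) -> DecodeConstraint:
    # max_batch_size requests, each with a history one token short of a full block.
    return DecodeConstraint(num_requests=max_batch_size,
                            history_length=tokens_per_block - 1)


def blocks_per_request(c: DecodeConstraint, tokens_per_block: int) -> int:
    # Ceiling division: blocks needed to hold the history plus the new token.
    total_tokens = c.history_length + c.new_tokens
    return -(-total_tokens // tokens_per_block)


def min_slots(existing_floor: int, c: DecodeConstraint, tokens_per_block: int) -> int:
    # The constraint can only raise the floor of a pool group, never lower it.
    return max(existing_floor, c.num_requests * blocks_per_request(c, tokens_per_block))


constraint = concurrent_decode_constraint(max_batch_size=8, tokens_per_block=32)
print(min_slots(existing_floor=1, c=constraint, tokens_per_block=32))
```

Under these assumptions, a windowed pool whose floor was previously 1 ends up with at least `max_batch_size` slots, while a pool that already reserves more is left unchanged.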
Tests
- `python3 -m py_compile tensorrt_llm/_torch/pyexecutor/kv_cache_manager_v2.py tests/unittest/_torch/executor/test_per_layer_head_dim.py`
- `git diff --check`

I attempted the targeted pytest, but local collection is blocked by a missing `nvtx` dependency in this checkout.