System Info
CPU architecture: x86_64
CPU/Host memory size: 2.0 TiB
GPU properties
GPU name: NVIDIA H100-like (SM89)
GPU memory size: 40 GiB KV cache budget (per-rank)
Libraries
TensorRT-LLM branch: main (merge-base commit 035de5d)
TensorRT-LLM version: 0.19.0.dev
TensorRT: 10.13.2
PyTorch: 2.9.0
CUDA: 12.9
OS: Ubuntu 24.04.2 LTS
Python: 3.12.3
$ pip show tensorrt_llm tensorrt torch
Name: tensorrt_llm
Version: 0.19.0.dev
Name: tensorrt
Version: 10.13.2
Name: torch
Version: 2.9.0
Who can help?
No response
Reproduction
Problem
When using KVCacheManagerV2 with multiple pool groups containing windowed attention types (e.g., SWA with window_size < tokens_per_block), the computed min_slots for windowed pools can be as low as 1, causing scheduler deadlock when concurrent decode requests exceed that limit.
This affects any model with heterogeneous KV cache pools where some pools use windowed eviction — not just DeepSeek-V4.
Root Cause
_build_cache_config() constructs typical_step with max_batch_size - 1 decode requests using capacity=max_seq_len and history_length=generation_history_length. For windowed pools, the stale range consumes nearly all blocks:
Example (DeepSeek-V4, tokens_per_block=256, max_seq_len=137984):
total_blocks = ceil(137984 / 256) = 539
Pool Group 0 (SWA, window=128):
stale_end = (137983 + 1 - 128) // 256 = 538
non_stale = 539 - 538 = 1 ← only 1 slot per request!
Pool Group 1 (no window):
stale = 0
non_stale = 539 ← full allocation
Pool Group 2 (window=8):
stale_end = (137983 + 1 - 8) // 256 = 538
non_stale = 539 - 538 = 1 ← only 1 slot per request!
The resulting min_slots = [1, 539, 1] for [SWA, COMPRESS, STATE]. Since each decode request needs exactly 1 slot in windowed pools, min_slots=1 means only 1 concurrent decode request can be served — any more will deadlock the V2 scheduler.
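The arithmetic above can be reproduced with a short sketch. This only mirrors the formulas quoted in this issue; `non_stale_blocks` is a hypothetical helper, not the actual `KVCacheManagerV2` code, and it assumes `history_length = max_seq_len - 1` as in the example.

```python
import math


def non_stale_blocks(capacity, history_length, tokens_per_block, window=None):
    """Blocks one request must keep resident in a single pool group.

    Hypothetical helper that follows the example's arithmetic only.
    """
    total_blocks = math.ceil(capacity / tokens_per_block)
    if window is None:
        return total_blocks  # no windowed eviction: every block stays live
    # Blocks that end before the start of the attention window are stale.
    stale_end = max(0, (history_length + 1 - window) // tokens_per_block)
    return total_blocks - stale_end


TOKENS_PER_BLOCK = 256
MAX_SEQ_LEN = 137984
WINDOWS = [128, None, 8]  # pg0 (SWA), pg1 (no window), pg2

per_request = [
    non_stale_blocks(MAX_SEQ_LEN, MAX_SEQ_LEN - 1, TOKENS_PER_BLOCK, w)
    for w in WINDOWS
]
print(per_request)  # [1, 539, 1]
```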
Expected behavior
The scheduler should be able to serve up to max_batch_size concurrent decode requests without deadlock. Windowed pools should reserve at least max_batch_size slots as their minimum, since each decode request needs at least 1 slot in every pool group.
Expected min_slots (for max_batch_size=64):
pg0 (SWA, window=128): 64
pg1 (no window): 539
pg2 (window=8): 64
actual behavior
Windowed pools get min_slots=1 because the typical_step constraint uses max_seq_len as capacity, which makes the stale range consume all but 1 block in windowed pools. The ratio computation then allocates almost all memory to the non-windowed pool, starving windowed pools.
Actual min_slots:
pg0 (SWA, window=128): 1 ← should be max_batch_size
pg1 (no window): 539
pg2 (window=8): 1 ← should be max_batch_size
This causes the V2 scheduler to deadlock when more than 1 decode request is active, because it cannot find a free slot in pg0/pg2 for the second request.
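As a rough illustration of why the second request cannot be placed, here is a toy model of slot admission. It is an assumption-laden sketch, not the V2 scheduler: `max_concurrent_decodes` is a hypothetical function, and it assumes each decode request needs at least one slot in every pool group, so the result is only an upper bound on concurrency.

```python
def max_concurrent_decodes(pool_slots, slots_per_request):
    """Upper bound on decode requests that fit at once.

    A request is only placeable if every pool group still has room,
    so the tightest pool group decides the bound.
    """
    return min(slots // need for slots, need in zip(pool_slots, slots_per_request))


# One slot per request per pool group is the minimum any decode request needs.
need = [1, 1, 1]

actual = max_concurrent_decodes([1, 539, 1], need)
expected = max_concurrent_decodes([64, 539, 64], need)
print(actual, expected)  # 1 64
```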
additional notes
Proposed Fix
Add a generic constraint in KVCacheManagerV2._build_cache_config() representing max_batch_size concurrent decode requests at the tail of their windows:
```python
# Generic constraint: max_batch_size concurrent decode requests.
# Ensures windowed pools reserve at least max_batch_size slots,
# since each decode request needs exactly 1 slot when
# window_size < tokens_per_block.
constraints.append(
    BatchDesc(
        [KVCacheDesc(capacity=tokens_per_block, history_length=tokens_per_block - 1)]
        * max_batch_size
    )
)
```
Why this works: Each KVCacheDesc(capacity=tokens_per_block, history_length=tokens_per_block - 1) represents a decode request that occupies a single block, so no block can be stale. For every pool group (windowed or not), stale_end = 0 and non_stale = 1. Summed over max_batch_size requests, this constraint yields min_slots >= max_batch_size for all pools.
Why max_batch_size is the right floor: The scheduler can issue up to max_batch_size concurrent decode requests. Each decode request needs at least 1 slot in every pool group. If any pool has fewer slots than max_batch_size, the scheduler cannot make progress → deadlock.
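A minimal sketch of the effect of the new constraint, using the numbers from this issue. Two things here are assumptions rather than confirmed library behavior: `non_stale_blocks` and `slots_for` are hypothetical helpers, and the per-pool minimum is taken as the elementwise maximum across constraints.

```python
import math


def non_stale_blocks(capacity, history_length, tokens_per_block, window=None):
    """Blocks one request keeps resident in a pool group (hypothetical helper)."""
    total_blocks = math.ceil(capacity / tokens_per_block)
    if window is None:
        return total_blocks
    stale_end = max(0, (history_length + 1 - window) // tokens_per_block)
    return total_blocks - stale_end


def slots_for(batch, tokens_per_block, windows):
    """Sum the per-request requirement over a batch, per pool group."""
    return [
        sum(non_stale_blocks(cap, hist, tokens_per_block, w) for cap, hist in batch)
        for w in windows
    ]


TOKENS_PER_BLOCK = 256
MAX_BATCH_SIZE = 64
WINDOWS = [128, None, 8]  # pg0 (SWA), pg1 (no window), pg2

# Proposed constraint: max_batch_size single-block decode requests.
decode_batch = [(TOKENS_PER_BLOCK, TOKENS_PER_BLOCK - 1)] * MAX_BATCH_SIZE
decode_floor = slots_for(decode_batch, TOKENS_PER_BLOCK, WINDOWS)

typical = [1, 539, 1]  # min_slots reported above, before the fix
# Assumption: constraints combine as an elementwise maximum.
min_slots = [max(a, b) for a, b in zip(typical, decode_floor)]

print(decode_floor)  # [64, 64, 64]
print(min_slots)     # [64, 539, 64]
```

Under these assumptions the combined result matches the expected min_slots listed earlier for max_batch_size=64.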