[Bug]: KVCacheManagerV2: windowed pool min_slots underflow causes scheduler deadlock at high concurrency #15401

### System Info

CPU architecture: x86_64
CPU/Host memory size: 2.0 TiB
GPU properties
GPU name: NVIDIA H100-like (SM89)
GPU memory size: 40 GiB KV cache budget (per-rank)
Libraries
TensorRT-LLM branch: main (merge-base commit 035de5d185)
TensorRT-LLM version: 0.19.0.dev
TensorRT: 10.13.2
PyTorch: 2.9.0
CUDA: 12.9
OS: Ubuntu 24.04.2 LTS
Python: 3.12.3
$ pip show tensorrt_llm tensorrt torch
Name: tensorrt_llm
Version: 0.19.0.dev
Name: tensorrt
Version: 10.13.2
Name: torch
Version: 2.9.0

### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

Problem
When using KVCacheManagerV2 with multiple pool groups containing windowed attention types (e.g., SWA with window_size < tokens_per_block), the computed min_slots for windowed pools can be as low as 1, causing scheduler deadlock when concurrent decode requests exceed that limit.

This affects any model with heterogeneous KV cache pools where some pools use windowed eviction — not just DeepSeek-V4.

Root Cause
_build_cache_config() constructs typical_step with max_batch_size - 1 decode requests using capacity=max_seq_len and history_length=generation_history_length. For windowed pools, the stale range consumes nearly all blocks:

Example (DeepSeek-V4, tokens_per_block=256, max_seq_len=137984):

total_blocks = ceil(137984 / 256) = 539

Pool Group 0 (SWA, window=128):
  stale_end = (137983 + 1 - 128) // 256 = 538
  non_stale = 539 - 538 = 1     ← only 1 slot per request!

Pool Group 1 (no window):
  stale = 0
  non_stale = 539               ← full allocation

Pool Group 2 (window=8):
  stale_end = (137983 + 1 - 8) // 256 = 538
  non_stale = 539 - 538 = 1     ← only 1 slot per request!

The resulting min_slots = [1, 539, 1] for [SWA, COMPRESS, STATE]. Since each decode request needs exactly 1 slot in windowed pools, min_slots=1 means only 1 concurrent decode request can be served; any additional request deadlocks the V2 scheduler.
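The per-request arithmetic above can be reproduced with a short sketch. The helper `non_stale_blocks` and the `POOL_WINDOWS` mapping are hypothetical names introduced for illustration; they mirror the formulas in the example and are not the KVCacheManagerV2 implementation.

```python
def non_stale_blocks(capacity, history_length, tokens_per_block, window_size=None):
    """Blocks one request still occupies after stale blocks are dropped.

    Hypothetical helper that mirrors the arithmetic in the example above.
    """
    total_blocks = -(-capacity // tokens_per_block)  # integer ceiling division
    if window_size is None:
        stale_end = 0  # no window: no block ever becomes stale
    else:
        stale_end = max(0, (history_length + 1 - window_size) // tokens_per_block)
    return total_blocks - stale_end


TOKENS_PER_BLOCK = 256
MAX_SEQ_LEN = 137984
# Window size per pool group, taken from the example (None = no window).
POOL_WINDOWS = {"pg0": 128, "pg1": None, "pg2": 8}

per_request = {
    name: non_stale_blocks(MAX_SEQ_LEN, MAX_SEQ_LEN - 1, TOKENS_PER_BLOCK, window)
    for name, window in POOL_WINDOWS.items()
}
print(per_request)
```

With the values from the example this prints `{'pg0': 1, 'pg1': 539, 'pg2': 1}`, matching the min_slots reported above.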

### Expected behavior

The scheduler should be able to serve up to max_batch_size concurrent decode requests without deadlock. Windowed pools should reserve at least max_batch_size slots as their minimum, since each decode request needs exactly 1 slot in every pool group.

Expected min_slots (for max_batch_size=64):

pg0 (SWA, window=128):  64
pg1 (no window):        539
pg2 (window=8):         64

### actual behavior

Windowed pools get min_slots=1 because the typical_step constraint uses max_seq_len as capacity, which makes the stale range consume all but 1 block in windowed pools. The ratio computation then allocates almost all memory to the non-windowed pool, starving windowed pools.

Actual min_slots:

pg0 (SWA, window=128):  1   ← should be max_batch_size
pg1 (no window):        539
pg2 (window=8):         1   ← should be max_batch_size

This causes the V2 scheduler to deadlock when more than 1 decode request is active, because it cannot find a free slot in pg0/pg2 for the second request.
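A toy model makes the concurrency limit concrete. It assumes, as stated in this report, that every decode request needs at least 1 slot in each pool group; `max_concurrent_decodes` is a hypothetical helper and does not reflect how the V2 scheduler is actually written.

```python
def max_concurrent_decodes(min_slots, slots_per_decode):
    """Upper bound on decode requests that can hold slots at the same time.

    The tightest pool group decides: a request is only schedulable if it
    gets its slots in every pool group.
    """
    return min(slots // need for slots, need in zip(min_slots, slots_per_decode))


# Assumption from the report: one slot per decode request in every pool group.
SLOTS_PER_DECODE = [1, 1, 1]

actual = max_concurrent_decodes([1, 539, 1], SLOTS_PER_DECODE)
expected = max_concurrent_decodes([64, 539, 64], SLOTS_PER_DECODE)
print(actual, expected)
```

Under this model the actual min_slots admit 1 concurrent decode request, while the expected min_slots admit 64 (max_batch_size).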

### additional notes

Proposed Fix
Add a generic constraint in KVCacheManagerV2._build_cache_config() representing max_batch_size concurrent decode requests at the tail of their windows:

```python
# Generic constraint: max_batch_size concurrent decode requests.
# Ensures windowed pools reserve at least max_batch_size slots,
# since each decode request needs exactly 1 slot when
# window_size < tokens_per_block.
constraints.append(
    BatchDesc(
        [KVCacheDesc(capacity=tokens_per_block, history_length=tokens_per_block - 1)]
        * max_batch_size
    )
)
```

Why this works: Each KVCacheDesc(capacity=tokens_per_block, history_length=tokens_per_block - 1) represents a decode request at the boundary where all blocks fall within the window. For every pool group (windowed or not), stale_end = 0, so non_stale = 1. Multiplied by max_batch_size, this produces min_slots >= max_batch_size for all pools.
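The effect of the proposed constraint can be checked with the same arithmetic as the example in the Reproduction section. This is a sketch under two assumptions: the hypothetical helper `non_stale_blocks` reflects how stale blocks are counted, and constraints combine as an elementwise maximum per pool group. Neither is confirmed against the real `_build_cache_config()` code.

```python
def non_stale_blocks(capacity, history_length, tokens_per_block, window_size=None):
    """Hypothetical helper: blocks one request still occupies."""
    total_blocks = -(-capacity // tokens_per_block)  # integer ceiling division
    if window_size is None:
        stale_end = 0
    else:
        stale_end = max(0, (history_length + 1 - window_size) // tokens_per_block)
    return total_blocks - stale_end


TOKENS_PER_BLOCK = 256
MAX_BATCH_SIZE = 64
WINDOWS = [128, None, 8]  # pg0, pg1, pg2 from the example

# Proposed constraint: max_batch_size requests, each one block long.
tail_constraint = [
    MAX_BATCH_SIZE
    * non_stale_blocks(TOKENS_PER_BLOCK, TOKENS_PER_BLOCK - 1, TOKENS_PER_BLOCK, w)
    for w in WINDOWS
]

# Per-pool values currently produced (from "Actual min_slots").
typical_constraint = [1, 539, 1]

# Assumption of this sketch: constraints combine as an elementwise maximum.
combined = [max(a, b) for a, b in zip(typical_constraint, tail_constraint)]
print(tail_constraint, combined)
```

Here `tail_constraint` is `[64, 64, 64]` and `combined` is `[64, 539, 64]`, which matches the expected min_slots listed under Expected behavior.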

Why max_batch_size is the right floor: The scheduler can issue up to max_batch_size concurrent decode requests. Each decode request needs at least 1 slot in every pool group. If any pool has fewer slots than max_batch_size, the scheduler cannot make progress and deadlocks.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.
