
[Bug]: KVCacheManagerV2: windowed pool min_slots underflow causes scheduler deadlock at high concurrency #15401

Description

@lwlsaysnuaa

System Info

CPU architecture: x86_64
CPU/Host memory size: 2.0 TiB
GPU properties
GPU name: NVIDIA H100-like (SM89)
GPU memory size: 40 GiB KV cache budget (per-rank)
Libraries
TensorRT-LLM branch: main (merge-base commit 035de5d)
TensorRT-LLM version: 0.19.0.dev
TensorRT: 10.13.2
PyTorch: 2.9.0
CUDA: 12.9
OS: Ubuntu 24.04.2 LTS
Python: 3.12.3
$ pip show tensorrt_llm tensorrt torch
Name: tensorrt_llm
Version: 0.19.0.dev
Name: tensorrt
Version: 10.13.2
Name: torch
Version: 2.9.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Problem
When KVCacheManagerV2 manages multiple pool groups and some of them use windowed attention (e.g., SWA with window_size < tokens_per_block), the computed min_slots for the windowed pools can be as low as 1. The scheduler then deadlocks as soon as the number of concurrent decode requests exceeds that value.

This affects any model with heterogeneous KV cache pools where some pools use windowed eviction — not just DeepSeek-V4.

Root Cause
_build_cache_config() constructs typical_step from max_batch_size - 1 decode requests, each with capacity=max_seq_len and history_length=generation_history_length. At that length, the stale range of a windowed pool covers all but one block:

Example (DeepSeek-V4, tokens_per_block=256, max_seq_len=137984):

total_blocks = ceil(137984 / 256) = 539

Pool Group 0 (SWA, window=128):
stale_end = (137983 + 1 - 128) // 256 = 538
non_stale = 539 - 538 = 1 ← only 1 slot per request!

Pool Group 1 (no window):
stale = 0
non_stale = 539 ← full allocation

Pool Group 2 (window=8):
stale_end = (137983 + 1 - 8) // 256 = 538
non_stale = 539 - 538 = 1 ← only 1 slot per request!

The resulting min_slots is [1, 539, 1] for the [SWA, COMPRESS, STATE] pool groups. Each decode request needs exactly 1 slot in each windowed pool, so min_slots=1 means only 1 concurrent decode request can be served. Any additional request deadlocks the V2 scheduler.
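
The arithmetic above can be reproduced with a short standalone sketch. It only re-implements the formulas quoted in the example; `non_stale_blocks` is a hypothetical helper, not a TensorRT-LLM function, and clamping `stale_end` at 0 is an assumption added so that short histories do not go negative.

```python
import math

def non_stale_blocks(capacity, history_length, tokens_per_block, window=None):
    """Blocks one request keeps live, following the example's arithmetic."""
    total_blocks = math.ceil(capacity / tokens_per_block)
    if window is None:
        stale_end = 0  # no window: nothing is ever evicted
    else:
        # Assumption: clamp at 0 for histories shorter than the window.
        stale_end = max(0, (history_length + 1 - window) // tokens_per_block)
    return total_blocks - stale_end

TOKENS_PER_BLOCK = 256
MAX_SEQ_LEN = 137984

# Window size per pool group, taken from the example (None = no window).
windows = {"pg0 (SWA)": 128, "pg1 (no window)": None, "pg2": 8}

per_request = {
    name: non_stale_blocks(MAX_SEQ_LEN, MAX_SEQ_LEN - 1, TOKENS_PER_BLOCK, w)
    for name, w in windows.items()
}
print(per_request)
```

With the values from the example, the dictionary holds 1, 539, and 1 blocks for pg0, pg1, and pg2 respectively.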

Expected behavior

The scheduler should be able to serve up to max_batch_size concurrent decode requests without deadlock. Windowed pools should reserve at least max_batch_size slots as their minimum, since each decode request needs exactly 1 slot in every pool group.

Expected min_slots (for max_batch_size=64):

pg0 (SWA, window=128): 64
pg1 (no window): 539
pg2 (window=8): 64

Actual behavior

Windowed pools get min_slots=1 because the typical_step constraint uses max_seq_len as capacity, which makes the stale range consume all but 1 block in windowed pools. The ratio computation then allocates almost all memory to the non-windowed pool, starving windowed pools.

Actual min_slots:

pg0 (SWA, window=128): 1 ← should be max_batch_size
pg1 (no window): 539
pg2 (window=8): 1 ← should be max_batch_size
This causes the V2 scheduler to deadlock when more than 1 decode request is active, because it cannot find a free slot in pg0/pg2 for the second request.
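
As a rough illustration of why the undersized pools cap concurrency, the toy model below treats each pool group as a fixed number of slots and assumes every active decode request holds at least one slot in every pool group. `max_concurrent_decodes` is a hypothetical helper for this sketch and does not model the real V2 scheduler.

```python
def max_concurrent_decodes(pool_slots, per_request_need):
    """Largest number of requests that fit in every pool at the same time."""
    return min(slots // need for slots, need in zip(pool_slots, per_request_need))

# Assumption: a decode request holds at least 1 slot per pool group.
need = [1, 1, 1]

actual = max_concurrent_decodes([1, 539, 1], need)      # min_slots reported above
expected = max_concurrent_decodes([64, 539, 64], need)  # min_slots the issue asks for

print(actual, expected)
```

Under these assumptions the reported min_slots admit a single decode request, while the expected values admit max_batch_size=64.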

Additional notes

Proposed Fix
Add a generic constraint in KVCacheManagerV2._build_cache_config() representing max_batch_size concurrent decode requests at the tail of their windows:

# Generic constraint: max_batch_size concurrent decode requests.
# Ensures windowed pools reserve at least max_batch_size slots,
# since each decode request needs exactly 1 slot when
# window_size < tokens_per_block.
constraints.append(
    BatchDesc(
        [KVCacheDesc(capacity=tokens_per_block, history_length=tokens_per_block - 1)]
        * max_batch_size
    )
)

Why this works: Each KVCacheDesc(capacity=tokens_per_block, history_length=tokens_per_block - 1) represents a decode request that occupies a single block, and that block has not yet become stale. For every pool group (windowed or not), stale_end = 0, so non_stale = 1. Multiplied by max_batch_size, this produces min_slots >= max_batch_size for all pools.

Why max_batch_size is the right floor: The scheduler can issue up to max_batch_size concurrent decode requests, and each one needs at least 1 slot in every pool group. If any pool has fewer than max_batch_size slots, the scheduler cannot make progress and deadlocks.
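
The effect of the proposed constraint can be checked with a minimal sketch. `Desc` is a hypothetical stand-in for KVCacheDesc, the batch is a plain list rather than a BatchDesc, and the stale-range formula is the one used in the Root Cause example. How the manager combines this constraint with the existing ones is not modeled here.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Desc:
    """Hypothetical stand-in for KVCacheDesc (not the real class)."""
    capacity: int
    history_length: int

def non_stale_blocks(desc, tokens_per_block, window=None):
    total_blocks = math.ceil(desc.capacity / tokens_per_block)
    if window is None:
        return total_blocks
    # Assumption: clamp at 0 for histories shorter than the window.
    stale_end = max(0, (desc.history_length + 1 - window) // tokens_per_block)
    return total_blocks - stale_end

def slots_for_batch(batch, tokens_per_block, window=None):
    """Slots one pool group needs to hold every request in the batch."""
    return sum(non_stale_blocks(d, tokens_per_block, window) for d in batch)

tokens_per_block = 256
max_batch_size = 64

# Same shape as the proposed constraint: max_batch_size single-block requests.
batch = [Desc(capacity=tokens_per_block, history_length=tokens_per_block - 1)] * max_batch_size

slots = [slots_for_batch(batch, tokens_per_block, w) for w in (128, None, 8)]
print(slots)
```

For windows of 128, none, and 8, the constraint on its own yields 64 slots per pool group, which matches the max_batch_size floor described above.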

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.


    Labels

    KV-Cache Management (kv-cache management for efficient LLM inference), bug (Something isn't working)
