[None][feat] DSA: adaptive indexer prefill chunk size for long sequences by lfr-0531 · Pull Request #15459 · NVIDIA/TensorRT-LLM

lfr-0531 · 2026-06-17T13:02:37Z

Background / Motivation

The DSA indexer's fp8_fp4_mqa_logits / fp8_mqa_logits activation memory
scales with indexer_max_chunk_size * K_compressed (the compressed KV length
of the current request). For very long sequences this can OOM — e.g. a
~500K-token request with the default 32K chunk size needs ~16GB of activation
memory.
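
The scaling relationship above can be turned into a rough estimate. The sketch below is illustrative only: the function name, the 4-byte logit size, and the compressed KV length used in the example are all assumptions, not values taken from the PR. What it shows reliably is the proportionality: at a fixed K_compressed, reducing the chunk size from 32K to 8K shrinks the logits buffer by 4x.

```python
K = 1024  # assumption: "K" in the text means 1024


def estimate_indexer_logits_bytes(chunk_size: int, k_compressed: int,
                                  bytes_per_logit: int = 4) -> int:
    """Rough size of a [chunk_size, k_compressed] logits buffer.

    bytes_per_logit=4 (float32 output) is an assumption for illustration;
    the PR text only states that memory scales with
    indexer_max_chunk_size * K_compressed.
    """
    return chunk_size * k_compressed * bytes_per_logit


# Hypothetical compressed KV length; the PR does not state the ratio
# between the raw sequence length and K_compressed.
k_compressed = 128 * K

full = estimate_indexer_logits_bytes(32 * K, k_compressed)
reduced = estimate_indexer_logits_bytes(8 * K, k_compressed)

print(f"32K chunk: {full / 2**30:.1f} GiB")
print(f" 8K chunk: {reduced / 2**30:.1f} GiB")
print(f"reduction: {full // reduced}x")
```

With these assumed inputs the 32K-chunk estimate is 16 GiB, the same order as the figure quoted above, but that agreement depends entirely on the assumed logit size and compressed length.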

The current workaround is to uniformly lower indexer_max_chunk_size (e.g. to
8K). That avoids the OOM but costs prefill throughput for the common case,
where the vast majority of requests are far below these lengths — a blunt
one-size-fits-all setting for what is really a long-tail problem.

Summary

Select the indexer prefill chunk size per-batch based on the largest
compressed KV length (K_compressed) among the context requests, instead of
always using the statically-configured value:

| max K_compressed in batch | effective chunk size |
| --- | --- |
| `> 512K` | 8K |
| `[256K, 512K]` | 16K |
| `< 256K` | configured (default 32K, unchanged) |

Implementation:

- New helper select_indexer_chunk_size(configured_chunk_size, max_k_compressed)
  in dsa.py. It only ever reduces the configured chunk size (never
  increases it), so any explicitly-configured value still acts as an upper
  bound.
- Wired into Indexer.prepare_for_chunked_prefill, on the indexer's own
  chunking path only. The MLA chunked-prefill path is untouched (it already
  bounds the chunk to the MLA chunk).
- max K_compressed is read from indexer_params.kv_lens (already host-side
  during prefill prepare).
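
As a reading aid, the selection rule described above might look roughly like the sketch below. This is not the code from dsa.py: the function body, the `chunk_size_for_batch` wrapper, and the interpretation of "K" as 1024 are assumptions, and the real change keeps its thresholds in _INDEXER_CHUNK_SIZE_HEURISTIC, whose layout is not shown in this description.

```python
K = 1024  # assumption: "K" in the PR description means 1024


def select_indexer_chunk_size(configured_chunk_size: int,
                              max_k_compressed: int) -> int:
    """Pick the indexer prefill chunk size for one batch (illustrative sketch)."""
    if max_k_compressed > 512 * K:
        heuristic_chunk = 8 * K
    elif max_k_compressed >= 256 * K:
        heuristic_chunk = 16 * K
    else:
        # Below 256K the configured value is used as-is.
        return configured_chunk_size
    # Only ever reduce: the configured value stays an upper bound.
    return min(configured_chunk_size, heuristic_chunk)


def chunk_size_for_batch(configured_chunk_size: int, kv_lens: list[int]) -> int:
    """Hypothetical wrapper: use the largest compressed KV length in the batch.

    Assumes kv_lens is a non-empty, host-side list of per-request
    compressed KV lengths.
    """
    return select_indexer_chunk_size(configured_chunk_size, max(kv_lens))


# A batch of short requests keeps the configured 32K chunk ...
print(chunk_size_for_batch(32 * K, [4 * K, 60 * K]))
# ... while one long request in the batch lowers it for that batch only.
print(chunk_size_for_batch(32 * K, [4 * K, 300 * K]))
# A configured value below the heuristic value is never raised.
print(chunk_size_for_batch(4 * K, [600 * K]))
```

Taking `min` with the configured value is what makes the heuristic reduce-only, so a deployment that already sets a small indexer_max_chunk_size sees no change.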

Impact

- Long requests (max K_compressed >= 256K): smaller chunk → lower indexer
  activation memory, which avoids the OOM without manual tuning.
- Common case (< 256K): behavior unchanged; keeps the larger,
  higher-throughput chunk size.
- Safe to vary per-batch because the prefill path does not use CUDA graphs.
- No API/config changes; existing indexer_max_chunk_size continues to act as
  the upper bound.

Notes

- Draft for early review / discussion. Thresholds (256K / 512K) and chunk
  sizes (8K / 16K / 32K) are centralized in _INDEXER_CHUNK_SIZE_HEURISTIC for
  easy tuning.
- TODO before un-drafting: add a unit test for select_indexer_chunk_size
  boundaries and validate end-to-end memory/perf on a long-sequence workload.

🤖 Generated with Claude Code

The indexer's fp8_fp4_mqa_logits / fp8_mqa_logits activation memory scales with indexer_max_chunk_size * K_compressed (the compressed KV length of the current request). For very long sequences this can OOM: e.g. a ~500K-token request with the default 32K chunk size needs ~16GB of activation memory.

The previous workaround was to uniformly lower indexer_max_chunk_size (e.g. to 8K), but that costs prefill throughput for the common case where the vast majority of requests are far below these lengths.

Instead, select the indexer prefill chunk size per-batch based on the largest compressed KV length among the context requests:

- max K_compressed > 512K -> 8K chunk
- 256K <= max K_compressed <= 512K -> 16K chunk
- max K_compressed < 256K -> configured chunk size (unchanged)

The heuristic only ever reduces the configured chunk size (never increases it), so the common case keeps its larger, higher-throughput chunk. It is applied only on the indexer's own chunking path (not the MLA chunked-prefill path, which already bounds chunk size) and is safe to vary per-batch because the prefill path does not use CUDA graphs.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 · 2026-06-17T13:32:45Z

/bot run

tensorrt-cicd · 2026-06-17T13:38:50Z

PR_Github #54853 [ run ] triggered by Bot. Commit: ff0e8a5

tensorrt-cicd · 2026-06-17T15:15:14Z

PR_Github #54853 [ run ] completed with state SUCCESS. Commit: ff0e8a5
/LLM/main/L0_MergeRequest_PR pipeline #43862 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

lfr-0531 mentioned this pull request Jun 17, 2026

[None][feat] DSA: adaptive indexer prefill chunk size for long sequences #15458

Closed

github-actions Bot assigned lfr-0531 Jun 17, 2026

lfr-0531 wants to merge 1 commit into NVIDIA:feat/deepseek_v4 from lfr-0531:user/fanrongl/dsv4-indexer-chunk-heuristic
