[None][feat] DSA: adaptive indexer prefill chunk size for long sequences by lfr-0531 · Pull Request #15458 · NVIDIA/TensorRT-LLM

lfr-0531 · 2026-06-17T12:53:44Z

Background / Motivation

The DSA indexer's fp8_fp4_mqa_logits / fp8_mqa_logits activation memory
scales with indexer_max_chunk_size * K_compressed (the compressed KV length
of the current request). For very long sequences this can OOM — e.g. a
~500K-token request with the default 32K chunk size needs ~16GB of activation
memory.

The current workaround is to uniformly lower indexer_max_chunk_size (e.g. to
8K). That avoids the OOM but costs prefill throughput for the common case,
where the vast majority of requests are far below these lengths — a blunt
one-size-fits-all setting for what is really a long-tail problem.

Summary

Select the indexer prefill chunk size per-batch based on the largest
compressed KV length (K_compressed) among the context requests, instead of
always using the statically-configured value:

| max K_compressed in batch | effective chunk size |
| --- | --- |
| `> 512K` | 8K |
| `[256K, 512K]` | 16K |
| `< 256K` | configured (default 32K, unchanged) |

Implementation:

- New helper select_indexer_chunk_size(configured_chunk_size, max_k_compressed)
  in dsa.py. It only ever reduces the configured chunk size (never
  increases it), so any explicitly-configured value still acts as an upper
  bound.
- Wired into Indexer.prepare_for_chunked_prefill, on the indexer's own
  chunking path only. The MLA chunked-prefill path is untouched (it already
  bounds the chunk to the MLA chunk).
- max K_compressed is read from indexer_params.kv_lens (already host-side
  during prefill prepare).
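
To make the selection rule concrete, here is a minimal sketch of how such a helper could look. It is not the code from this PR: the layout of `_INDEXER_CHUNK_SIZE_HEURISTIC`, the `chunk_size_for_batch` wrapper, and the reading of "K" as 1024 are all assumptions made for illustration, and `kv_lens` is modeled as a plain list of integers rather than the real `indexer_params.kv_lens`.

```python
from typing import Sequence

K = 1024  # assumption: "K" means 1024; the PR description does not say

# Hypothetical layout for the centralized table. The PR names it
# _INDEXER_CHUNK_SIZE_HEURISTIC but does not show its structure.
# Each entry is (smallest max K_compressed that triggers the rule, chunk cap),
# ordered from the largest threshold down.
_INDEXER_CHUNK_SIZE_HEURISTIC = (
    (512 * K + 1, 8 * K),  # max K_compressed > 512K        -> 8K
    (256 * K, 16 * K),     # max K_compressed in [256K, 512K] -> 16K
)


def select_indexer_chunk_size(configured_chunk_size: int,
                              max_k_compressed: int) -> int:
    """Return the effective chunk size; never larger than the configured one."""
    for min_k_compressed, chunk_cap in _INDEXER_CHUNK_SIZE_HEURISTIC:
        if max_k_compressed >= min_k_compressed:
            # min() keeps an explicitly configured smaller value as the bound.
            return min(configured_chunk_size, chunk_cap)
    # Below every threshold: keep the configured value unchanged.
    return configured_chunk_size


def chunk_size_for_batch(configured_chunk_size: int,
                         kv_lens: Sequence[int]) -> int:
    """Hypothetical wrapper: pick the chunk size from the longest request."""
    return select_indexer_chunk_size(configured_chunk_size,
                                     max(kv_lens, default=0))
```

Taking `min()` with the configured value is what keeps the heuristic reduce-only: a deployment that already sets `indexer_max_chunk_size` to 4K stays at 4K even for the longest requests.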

Impact

- Long requests (>=256K): smaller chunk → lower indexer activation memory,
  avoids the OOM without manual tuning.
- Common case (<256K): behavior unchanged, keeps the larger,
  higher-throughput chunk size.
- Safe to vary per-batch because the prefill path does not use CUDA graphs.
- No API/config changes; existing indexer_max_chunk_size continues to act as
  the upper bound.
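
The memory effect follows directly from the stated scaling, chunk size times K_compressed. The sketch below only compares relative sizes: the element width of the logits buffer and the ratio between token count and K_compressed are not given in this description, so no absolute byte figure is derived. The function names are hypothetical, and "K" is again assumed to mean 1024.

```python
K = 1024  # assumption: "K" means 1024


def logits_elements(chunk_size: int, k_compressed: int) -> int:
    """Number of logits entries for one indexer chunk (chunk_size x K_compressed)."""
    return chunk_size * k_compressed


def relative_footprint(effective_chunk_size: int,
                       configured_chunk_size: int,
                       k_compressed: int) -> float:
    """Footprint with the reduced chunk, as a fraction of the configured one."""
    return (logits_elements(effective_chunk_size, k_compressed)
            / logits_elements(configured_chunk_size, k_compressed))


# With a configured 32K chunk, the 16K tier halves the logits buffer and the
# 8K tier quarters it, independent of the element width.
half = relative_footprint(16 * K, 32 * K, 300 * K)
quarter = relative_footprint(8 * K, 32 * K, 600 * K)
```

Because the buffer is linear in the chunk size, the saving is the same fraction for every K_compressed within a tier.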

Notes

- Draft for early review / discussion. Thresholds (256K / 512K) and chunk
  sizes (8K / 16K / 32K) are centralized in _INDEXER_CHUNK_SIZE_HEURISTIC for
  easy tuning.
- TODO before un-drafting: add a unit test for select_indexer_chunk_size
  boundaries and validate end-to-end memory/perf on a long-sequence workload.

🤖 Generated with Claude Code

The indexer's fp8_fp4_mqa_logits / fp8_mqa_logits activation memory scales with indexer_max_chunk_size * K_compressed (the compressed KV length of the current request). For very long sequences this can OOM: e.g. a ~500K-token request with the default 32K chunk size needs ~16GB of activation memory.

The previous workaround was to uniformly lower indexer_max_chunk_size (e.g. to 8K), but that costs prefill throughput for the common case where the vast majority of requests are far below these lengths.

Instead, select the indexer prefill chunk size per-batch based on the largest compressed KV length among the context requests:

- max K_compressed > 512K -> 8K chunk
- 256K <= max K_compressed <= 512K -> 16K chunk
- max K_compressed < 256K -> configured chunk size (unchanged)

The heuristic only ever reduces the configured chunk size (never increases it), so the common case keeps its larger, higher-throughput chunk. It is applied only on the indexer's own chunking path (not the MLA chunked-prefill path, which already bounds chunk size) and is safe to vary per-batch because the prefill path does not use CUDA graphs.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>

lfr-0531 · 2026-06-17T13:02:44Z

Superseded by #15459 (branch renamed to user/fanrongl/dsv4-indexer-chunk-heuristic). Closing this one.

github-actions Bot assigned lfr-0531 Jun 17, 2026

lfr-0531 force-pushed the lfr/dsv4-indexer-chunk-heuristic branch from 7e710e5 to ff0e8a5 June 17, 2026 13:01

lfr-0531 closed this Jun 17, 2026

lfr-0531 deleted the lfr/dsv4-indexer-chunk-heuristic branch June 17, 2026 13:02

lfr-0531 wants to merge 1 commit into NVIDIA:feat/deepseek_v4 from lfr-0531:lfr/dsv4-indexer-chunk-heuristic
