[None][feat] DSA: adaptive indexer prefill chunk size for long sequences#15459

Draft
lfr-0531 wants to merge 1 commit into NVIDIA:feat/deepseek_v4 from lfr-0531:user/fanrongl/dsv4-indexer-chunk-heuristic

@lfr-0531

Collaborator

Background / Motivation

The DSA indexer's fp8_fp4_mqa_logits / fp8_mqa_logits activation memory
scales with indexer_max_chunk_size * K_compressed (the compressed KV length
of the current request). For very long sequences this can OOM — e.g. a
~500K-token request with the default 32K chunk size needs ~16GB of activation
memory.
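
For reviewers, a rough back-of-the-envelope sketch of the scaling described above. It assumes the logits buffer is a dense [chunk_size, K_compressed] array with 4-byte elements; that layout, the helper name `logits_activation_bytes`, and the example K_compressed value are illustrative assumptions, not measurements from this PR.

```python
GIB = 1024 ** 3
K = 1024  # assumption: "K" means 1024


def logits_activation_bytes(chunk_size: int, k_compressed: int,
                            bytes_per_element: int = 4) -> int:
    """Size of a dense [chunk_size, k_compressed] buffer (assumed layout)."""
    return chunk_size * k_compressed * bytes_per_element


# Hypothetical compressed KV length, chosen only to illustrate the scaling.
k_compressed = 128 * K

for chunk in (32 * K, 16 * K, 8 * K):
    size = logits_activation_bytes(chunk, k_compressed)
    print(f"chunk={chunk // K}K -> {size / GIB:.0f} GiB")
```

Under these assumptions the buffer shrinks linearly with the chunk size, so dropping from 32K to 8K cuts it by 4x.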

The current workaround is to uniformly lower indexer_max_chunk_size (e.g. to
8K). That avoids the OOM but costs prefill throughput for the common case,
where the vast majority of requests are far below these lengths — a blunt
one-size-fits-all setting for what is really a long-tail problem.

Summary

Select the indexer prefill chunk size per-batch based on the largest
compressed KV length (K_compressed) among the context requests, instead of
always using the statically-configured value:

| max K_compressed in batch | effective chunk size |
|---|---|
| > 512K | 8K |
| [256K, 512K] | 16K |
| < 256K | configured (default 32K, unchanged) |

Implementation:

  • New helper select_indexer_chunk_size(configured_chunk_size, max_k_compressed)
    in dsa.py. It only ever reduces the configured chunk size (never
    increases it), so any explicitly-configured value still acts as an upper
    bound.
  • Wired into Indexer.prepare_for_chunked_prefill, on the indexer's own
    chunking path only. The MLA chunked-prefill path is untouched (it already
    bounds the chunk to the MLA chunk).
  • max K_compressed is read from indexer_params.kv_lens (already host-side
    during prefill prepare).
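
A minimal sketch of how the helper could look, written from the table in this description rather than copied from dsa.py. The tuple layout of `_INDEXER_CHUNK_SIZE_HEURISTIC`, the reading of "K" as 1024, and the plain Python list standing in for `indexer_params.kv_lens` are all assumptions made for illustration.

```python
K = 1024  # assumption: "K" in the thresholds means 1024

# Assumed layout: (lower bound on max K_compressed, bound inclusive?, chunk cap).
# Entries are checked top-down, largest threshold first.
_INDEXER_CHUNK_SIZE_HEURISTIC = (
    (512 * K, False, 8 * K),   # max K_compressed >  512K -> 8K
    (256 * K, True, 16 * K),   # max K_compressed >= 256K -> 16K
)


def select_indexer_chunk_size(configured_chunk_size: int,
                              max_k_compressed: int) -> int:
    """Return a chunk size that never exceeds the configured one."""
    for lower, inclusive, cap in _INDEXER_CHUNK_SIZE_HEURISTIC:
        hit = (max_k_compressed >= lower) if inclusive else (max_k_compressed > lower)
        if hit:
            # min() keeps an explicitly configured smaller value as the upper bound.
            return min(configured_chunk_size, cap)
    return configured_chunk_size


# Per-batch usage: take the largest compressed KV length among context requests.
kv_lens = [4 * K, 300 * K, 96 * K]  # stand-in for indexer_params.kv_lens
chunk_size = select_indexer_chunk_size(32 * K, max(kv_lens))
print(chunk_size)  # 16384
```

Taking the `min` with the configured value is what makes the heuristic reduce-only: a deployment that already sets a chunk size below the cap is left untouched.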

Impact

  • Long requests (>= 256K): smaller chunk → lower indexer activation memory,
    avoids the OOM without manual tuning.
  • Common case (<256K): behavior unchanged — keeps the larger,
    higher-throughput chunk size.
  • Safe to vary per-batch because the prefill path does not use CUDA graphs.
  • No API/config changes; existing indexer_max_chunk_size continues to act as
    the upper bound.

Notes

  • Draft for early review / discussion. Thresholds (256K / 512K) and chunk
    sizes (8K / 16K / 32K) are centralized in _INDEXER_CHUNK_SIZE_HEURISTIC for
    easy tuning.
  • TODO before un-drafting: add a unit test for select_indexer_chunk_size
    boundaries and validate end-to-end memory/perf on a long-sequence workload.
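
As a starting point for the boundary test in the TODO, one possible shape is sketched below. The helper here is a stand-in written from the table in this PR (with "K" assumed to mean 1024), not an import from dsa.py, and the case list is only a suggestion.

```python
K = 1024  # assumption: "K" means 1024


def select_indexer_chunk_size(configured_chunk_size: int,
                              max_k_compressed: int) -> int:
    """Stand-in for the dsa.py helper, written from the table in this PR."""
    if max_k_compressed > 512 * K:
        return min(configured_chunk_size, 8 * K)
    if max_k_compressed >= 256 * K:
        return min(configured_chunk_size, 16 * K)
    return configured_chunk_size


# (configured chunk size, max K_compressed, expected effective chunk size)
BOUNDARY_CASES = [
    (32 * K, 256 * K - 1, 32 * K),   # just below the first threshold
    (32 * K, 256 * K, 16 * K),       # lower edge of [256K, 512K]
    (32 * K, 512 * K, 16 * K),       # upper edge of [256K, 512K]
    (32 * K, 512 * K + 1, 8 * K),    # just above 512K
    (8 * K, 256 * K, 8 * K),         # configured value already below the cap
    (4 * K, 600 * K, 4 * K),         # never increased above the configured value
]


def test_select_indexer_chunk_size_boundaries() -> None:
    for configured, max_k, expected in BOUNDARY_CASES:
        got = select_indexer_chunk_size(configured, max_k)
        assert got == expected, (configured, max_k, got, expected)
        assert got <= configured  # reduce-only invariant


if __name__ == "__main__":
    test_select_indexer_chunk_size_boundaries()
```

The last assertion checks the reduce-only property separately from the exact values, so it would still catch a regression if the thresholds are retuned.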

🤖 Generated with Claude Code

The indexer's fp8_fp4_mqa_logits / fp8_mqa_logits activation memory scales
with indexer_max_chunk_size * K_compressed (the compressed KV length of the
current request). For very long sequences this can OOM: e.g. a ~500K-token
request with the default 32K chunk size needs ~16GB of activation memory.

The previous workaround was to uniformly lower indexer_max_chunk_size (e.g.
to 8K), but that costs prefill throughput for the common case where the vast
majority of requests are far below these lengths.

Instead, select the indexer prefill chunk size per-batch based on the largest
compressed KV length among the context requests:

  - max K_compressed >  512K          -> 8K  chunk
  - 256K <= max K_compressed <= 512K  -> 16K chunk
  - max K_compressed <  256K          -> configured chunk size (unchanged)

The heuristic only ever reduces the configured chunk size (never increases
it), so the common case keeps its larger, higher-throughput chunk. It is
applied only on the indexer's own chunking path (not the MLA chunked-prefill
path, which already bounds chunk size) and is safe to vary per-batch because
the prefill path does not use CUDA graphs.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
@lfr-0531

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54853 [ run ] triggered by Bot. Commit: ff0e8a5 Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54853 [ run ] completed with state SUCCESS. Commit: ff0e8a5
/LLM/main/L0_MergeRequest_PR pipeline #43862 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again



2 participants