[None][feat] DSA: adaptive indexer prefill chunk size for long sequences #15458
Closed
lfr-0531 wants to merge 1 commit into
The indexer's fp8_fp4_mqa_logits / fp8_mqa_logits activation memory scales with indexer_max_chunk_size * K_compressed (the compressed KV length of the current request). For very long sequences this can OOM: e.g. a ~500K-token request with the default 32K chunk size needs ~16GB of activation memory.

The previous workaround was to uniformly lower indexer_max_chunk_size (e.g. to 8K), but that costs prefill throughput for the common case where the vast majority of requests are far below these lengths.

Instead, select the indexer prefill chunk size per-batch based on the largest compressed KV length among the context requests:

- max K_compressed > 512K -> 8K chunk
- 256K <= max K_compressed <= 512K -> 16K chunk
- max K_compressed < 256K -> configured chunk size (unchanged)

The heuristic only ever reduces the configured chunk size (never increases it), so the common case keeps its larger, higher-throughput chunk. It is applied only on the indexer's own chunking path (not the MLA chunked-prefill path, which already bounds chunk size) and is safe to vary per-batch because the prefill path does not use CUDA graphs.

Signed-off-by: Fanrong Li <lfr-0531@users.noreply.github.com>
Superseded by #15459 (branch renamed to user/fanrongl/dsv4-indexer-chunk-heuristic). Closing this one.
Background / Motivation
The DSA indexer's `fp8_fp4_mqa_logits` / `fp8_mqa_logits` activation memory scales with `indexer_max_chunk_size * K_compressed` (the compressed KV length of the current request). For very long sequences this can OOM: e.g. a ~500K-token request with the default 32K chunk size needs ~16GB of activation memory.
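The scaling above can be made concrete with a small back-of-the-envelope sketch. It only counts logit entries as `chunk_size * K_compressed`; the per-entry byte size and the ratio between token count and `K_compressed` are not stated in this description, so `bytes_per_entry` and the `500_000` value below are illustrative assumptions, and the helper names are hypothetical.

```python
K = 1024  # assumption: "K" in this description means 1024


def logits_entries(chunk_size: int, k_compressed: int) -> int:
    """Number of logit entries for one indexer chunk: chunk_size * K_compressed."""
    return chunk_size * k_compressed


def logits_bytes(chunk_size: int, k_compressed: int, bytes_per_entry: int) -> int:
    """Activation bytes for a hypothetical per-entry size (not stated in the PR)."""
    return logits_entries(chunk_size, k_compressed) * bytes_per_entry


if __name__ == "__main__":
    k_compressed = 500_000  # illustrative value only
    for chunk in (32 * K, 16 * K, 8 * K):
        entries = logits_entries(chunk, k_compressed)
        print(f"chunk={chunk // K}K: {entries / 1e9:.1f}e9 logit entries")
```

Because the footprint is linear in the chunk size, dropping from a 32K to an 8K chunk cuts it by 4x for the same `K_compressed`, which is why lowering the chunk size avoids the OOM.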
The current workaround is to uniformly lower `indexer_max_chunk_size` (e.g. to 8K). That avoids the OOM but costs prefill throughput for the common case, where the vast majority of requests are far below these lengths: a blunt one-size-fits-all setting for what is really a long-tail problem.
Summary
Select the indexer prefill chunk size per-batch based on the largest compressed KV length (`K_compressed`) among the context requests, instead of always using the statically-configured value:

| max `K_compressed` | Indexer chunk size |
| --- | --- |
| > 512K | 8K |
| [256K, 512K] | 16K |
| < 256K | configured chunk size (unchanged) |

Implementation:

- `select_indexer_chunk_size(configured_chunk_size, max_k_compressed)` in `dsa.py` implements the heuristic. It only ever reduces the configured chunk size (never increases it), so any explicitly-configured value still acts as an upper bound.
- The heuristic is applied in `Indexer.prepare_for_chunked_prefill`, on the indexer's own chunking path only. The MLA chunked-prefill path is untouched (it already bounds the chunk to the MLA chunk).
- max `K_compressed` is read from `indexer_params.kv_lens` (already host-side during prefill prepare).
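The selection rule described above can be sketched as follows. This is a minimal illustration written from the thresholds in this description, not the code in `dsa.py`: the constant names, the `chunk_size_for_batch` helper, the treatment of `kv_lens` as a plain sequence of ints, and the reading of "K" as 1024 are all assumptions.

```python
from typing import Sequence

K = 1024  # assumption: "K" in this description means 1024

# Hypothetical constant names; the PR keeps its values in
# _INDEXER_CHUNK_SIZE_HEURISTIC, whose layout is not shown here.
LONG_THRESHOLD = 512 * K  # strictly above this -> 8K chunk
MID_THRESHOLD = 256 * K   # from this up to LONG_THRESHOLD -> 16K chunk
LONG_CHUNK = 8 * K
MID_CHUNK = 16 * K


def select_indexer_chunk_size(configured_chunk_size: int, max_k_compressed: int) -> int:
    """Pick a chunk size from the largest compressed KV length in the batch."""
    if max_k_compressed > LONG_THRESHOLD:
        cap = LONG_CHUNK
    elif max_k_compressed >= MID_THRESHOLD:
        cap = MID_CHUNK
    else:
        # Below 256K the configured value is used unchanged.
        return configured_chunk_size
    # Only ever reduce: the configured value stays an upper bound.
    return min(configured_chunk_size, cap)


def chunk_size_for_batch(configured_chunk_size: int, kv_lens: Sequence[int]) -> int:
    """Hypothetical per-batch wrapper: kv_lens holds host-side K_compressed values."""
    if not kv_lens:
        return configured_chunk_size
    return select_indexer_chunk_size(configured_chunk_size, max(kv_lens))
```

With a configured 32K chunk, a batch whose longest request has a compressed KV length of 300K would get a 16K chunk under this sketch, while a batch of short requests keeps 32K. A configured value already below the cap (say 4K) is returned as-is.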
Impact
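- Very long sequences: the smaller chunk avoids the OOM without manual tuning.
- Common case: requests below the 256K threshold keep the configured, higher-throughput chunk size.
- An explicitly configured `indexer_max_chunk_size` continues to act as the upper bound.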
Notes
- The thresholds and chunk sizes (8K / 16K / 32K) are centralized in `_INDEXER_CHUNK_SIZE_HEURISTIC` for easy tuning.
- Testing: cover the `select_indexer_chunk_size` boundaries and validate end-to-end memory/perf on a long-sequence workload.
🤖 Generated with Claude Code