[None][refactor] Refine Skip Softmax follow-ups by bobboli · Pull Request #15417 · NVIDIA/TensorRT-LLM

bobboli · 2026-06-16T15:06:45Z

Summary

Refine sparse attention backend selection to use lowered SparseParams instead of user-facing sparse config objects.
Move Skip Softmax Attention kernel parameters into AttentionForwardArgs, with the lightweight kernel-param carrier kept in the shared sparse params module.
Simplify attention metadata construction in PyExecutor and the layer-wise benchmark runner.
Keep hybrid Mamba / linear-attention models on the hybrid KV cache-manager route when Skip Softmax is enabled, while preserving sparse KV cache-manager routing for non-hybrid models.
Split Cosmos3 raw scheduler timestep from normalized transformer-forward timestep so Cosmos follows the VisualGen attention scheduling contract without changing its time embedding input.
Remove the unused WAN kwargs deletion.
Update the VisualGen SAGE attention test to call get_attention_backend() with lowered sparse_params, matching the refactored backend-selection API.
Use paged-context FMHA for standard attention whenever the KV cache uses FP8, so context QKV follows the supported FP8 preprocessing and kernel path.

Details

Sparse backend selection now happens from lowered params. get_attention_backend() takes sparse_params, and create_attention() derives the backend class internally instead of accepting a preselected attn_cls. This keeps user/API config parsing outside backend dispatch while still allowing sparse algorithms to select their backend class.
Skip Softmax kernel params are now carried through AttentionForwardArgs. TrtllmAttention resolves scheduler output into forward_args.skip_softmax_kernel_params, and the thop call reads from forward_args like the other per-forward inputs. The static thop sync test no longer needs a separate skip_softmax_kernel_params source class.
Metadata construction now calls the metadata class directly instead of packing metadata_kwargs and immediately unpacking it. The same cleanup is applied to the layer-wise benchmark runner.
- https://github.com/bobboli/TensorRT-LLM/blob/bli/skip-softmax-followups-placeholder/tensorrt_llm/_torch/pyexecutor/model_engine.py
- https://github.com/bobboli/TensorRT-LLM/blob/bli/skip-softmax-followups-placeholder/tensorrt_llm/tools/layer_wise_benchmarks/runner.py
Hybrid Mamba / linear-attention models now keep the Mamba-capable KV cache manager when the sparse attention algorithm is Skip Softmax. Skip Softmax changes attention kernel execution but does not replace the recurrent-state cache manager required by hybrid models. Other sparse attention algorithms still fail early for this combination, and non-hybrid sparse models continue to use their sparse KV cache-manager route.
- https://github.com/bobboli/TensorRT-LLM/blob/bli/skip-softmax-followups-placeholder/tensorrt_llm/_torch/pyexecutor/_util.py
Cosmos3 now passes normalized timestep to the transformer/attention path and raw_timestep to the Cosmos time embedding path. This keeps scheduler-level/raw-time semantics available to Cosmos while preserving the VisualGen convention that the transformer-forward timestep consumed by attention scheduling is normalized. The audio transformer paths added by [TRTLLM-13120][feat] Cosmos3 Audio Output Support #14827 and their single- and multi-GPU tests follow the same contract.
WAN no longer explicitly deletes unused kwargs; the forward signature still accepts them for compatibility.
- https://github.com/bobboli/TensorRT-LLM/blob/bli/skip-softmax-followups-placeholder/tensorrt_llm/_torch/visual_gen/models/wan/transformer_wan.py
The VisualGen SAGE attention test now passes lowered sparse_params into get_attention_backend(), matching the sparse backend-selection API introduced by this PR.
- https://github.com/bobboli/TensorRT-LLM/blob/bli/skip-softmax-followups-placeholder/tests/unittest/_torch/visual_gen/test_attention_trtllm_sage.py
Standard TRTLLM attention now enables paged-context FMHA whenever its KV cache uses FP8. The existing C++ attention op uses this mode to select FP8 context FMHA and quantize context QKV before kernel dispatch. The rule applies independently of sparse-attention configuration; MLA keeps its separate context path.
- https://github.com/bobboli/TensorRT-LLM/blob/bli/skip-softmax-followups-placeholder/tensorrt_llm/_torch/attention_backend/trtllm.py

bobboli · 2026-06-25T16:12:37Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T16:18:41Z

PR_Github #55827 [ run ] triggered by Bot. Commit: 116eb3b Link to invocation

coderabbitai · 2026-06-25T16:22:05Z

📝 Walkthrough

Walkthrough

The PR introduces shared SkipSoftmaxKernelParams plumbing, routes sparse attention backend selection through SparseParams, and updates Cosmos3 forward calls to pass both normalized timestep and raw_timestep. Related tests and executor wiring were updated.

Changes

Sparse attention parameter plumbing

Layer / File(s)	Summary
Skip-softmax parameter contract `tensorrt_llm/_torch/attention_backend/sparse/params.py`, `tensorrt_llm/_torch/attention_backend/sparse/skip_softmax.py`, `tensorrt_llm/_torch/attention_backend/interface.py`, `tests/unittest/_torch/attention_backend/test_attention_op_sync.py`	`SkipSoftmaxKernelParams` is defined centrally, surfaced on `AttentionForwardArgs`, and the sync test resolves the new skip-softmax kwargs through `forward_args.skip_softmax_kernel_params`.
Sparse backend selection `tensorrt_llm/_torch/attention_backend/sparse/utils.py`, `tensorrt_llm/_torch/attention_backend/utils.py`, `tensorrt_llm/_torch/modules/attention.py`, `tests/unittest/_torch/attention/sparse/test_short_seq_mha.py`, `tests/unittest/_torch/attention/sparse/test_sparse_mla_forward.py`	Sparse backend helpers and attention constructors now accept `SparseParams` and resolve backend classes internally, and the sparse attention tests derive backend params before selecting TRTLLM implementations.
Skip-softmax runtime wiring `tensorrt_llm/_torch/attention_backend/trtllm.py`, `tensorrt_llm/_torch/attention_backend/fmha/fallback.py`, `tensorrt_llm/_torch/attention_backend/fmha/flashinfer_trtllm_gen.py`	`TrtllmAttention.forward()` writes `forward_args.skip_softmax_kernel_params`, and FMHA fallback plus FlashInfer support checks read the new skip-softmax fields.
Executor sparse wiring `tensorrt_llm/_torch/pyexecutor/_util.py`, `tensorrt_llm/_torch/pyexecutor/model_engine.py`, `tensorrt_llm/tools/layer_wise_benchmarks/runner.py`	PyExecutor and benchmark attention setup derive backend selection and metadata from `sparse_params`.

Cosmos3 timestep plumbing

Layer / File(s)	Summary
Forward timestep API `tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py`, `tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py`	Cosmos3 forward methods accept normalized `timestep` plus `raw_timestep`, use the raw value for time embedding, and pass the normalized value downstream.
Timestep test updates `tests/unittest/_torch/visual_gen/test_cosmos3_transformer.py`, `tests/unittest/_torch/visual_gen/multi_gpu/test_cosmos3_transformer_parallel.py`	Cosmos3 unit and distributed tests normalize timestep inputs and pass `raw_timestep` into the updated forward calls.

Sequence Diagram(s)

Sparse attention flow

sequenceDiagram
  participant AttentionInit as Attention.__init__
  participant BackendUtils as get_attention_backend
  participant CreateAttention as create_attention
  participant TrtllmForward as TrtllmAttention.forward
  participant SkipScheduler as SkipSoftmaxScheduler
  participant ForwardArgs as AttentionForwardArgs
  participant FallbackForward as FallbackFmha.forward
  AttentionInit->>BackendUtils: select backend with sparse_params
  AttentionInit->>CreateAttention: build attention module
  CreateAttention->>BackendUtils: resolve backend class from sparse_params
  TrtllmForward->>SkipScheduler: get_kernel_params(timestep)
  SkipScheduler-->>TrtllmForward: SkipSoftmaxKernelParams
  TrtllmForward->>ForwardArgs: set skip_softmax_kernel_params
  FallbackForward->>ForwardArgs: read skip_softmax_threshold_scale_factor_*

Cosmos3 timestep flow

sequenceDiagram
  participant Pipeline as Cosmos3OmniMoTPipeline.forward
  participant Transformer as Cosmos3VFMTransformer.forward
  participant TimeEmbedder as time_embedder
  participant LanguageModel as self.language_model
  participant DecoderLayer as GEN decoder layer
  Pipeline->>Transformer: pass timestep and raw_timestep
  Transformer->>TimeEmbedder: embed raw_timestep
  Transformer->>LanguageModel: forward timestep=...
  Transformer->>DecoderLayer: forward timestep=...

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#15204: Also changes FMHA and FlashInfer attention plumbing around skip-softmax handling and related backend structure.
NVIDIA/TensorRT-LLM#15545: Also changes Cosmos3 timestep plumbing on the Cosmos3VFMTransformer call path.

Suggested reviewers

QiJune
yuxianq
brb-nv
chang-l

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and accurately summarizes the main Skip Softmax follow-up refactor.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The description is detailed and covers the main changes, but it omits the template's Test Coverage, PR Checklist, and PR title sections.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

tensorrt-cicd · 2026-06-25T23:39:51Z

PR_Github #55827 [ run ] completed with state FAILURE. Commit: 116eb3b
/LLM/main/L0_MergeRequest_PR pipeline #44718 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-06-26T01:23:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-26T01:30:20Z

PR_Github #55928 [ run ] triggered by Bot. Commit: cde7ef7 Link to invocation

tensorrt-cicd · 2026-06-26T08:37:04Z

PR_Github #55928 [ run ] completed with state SUCCESS. Commit: cde7ef7
/LLM/main/L0_MergeRequest_PR pipeline #44810 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-06-26T09:28:02Z

/bot run --reuse-test

tensorrt-cicd · 2026-06-26T09:33:41Z

PR_Github #56056 [ run ] triggered by Bot. Commit: cde7ef7 Link to invocation

tensorrt-cicd · 2026-06-26T16:04:55Z

PR_Github #56056 [ run ] completed with state SUCCESS. Commit: cde7ef7
/LLM/main/L0_MergeRequest_PR pipeline #44925 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-06-28T11:19:41Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-28T11:25:10Z

PR_Github #56227 [ run ] triggered by Bot. Commit: e2c83ec Link to invocation

tensorrt-cicd · 2026-06-28T18:16:16Z

PR_Github #56227 [ run ] completed with state SUCCESS. Commit: e2c83ec
/LLM/main/L0_MergeRequest_PR pipeline #45089 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-07-01T03:26:39Z

/bot run --disable-fail-fast --reuse-test

tensorrt-cicd · 2026-07-01T03:32:01Z

PR_Github #56813 [ run ] triggered by Bot. Commit: 3e5e59a Link to invocation

tensorrt-cicd · 2026-07-01T09:09:49Z

PR_Github #56813 [ run ] completed with state SUCCESS. Commit: 3e5e59a
/LLM/main/L0_MergeRequest_PR pipeline #45627 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-07-02T10:41:14Z

/bot run --disable-fail-fast

bobboli · 2026-07-02T14:18:39Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-02T14:25:11Z

PR_Github #57195 [ run ] triggered by Bot. Commit: cfbe0b4 Link to invocation

tensorrt-cicd · 2026-07-02T19:56:15Z

PR_Github #57195 [ run ] completed with state SUCCESS. Commit: cfbe0b4
/LLM/main/L0_MergeRequest_PR pipeline #45967 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-07-03T02:36:50Z

/bot run --reuse-test

tensorrt-cicd · 2026-07-03T02:44:01Z

PR_Github #57318 [ run ] triggered by Bot. Commit: cfbe0b4 Link to invocation

tensorrt-cicd · 2026-07-03T03:00:44Z

PR_Github #57318 [ run ] completed with state FAILURE. Commit: cfbe0b4
/LLM/main/L0_MergeRequest_PR pipeline #46078 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

bobboli · 2026-07-03T04:33:31Z

/bot run --reuse-test

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

bobboli · 2026-07-03T09:41:17Z

/bot run --disable-fail-fast

yiqingy0 · 2026-07-04T05:52:38Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-07-04T05:57:58Z

PR_Github #57542 [ run ] triggered by Bot. Commit: 9717e24 Link to invocation

github-actions Bot assigned bobboli Jun 16, 2026

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 52c2c17 to 1f4197b Compare June 17, 2026 03:40

bobboli changed the title ~~[None][follow-up] Placeholder for Skip Softmax follow-ups~~ [None][refactor] Refine Skip Softmax follow-ups Jun 17, 2026

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 1f4197b to bf744a9 Compare June 17, 2026 03:41

bobboli mentioned this pull request Jun 25, 2026

[None][fix] Keep hybrid KV routing for Skip Softmax #15403

Closed

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 1cbe0f3 to ca197b0 Compare June 25, 2026 12:12

bobboli marked this pull request as ready for review June 25, 2026 16:12

bobboli requested review from a team as code owners June 25, 2026 16:12

bobboli requested review from lancelly and pengbowang-nv June 25, 2026 16:12

bobboli mentioned this pull request Jun 25, 2026

[TRTLLM-13120][feat] Cosmos3 Audio Output Support #14827

Merged

1 task

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 70fea9f to cde7ef7 Compare June 26, 2026 01:17

bobboli requested review from chang-l and yuxianq June 26, 2026 04:39

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from cde7ef7 to e2c83ec Compare June 28, 2026 09:48

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 9233deb to 3e5e59a Compare July 1, 2026 03:25

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 3e5e59a to cfbe0b4 Compare July 2, 2026 10:36

bobboli requested a review from a team as a code owner July 2, 2026 10:36

bobboli requested a review from xxi-nv July 2, 2026 10:36

xxi-nv requested review from QiJune and removed request for xxi-nv July 3, 2026 00:38

bobboli added 8 commits July 3, 2026 17:10

Refine sparse attention backend plumbing

d4b73e8

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

[None][fix] Keep hybrid KV routing for Skip Softmax

c93790d

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

[None][fix] Split Cosmos raw and normalized timesteps

234a8bf

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

[None][fix] Update SAGE test sparse backend lookup

231320b

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

[None][fix] Lower DeepSeek V4 sparse params consistently

ae08c59

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

[None][fix] Use paged context FMHA with FP8 KV cache

1c523a5

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

[None][fix] Clarify FP8 context routing

754b339

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

[None][test] Update Cosmos3 audio timestep inputs

9717e24

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>

bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from cfbe0b4 to 9717e24 Compare July 3, 2026 09:10

yuxianq approved these changes Jul 3, 2026

View reviewed changes

Uh oh!

Conversation

bobboli commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Details

Uh oh!

bobboli commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

coderabbitai Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

bobboli commented Jun 26, 2026

Uh oh!

tensorrt-cicd commented Jun 26, 2026

Uh oh!

tensorrt-cicd commented Jun 26, 2026

Uh oh!

bobboli commented Jun 26, 2026

Uh oh!

tensorrt-cicd commented Jun 26, 2026

Uh oh!

tensorrt-cicd commented Jun 26, 2026

Uh oh!

bobboli commented Jun 28, 2026

Uh oh!

tensorrt-cicd commented Jun 28, 2026

Uh oh!

tensorrt-cicd commented Jun 28, 2026

Uh oh!

bobboli commented Jul 1, 2026

Uh oh!

tensorrt-cicd commented Jul 1, 2026

Uh oh!

tensorrt-cicd commented Jul 1, 2026

Uh oh!

bobboli commented Jul 2, 2026

Uh oh!

bobboli commented Jul 2, 2026

Uh oh!

tensorrt-cicd commented Jul 2, 2026

Uh oh!

tensorrt-cicd commented Jul 2, 2026

Uh oh!

bobboli commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

tensorrt-cicd commented Jul 3, 2026

Uh oh!

bobboli commented Jul 3, 2026

Uh oh!

bobboli commented Jul 3, 2026

Uh oh!

yiqingy0 commented Jul 4, 2026

Uh oh!

tensorrt-cicd commented Jul 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

bobboli commented Jun 16, 2026 •

edited

Loading

coderabbitai Bot commented Jun 25, 2026 •

edited

Loading