Skip to content

[None][refactor] Refine Skip Softmax follow-ups#15417

Open
bobboli wants to merge 8 commits into
NVIDIA:mainfrom
bobboli:bli/skip-softmax-followups-placeholder
Open

[None][refactor] Refine Skip Softmax follow-ups#15417
bobboli wants to merge 8 commits into
NVIDIA:mainfrom
bobboli:bli/skip-softmax-followups-placeholder

Conversation

@bobboli

@bobboli bobboli commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Summary

  1. Refine sparse attention backend selection to use lowered SparseParams instead of user-facing sparse config objects.
  2. Move Skip Softmax Attention kernel parameters into AttentionForwardArgs, with the lightweight kernel-param carrier kept in the shared sparse params module.
  3. Simplify attention metadata construction in PyExecutor and the layer-wise benchmark runner.
  4. Keep hybrid Mamba / linear-attention models on the hybrid KV cache-manager route when Skip Softmax is enabled, while preserving sparse KV cache-manager routing for non-hybrid models.
  5. Split Cosmos3 raw scheduler timestep from normalized transformer-forward timestep so Cosmos follows the VisualGen attention scheduling contract without changing its time embedding input.
  6. Remove the unused WAN kwargs deletion.
  7. Update the VisualGen SAGE attention test to call get_attention_backend() with lowered sparse_params, matching the refactored backend-selection API.
  8. Use paged-context FMHA for standard attention whenever the KV cache uses FP8, so context QKV follows the supported FP8 preprocessing and kernel path.

Details

  1. Sparse backend selection now happens from lowered params. get_attention_backend() takes sparse_params, and create_attention() derives the backend class internally instead of accepting a preselected attn_cls. This keeps user/API config parsing outside backend dispatch while still allowing sparse algorithms to select their backend class.

  2. Skip Softmax kernel params are now carried through AttentionForwardArgs. TrtllmAttention resolves scheduler output into forward_args.skip_softmax_kernel_params, and the thop call reads from forward_args like the other per-forward inputs. The static thop sync test no longer needs a separate skip_softmax_kernel_params source class.

  3. Metadata construction now calls the metadata class directly instead of packing metadata_kwargs and immediately unpacking it. The same cleanup is applied to the layer-wise benchmark runner.

  4. Hybrid Mamba / linear-attention models now keep the Mamba-capable KV cache manager when the sparse attention algorithm is Skip Softmax. Skip Softmax changes attention kernel execution but does not replace the recurrent-state cache manager required by hybrid models. Other sparse attention algorithms still fail early for this combination, and non-hybrid sparse models continue to use their sparse KV cache-manager route.

  5. Cosmos3 now passes normalized timestep to the transformer/attention path and raw_timestep to the Cosmos time embedding path. This keeps scheduler-level/raw-time semantics available to Cosmos while preserving the VisualGen convention that the transformer-forward timestep consumed by attention scheduling is normalized. The audio transformer paths added by [TRTLLM-13120][feat] Cosmos3 Audio Output Support #14827 and their single- and multi-GPU tests follow the same contract.

  6. WAN no longer explicitly deletes unused kwargs; the forward signature still accepts them for compatibility.

  7. The VisualGen SAGE attention test now passes lowered sparse_params into get_attention_backend(), matching the sparse backend-selection API introduced by this PR.

  8. Standard TRTLLM attention now enables paged-context FMHA whenever its KV cache uses FP8. The existing C++ attention op uses this mode to select FP8 context FMHA and quantize context QKV before kernel dispatch. The rule applies independently of sparse-attention configuration; MLA keeps its separate context path.

@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 52c2c17 to 1f4197b Compare June 17, 2026 03:40
@bobboli bobboli changed the title [None][follow-up] Placeholder for Skip Softmax follow-ups [None][refactor] Refine Skip Softmax follow-ups Jun 17, 2026
@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 1f4197b to bf744a9 Compare June 17, 2026 03:41
@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 1cbe0f3 to ca197b0 Compare June 25, 2026 12:12
@bobboli bobboli marked this pull request as ready for review June 25, 2026 16:12
@bobboli bobboli requested review from a team as code owners June 25, 2026 16:12
@bobboli

bobboli commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55827 [ run ] triggered by Bot. Commit: 116eb3b Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR introduces shared SkipSoftmaxKernelParams plumbing, routes sparse attention backend selection through SparseParams, and updates Cosmos3 forward calls to pass both normalized timestep and raw_timestep. Related tests and executor wiring were updated.

Changes

Sparse attention parameter plumbing

Layer / File(s) Summary
Skip-softmax parameter contract
tensorrt_llm/_torch/attention_backend/sparse/params.py, tensorrt_llm/_torch/attention_backend/sparse/skip_softmax.py, tensorrt_llm/_torch/attention_backend/interface.py, tests/unittest/_torch/attention_backend/test_attention_op_sync.py
SkipSoftmaxKernelParams is defined centrally, surfaced on AttentionForwardArgs, and the sync test resolves the new skip-softmax kwargs through forward_args.skip_softmax_kernel_params.
Sparse backend selection
tensorrt_llm/_torch/attention_backend/sparse/utils.py, tensorrt_llm/_torch/attention_backend/utils.py, tensorrt_llm/_torch/modules/attention.py, tests/unittest/_torch/attention/sparse/test_short_seq_mha.py, tests/unittest/_torch/attention/sparse/test_sparse_mla_forward.py
Sparse backend helpers and attention constructors now accept SparseParams and resolve backend classes internally, and the sparse attention tests derive backend params before selecting TRTLLM implementations.
Skip-softmax runtime wiring
tensorrt_llm/_torch/attention_backend/trtllm.py, tensorrt_llm/_torch/attention_backend/fmha/fallback.py, tensorrt_llm/_torch/attention_backend/fmha/flashinfer_trtllm_gen.py
TrtllmAttention.forward() writes forward_args.skip_softmax_kernel_params, and FMHA fallback plus FlashInfer support checks read the new skip-softmax fields.
Executor sparse wiring
tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/model_engine.py, tensorrt_llm/tools/layer_wise_benchmarks/runner.py
PyExecutor and benchmark attention setup derive backend selection and metadata from sparse_params.

Cosmos3 timestep plumbing

Layer / File(s) Summary
Forward timestep API
tensorrt_llm/_torch/visual_gen/models/cosmos3/transformer_cosmos3.py, tensorrt_llm/_torch/visual_gen/models/cosmos3/pipeline_cosmos3.py
Cosmos3 forward methods accept normalized timestep plus raw_timestep, use the raw value for time embedding, and pass the normalized value downstream.
Timestep test updates
tests/unittest/_torch/visual_gen/test_cosmos3_transformer.py, tests/unittest/_torch/visual_gen/multi_gpu/test_cosmos3_transformer_parallel.py
Cosmos3 unit and distributed tests normalize timestep inputs and pass raw_timestep into the updated forward calls.

Sequence Diagram(s)

Sparse attention flow

sequenceDiagram
  participant AttentionInit as Attention.__init__
  participant BackendUtils as get_attention_backend
  participant CreateAttention as create_attention
  participant TrtllmForward as TrtllmAttention.forward
  participant SkipScheduler as SkipSoftmaxScheduler
  participant ForwardArgs as AttentionForwardArgs
  participant FallbackForward as FallbackFmha.forward
  AttentionInit->>BackendUtils: select backend with sparse_params
  AttentionInit->>CreateAttention: build attention module
  CreateAttention->>BackendUtils: resolve backend class from sparse_params
  TrtllmForward->>SkipScheduler: get_kernel_params(timestep)
  SkipScheduler-->>TrtllmForward: SkipSoftmaxKernelParams
  TrtllmForward->>ForwardArgs: set skip_softmax_kernel_params
  FallbackForward->>ForwardArgs: read skip_softmax_threshold_scale_factor_*
Loading

Cosmos3 timestep flow

sequenceDiagram
  participant Pipeline as Cosmos3OmniMoTPipeline.forward
  participant Transformer as Cosmos3VFMTransformer.forward
  participant TimeEmbedder as time_embedder
  participant LanguageModel as self.language_model
  participant DecoderLayer as GEN decoder layer
  Pipeline->>Transformer: pass timestep and raw_timestep
  Transformer->>TimeEmbedder: embed raw_timestep
  Transformer->>LanguageModel: forward timestep=...
  Transformer->>DecoderLayer: forward timestep=...
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • QiJune
  • yuxianq
  • brb-nv
  • chang-l
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 44.44% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and accurately summarizes the main Skip Softmax follow-up refactor.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The description is detailed and covers the main changes, but it omits the template's Test Coverage, PR Checklist, and PR title sections.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55827 [ run ] completed with state FAILURE. Commit: 116eb3b
/LLM/main/L0_MergeRequest_PR pipeline #44718 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 70fea9f to cde7ef7 Compare June 26, 2026 01:17
@bobboli

bobboli commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55928 [ run ] triggered by Bot. Commit: cde7ef7 Link to invocation

@bobboli bobboli requested review from chang-l and yuxianq June 26, 2026 04:39
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55928 [ run ] completed with state SUCCESS. Commit: cde7ef7
/LLM/main/L0_MergeRequest_PR pipeline #44810 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bobboli

bobboli commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --reuse-test

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56056 [ run ] triggered by Bot. Commit: cde7ef7 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56056 [ run ] completed with state SUCCESS. Commit: cde7ef7
/LLM/main/L0_MergeRequest_PR pipeline #44925 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from cde7ef7 to e2c83ec Compare June 28, 2026 09:48
@bobboli

bobboli commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56227 [ run ] triggered by Bot. Commit: e2c83ec Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56227 [ run ] completed with state SUCCESS. Commit: e2c83ec
/LLM/main/L0_MergeRequest_PR pipeline #45089 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 9233deb to 3e5e59a Compare July 1, 2026 03:25
@bobboli

bobboli commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56813 [ run ] triggered by Bot. Commit: 3e5e59a Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #56813 [ run ] completed with state SUCCESS. Commit: 3e5e59a
/LLM/main/L0_MergeRequest_PR pipeline #45627 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from 3e5e59a to cfbe0b4 Compare July 2, 2026 10:36
@bobboli bobboli requested a review from a team as a code owner July 2, 2026 10:36
@bobboli bobboli requested a review from xxi-nv July 2, 2026 10:36
@bobboli

bobboli commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

1 similar comment
@bobboli

bobboli commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57195 [ run ] triggered by Bot. Commit: cfbe0b4 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57195 [ run ] completed with state SUCCESS. Commit: cfbe0b4
/LLM/main/L0_MergeRequest_PR pipeline #45967 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@xxi-nv xxi-nv requested review from QiJune and removed request for xxi-nv July 3, 2026 00:38
@bobboli

bobboli commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --reuse-test

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57318 [ run ] triggered by Bot. Commit: cfbe0b4 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57318 [ run ] completed with state FAILURE. Commit: cfbe0b4
/LLM/main/L0_MergeRequest_PR pipeline #46078 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@bobboli

bobboli commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --reuse-test

bobboli added 8 commits July 3, 2026 17:10
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
@bobboli bobboli force-pushed the bli/skip-softmax-followups-placeholder branch from cfbe0b4 to 9717e24 Compare July 3, 2026 09:10
@bobboli

bobboli commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@yiqingy0

yiqingy0 commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #57542 [ run ] triggered by Bot. Commit: 9717e24 Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants