[None][refactor] Refine Skip Softmax follow-ups #15417
Walkthrough
The PR introduces shared …
Changes
Sparse attention parameter plumbing
Cosmos3 timestep plumbing
Sequence Diagram(s)
Sparse attention flow
sequenceDiagram
participant AttentionInit as Attention.__init__
participant BackendUtils as get_attention_backend
participant CreateAttention as create_attention
participant TrtllmForward as TrtllmAttention.forward
participant SkipScheduler as SkipSoftmaxScheduler
participant ForwardArgs as AttentionForwardArgs
participant FallbackForward as FallbackFmha.forward
AttentionInit->>BackendUtils: select backend with sparse_params
AttentionInit->>CreateAttention: build attention module
CreateAttention->>BackendUtils: resolve backend class from sparse_params
TrtllmForward->>SkipScheduler: get_kernel_params(timestep)
SkipScheduler-->>TrtllmForward: SkipSoftmaxKernelParams
TrtllmForward->>ForwardArgs: set skip_softmax_kernel_params
FallbackForward->>ForwardArgs: read skip_softmax_threshold_scale_factor_*
Cosmos3 timestep flow
sequenceDiagram
participant Pipeline as Cosmos3OmniMoTPipeline.forward
participant Transformer as Cosmos3VFMTransformer.forward
participant TimeEmbedder as time_embedder
participant LanguageModel as self.language_model
participant DecoderLayer as GEN decoder layer
Pipeline->>Transformer: pass timestep and raw_timestep
Transformer->>TimeEmbedder: embed raw_timestep
Transformer->>LanguageModel: forward timestep=...
Transformer->>DecoderLayer: forward timestep=...
Estimated code review effort: 4 (Complex), ~60 minutes
Pre-merge checks: 4 passed, 1 failed (1 warning)
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
Summary
- … `SparseParams` instead of user-facing sparse config objects.
- … `AttentionForwardArgs`, with the lightweight kernel-param carrier kept in the shared sparse params module.
- … `kwargs` deletion.
- … `get_attention_backend()` with lowered `sparse_params`, matching the refactored backend-selection API.
Details
Sparse backend selection now happens from lowered params. `get_attention_backend()` takes `sparse_params`, and `create_attention()` derives the backend class internally instead of accepting a preselected `attn_cls`. This keeps user/API config parsing outside backend dispatch while still allowing sparse algorithms to select their backend class.
Skip Softmax kernel params are now carried through `AttentionForwardArgs`. `TrtllmAttention` resolves scheduler output into `forward_args.skip_softmax_kernel_params`, and the thop call reads from `forward_args` like the other per-forward inputs. The static thop sync test no longer needs a separate `skip_softmax_kernel_params` source class.
Metadata construction now calls the metadata class directly instead of packing `metadata_kwargs` and immediately unpacking it. The same cleanup is applied to the layer-wise benchmark runner.
Hybrid Mamba / linear-attention models now keep the Mamba-capable KV cache manager when the sparse attention algorithm is Skip Softmax. Skip Softmax changes attention kernel execution but does not replace the recurrent-state cache manager required by hybrid models. Other sparse attention algorithms still fail early for this combination, and non-hybrid sparse models continue to use their sparse KV cache-manager route.
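The hybrid cache-manager rule above can be read as a small decision table. The sketch below is illustrative only: `SparseAlgo`, `pick_cache_manager_route`, and the route labels are hypothetical names invented for this example, not identifiers from the PR, and the routing of non-hybrid models is simplified to a single "sparse_route" label.

```python
from enum import Enum
from typing import Optional


class SparseAlgo(Enum):
    """Hypothetical stand-in for the configured sparse attention algorithm."""
    SKIP_SOFTMAX = "skip_softmax"
    OTHER = "other"


def pick_cache_manager_route(is_hybrid: bool,
                             sparse_algo: Optional[SparseAlgo]) -> str:
    """Return a label for the KV cache-manager route (labels are illustrative)."""
    if is_hybrid:
        # Skip Softmax only changes kernel execution, so the recurrent-state
        # (Mamba-capable) cache manager is kept.
        if sparse_algo is None or sparse_algo is SparseAlgo.SKIP_SOFTMAX:
            return "mamba_capable"
        # Any other sparse algorithm on a hybrid model still fails early.
        raise ValueError(
            f"sparse algorithm {sparse_algo.value!r} is not supported on hybrid models")
    if sparse_algo is not None:
        # Non-hybrid sparse models keep their existing sparse route; what that
        # route resolves to per algorithm is outside this sketch.
        return "sparse_route"
    return "default"
```

Raising for unsupported hybrid combinations mirrors the "fail early" behavior described above, so a misconfiguration surfaces at setup time rather than during a forward pass.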
Cosmos3 now passes normalized `timestep` to the transformer/attention path and `raw_timestep` to the Cosmos time embedding path. This keeps scheduler-level/raw-time semantics available to Cosmos while preserving the VisualGen convention that the transformer-forward `timestep` consumed by attention scheduling is normalized. The audio transformer paths added by [TRTLLM-13120][feat] Cosmos3 Audio Output Support #14827 and their single- and multi-GPU tests follow the same contract.
WAN no longer explicitly deletes unused `kwargs`; the forward signature still accepts them for compatibility.
The VisualGen SAGE attention test now passes lowered `sparse_params` into `get_attention_backend()`, matching the sparse backend-selection API introduced by this PR.
Standard TRTLLM attention now enables paged-context FMHA whenever its KV cache uses FP8. The existing C++ attention op uses this mode to select FP8 context FMHA and quantize context QKV before kernel dispatch. The rule applies independently of sparse-attention configuration; MLA keeps its separate context path.
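As a rough illustration of the FP8 rule above, the following sketch expresses it as a pure function. All names here (`ContextFmhaInputs`, `use_paged_context_fmha`, and the field names) are hypothetical and chosen for this example; the sketch also assumes that the MLA case simply leaves the caller's setting untouched, which the description does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextFmhaInputs:
    """Hypothetical bundle of the flags the rule depends on."""
    kv_cache_is_fp8: bool
    is_mla: bool
    paged_context_fmha_requested: bool = False
    # Present only to show that the rule ignores sparse-attention settings.
    sparse_attention_enabled: bool = False


def use_paged_context_fmha(inputs: ContextFmhaInputs) -> bool:
    """Decide whether standard attention runs paged-context FMHA."""
    if inputs.is_mla:
        # Assumption: MLA has its own context path, so this sketch does not
        # override whatever was requested for it.
        return inputs.paged_context_fmha_requested
    # FP8 KV cache forces the mode on; otherwise keep the requested value.
    return inputs.paged_context_fmha_requested or inputs.kv_cache_is_fp8
```

Note that `sparse_attention_enabled` never appears in the decision, which is the point of the "applies independently of sparse-attention configuration" wording.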