[FMHA] gfx950 dualwave SWP with split-K, varlen, and arbitrary seq_len #681
Closed
yanguahe wants to merge 2 commits into
- Add flash_attn_dualwave_swp_gfx950_kernel with lazy-rescale, s_setprio stagger, split-K combine path, and buffer_store_dwordx4 O-store
- Support packed QKV varlen via cu_seqlens; arbitrary seq_len >= 1 on both dualwave and generic fallback paths with padding masks
- Update flash_attn_generic dispatch, seq_len guard, and varlen routing
- Extend test_flash_attn_fwd with split-K, varlen configs, OPUS/aiter compare

Ported from opus_align FMHA optimization work onto rocm/main base.

Co-authored-by: Cursor <cursoragent@cursor.com>
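For readers unfamiliar with a split-K combine path, the following is a minimal sketch of the standard log-sum-exp merge of partial attention results, written in plain Python for a single query row. It is not the kernel code from this PR: the function names (`partial_attention`, `combine_partials`, `reference_attention`) are hypothetical, and the sketch only assumes that each K/V split produces a running max, a sum of exponentials, and an unnormalized output, as is usual for flash-attention-style kernels.

```python
import math

def partial_attention(scores, values):
    """Process one K/V chunk: return (row max, sum of exp, unnormalized output)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    l = sum(weights)
    dim = len(values[0])
    o = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, l, o

def combine_partials(partials):
    """Merge per-split (m, l, o) triples into the final normalized output."""
    m_all = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    l_all = 0.0
    o_all = [0.0] * dim
    for m, l, o in partials:
        scale = math.exp(m - m_all)  # rescale each split to the global max
        l_all += scale * l
        for d in range(dim):
            o_all[d] += scale * o[d]
    return [x / l_all for x in o_all]

def reference_attention(scores, values):
    """Unsplit softmax(scores) @ values, for comparison."""
    m, l, o = partial_attention(scores, values)
    return [x / l for x in o]

# Illustrative data (made up): 5 keys, head dim 2, split into 2 chunks.
scores = [0.5, -1.0, 2.0, 0.1, 3.0]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
parts = [
    partial_attention(scores[:2], values[:2]),
    partial_attention(scores[2:], values[2:]),
]
split_out = combine_partials(parts)
full_out = reference_attention(scores, values)
```

The combined result should match the unsplit computation up to floating-point rounding, which is the property a split-K test configuration would be expected to check.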
Collaborator
CI FAILED @yanguahe
The generic flash_attn O-store used permlane32_swap and cvt_pk_bf16_f32 (both gfx950/CDNA4-only) unconditionally. On gfx942 (CDNA3) the gfx950 dualwave fast path is disabled and flash_attn falls back to the generic kernel, so the backend hit "Cannot select intrinsic llvm.amdgcn.permlane32.swap" and aborted (CI: test linux-flydsl-mi325-1).

Gate the 128-bit permlane-fused store behind gfx950; gfx942 falls back to a per-lane dwordx2 store packed via .to(elem_dtype) (arch-correct bf16/f16 conversion, same column layout, still num_records-bounded for OOB rows).

Add FLYDSL_DISABLE_DUALWAVE_SWP / FLYDSL_GENERIC_OSTORE_SCALAR env hooks to exercise the generic kernel and its gfx942 store path on gfx950 hardware.

Verified on gfx950 (MI355): the permlane and scalar O-store paths both give MaxErr 3.91e-3 vs SDPA across H8/16/64, GQA, and partial-seqlen configs; the default gfx950 dualwave path is unchanged (PASS, MaxErr 3.91e-3).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
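The gating described in this commit can be pictured with the small sketch below. It is an illustration only, not the dispatch code from the PR: the function `select_paths`, the returned labels, and the convention that a hook is enabled by any value other than empty or `0` are all assumptions. Only the two environment variable names and the gfx950/gfx942 split come from the commit message, and the real dispatch also applies guards that are not modeled here.

```python
import os

def select_paths(arch, env=None):
    """Return a (kernel, o_store) label pair for a given GPU arch string.

    Hypothetical helper: labels and flag parsing are assumptions.
    """
    env = os.environ if env is None else env

    def flag(name):
        # Assumed convention: unset, "" or "0" means the hook is off.
        return env.get(name, "0") not in ("", "0")

    is_gfx950 = arch.startswith("gfx950")

    # The dualwave fast path exists only on gfx950 and can be disabled by hook.
    if is_gfx950 and not flag("FLYDSL_DISABLE_DUALWAVE_SWP"):
        return ("dualwave_swp", "buffer_store_dwordx4")

    # Generic kernel: the permlane-fused 128-bit store is gated behind gfx950.
    if is_gfx950 and not flag("FLYDSL_GENERIC_OSTORE_SCALAR"):
        return ("generic", "permlane_fused_128bit")

    # gfx942, or gfx950 with the scalar hook set: per-lane dwordx2 store.
    return ("generic", "per_lane_dwordx2")

default_950 = select_paths("gfx950", {})
generic_950 = select_paths("gfx950", {"FLYDSL_DISABLE_DUALWAVE_SWP": "1"})
scalar_950 = select_paths(
    "gfx950",
    {"FLYDSL_DISABLE_DUALWAVE_SWP": "1", "FLYDSL_GENERIC_OSTORE_SCALAR": "1"},
)
default_942 = select_paths("gfx942", {})
```

Under these assumptions, setting both hooks on gfx950 selects the same store path that gfx942 takes by default, which is how the commit describes exercising the gfx942 path on gfx950 hardware.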