- Enabled mxfp4 on gfx950, including GEMM, MoE, and per-1x32 quantization
- Enabled multi-GPU tuning for most kinds of GEMMs
- fp8 all-reduce
- A number of new Triton kernels (for example Topk, Softmax, and RMSNorm backward)
What's Changed
- [TRITON] Add Triton Topk Kernel by @hubertlu-tw in #458
- Find executable in rocm home when not found in PATH by @xli in #549
- [TRITON]: Disable int4 moe UT by @rahulbatra85 in #563
- add a4w4 asm_moe by @valarLip in #482
- Improved detection of setup.py install by @ekuznetsov139 in #534
- Disable mha related modules in prebuild by @slippedJim in #567
- Fix format error in .clang-format by @poyenc in #568
- update pa asm by @amd-ruitang3 in #553
- [TRITON]: Reorg mha code and use common fp8 type by @rahulbatra85 in #561
- [TRITON]: Gemm refactor by @rahulbatra85 in #558
- [Triton]: Add has_attr check in get_config by @rahulbatra85 in #572
- [TRITON]: GEMM updates for DS by @rahulbatra85 in #573
- update_codegen by @amd-ruitang3 in #581
- mi350_pa by @amd-ruitang3 in #579
- Change input tensor format to [B,S,H,d] and add batch support for causal by @valechen in #578
- update tune config file by @solinzby1 in #569
- [TRITON] Add RMSNorm bwd Triton Kernels by @lucas-santos-amd in #576
- fix prebuild by @junhaha666 in #592
- [TRITON]: Quantization updates(add int8 and use common fp8 dtypes) by @rahulbatra85 in #588
- Dispatch combine by @junhaha666 in #571
- update args by @amd-ruitang3 in #590
- Pa rocm refresh4 by @fsx950223 in #591
- [update]: update all-reduce by @TennyWang1223 in #552
- Fix compile error in MI350 with ROCm7 by @rocking5566 in #599
- new codegen for elementwise by @TennyWang1223 in #585
- [fix]: elementwise prebuild slow by @TennyWang1223 in #609
- [TRITON]: Fp4gemm m=256 tuning by @Chi-Chu319 in #533
- add MI350 support for skinny_gemm by @yanguahe in #602
- Fix prebuild 350 by @junhaha666 in #608
- [fix]: change ar namespace by @TennyWang1223 in #611
- compile flag clean up by @valarLip in #615
- DIY_args by @amd-ruitang3 in #596
- fix NUM_Q_HEADS - 1 in remap_xcd in _attn_fwd by @juuso-oskari in #612
- add ck gemm a4w4 blockscale with splitK support by @ukannika-amd in #603
- [TRITON]: pid grid fix by @Chi-Chu319 in #618
- Refine ck instance and update a8w8_bpreshuffle_tuned_gemm.csv by @solinzby1 in #621
- merge moe from 350 launch by @lalala-sh in #580
- Remove seqlen limit on FA fwd kernel by @slippedJim in #622
- [Triton] RoPE dev by @k50112113 in #606
- [TRITON]: Fix num_warps typo which was causing performance issues by @valechen in #604
- Topksoftmax_opt by @junhaha666 in #626
- update hip quant for corner case by @valarLip in #633
- [TRITON]: use int64 strides by default for MHA by @rahulbatra85 in #634
- [TRITON]: Standardize GEMM weight shape to (N, K) and TN memory layout (by default) by @willzhou-amd in #597
- [TRITON] Add Softmax Triton Kernel by @lucas-santos-amd in #605
- Enable gfx942 FA fwd asm kernels by @slippedJim in #619
- Update CK by @poyenc in #635
- Fix error message for rocminfo by @Rohan138 in #636
- [TRITON]: Moe tuning mi350 by @Chi-Chu319 in #610
- Fix test_pa_ragged.py use_alibi=True test cases by @poyenc in #639
- Fix FA fwd nan issue by @slippedJim in #646
- fix for fp8 e4m3fn by @valarLip in #640
- [TRITON]: Kernel benchmarking improvements (for op_benchmarks/triton) by @willzhou-amd in #594
- [Triton]: Disable fused+causal for MHA bkwd by @rahulbatra85 in #642
- enable parallel tuning on CK kernels by @yzhou103 in #625
- Pa fix2 by @fsx950223 in #645
- Update dependencies and add backup for unknown hw by @kunaltyagi in #623
- Optimize topksoftmax WARPS_PER_TB for higher occupancy and remove redundant precision conversion by @CuiCu-618 in #652
New Contributors
- @hubertlu-tw made their first contribution in #458
- @xli made their first contribution in #549
- @ekuznetsov139 made their first contribution in #534
- @valechen made their first contribution in #578
- @willzhou-amd made their first contribution in #597
- @Rohan138 made their first contribution in #636
- @yzhou103 made their first contribution in #625
- @kunaltyagi made their first contribution in #623
- @CuiCu-618 made their first contribution in #652
Full Changelog: v0.1.3...v0.1.4