[gfx1250][gemm] Make mxscale B-scale preshuffle tile-independent #679

Draft
aoli26 wants to merge 13 commits into gfx1250/gemm_ptpc from gfx1250/gemm_exp_aoli

aoli26 commented Jun 14, 2026

Contributor

Motivation

Decouple gfx1250 MXScale B-scale preshuffle from GEMM tile shape, and keep A-scale on an independent path instead of tying it to the new B-scale layout.

Technical Details

  • Add tile-independent N4K4 B-scale preshuffle for FP8/A8W4/FP4 MXScale kernels.
  • Add kernel-side N4K4 B-scale LDS load support, including b32/b64/b128 raw LDS loads.
  • A-scale can either remain on the normal path or be loaded separately via buffer load.
  • Update host-side scale preshuffle helpers, test config selection, CLI args, and benchmark/graph plumbing for the new paths.
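
To illustrate what "tile-independent" can mean for the B-scale preshuffle, here is a minimal host-side sketch. It is not taken from this PR: the function names `preshuffle_n4k4` and `unshuffle_n4k4` are hypothetical, and the layout is an assumption, namely that "N4K4" stores each 4 (N) x 4 (K) block of scales contiguously, with blocks ordered by N-block and then K-block. The actual layout used by the kernels may differ.

```python
# Hypothetical sketch: the exact N4K4 layout is NOT specified in this PR.
# Assumption: scales are a row-major [n, k] matrix (flat list), and each
# 4x4 (N x K) block is made contiguous, ordered by (n_block, k_block).
# The block size is fixed at 4x4, so no GEMM tile shape appears anywhere.

BLOCK = 4  # assumed fixed N and K block extent


def preshuffle_n4k4(scales, n, k):
    """Reorder a flat row-major [n, k] scale list into contiguous 4x4 blocks."""
    if n % BLOCK or k % BLOCK:
        raise ValueError("n and k must be multiples of 4 in this sketch")
    if len(scales) != n * k:
        raise ValueError("scales must contain exactly n * k entries")
    out = []
    for nb in range(n // BLOCK):          # block row along N
        for kb in range(k // BLOCK):      # block column along K
            for ni in range(BLOCK):       # row inside the block
                for ki in range(BLOCK):   # column inside the block
                    row = nb * BLOCK + ni
                    col = kb * BLOCK + ki
                    out.append(scales[row * k + col])
    return out


def unshuffle_n4k4(shuffled, n, k):
    """Inverse of preshuffle_n4k4: restore the row-major [n, k] order."""
    if n % BLOCK or k % BLOCK:
        raise ValueError("n and k must be multiples of 4 in this sketch")
    out = [None] * (n * k)
    idx = 0
    for nb in range(n // BLOCK):
        for kb in range(k // BLOCK):
            for ni in range(BLOCK):
                for ki in range(BLOCK):
                    out[(nb * BLOCK + ni) * k + kb * BLOCK + ki] = shuffled[idx]
                    idx += 1
    return out
```

Because the only constants in this sketch are the 4x4 block extents, the same preshuffled buffer could in principle be consumed by kernels compiled with different tile shapes, which is the property the Motivation section describes.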

Test Plan

python3 -m pytest tests/kernels/test_gemm_fp8fp4_gfx1250.py

Test Result

All tests passed.

Submission Checklist

aoli26 added 12 commits June 10, 2026 14:22
- runtime: capture cluster kernels into hipGraph nodes via
  hipGraphAddKernelNode, relying on baked-in amdgpu-cluster-dims
  (graph-node cluster attribute is unsupported on this HIP build)
- test: parametrize test_mxscale_gemm_cudagraph with cluster cases;
  run functional test through hipGraph (graph vs ref) when --use-graph
  is set, covering --use-graph and --benchmark --use-graph --verify
aoli26 force-pushed the gfx1250/gemm_exp_aoli branch from 7038717 to dc49c0e on June 15, 2026 14:19