HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 #14624
Conversation
I would be happy to get on a call with you to discuss AMD hardware support; my email address can be found on my GitHub page.
@deepsek Thanks for the contribution and for reaching out. On topics related to the CUDA backend, @JohannesGaessler is the best person to consult with. For additional backends, @slaren can provide guidelines and advice. I'll be happy to provide input on any matters as well. I am also available for a call; feel free to contact me.
Very nice to see the initiative. I assume improvements made for CDNA will also carry over to the consumer side next year when UDNA releases, so this is exciting news for the future of AMD products!
This certainly is good news.
Sorry, I wanted to ask: @IMbackK, since you've been working on AMD support, are you interested in joining the discussion?
Yes, certainly. It would help to avoid duplication of effort. I can be reached via email at uvos.xyz, user carl.
Hi @JohannesGaessler, is there any blocker to merging this PR into the main branch?
@deepsek There are a few small things, as discussed. One is better naming for this MFMA path, so that an RDNA WMMA solution can be added later without the naming being strange. Another is using two V_MFMA_I32_16X16X16I8 instructions on gfx908 and gfx90a, even if this path is not chosen for those devices, to ease maintainability. I would also like to try this myself on gfx94x somehow, and I am not sure what the state is with regard to access to AMD's cloud for maintenance of a gfx94x-specific code path; maybe @ggerganov can also comment on that. A problem here is that after CDNA2/gfx90a/MI210, AMD has not made any further CDNA devices in a PCIe add-on board form factor, so outside of acquiring an entire MI300 OAM machine, no one can simply add a CDNA3/gfx94x/MI3xx-compatible card to their system.
@IMbackK,
I tested this PR on my RTX 2060 and don't observe any regressions. @deepsek If you run the …
This is only a partial review because I'm noticing that some calculations on master are wrong in terms of the logic (but correct in terms of the values). My suggestion for this PR is that I fix the incorrect logic on master first and that we then rebase this PR.
My code in mma.cuh implicitly assumes that for int8 and FP16 the raw data layout across threads in a warp is the same. Please tell me if this assumption is wrong for AMD.
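For illustration only (a hypothetical sketch, not the actual mma.cuh types), the assumption amounts to saying that a per-lane 32-bit fragment register holds either FP16 or int8 data with the same lane-to-element mapping, so the tile loading code can be shared between the two element types:

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical illustration of the layout assumption (not mma.cuh code):
// one 32-bit register per lane can be viewed as two FP16 values or four
// int8 values; which matrix elements lane l owns depends only on the
// register index, not on the element type.
union fragment_reg {
    half2   f16;   // 2 x FP16 elements
    int8_t  i8[4]; // 4 x int8 elements
    int32_t raw;   // the raw 32 bits that a shared load path would move
};
static_assert(sizeof(fragment_reg) == 4, "one 32-bit register per lane");
```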
If the instructions are only available on …
The MFMA instruction used, V_MFMA_I32_16X16X32I8, is only available on CDNA3; the other CDNA devices support only K=16, i.e. V_MFMA_I32_16X16X16I8. Obviously it is pretty trivial to use V_MFMA_I32_16X16X32I8 or V_MFMA_I32_16X16X16I8, whichever is available, as the instructions are otherwise the same. The other MFMA instruction used follows the same pattern, with CDNA3 having V_MFMA_I32_32X32X16I8 and CDNA1/2 having only V_MFMA_I32_32X32X8I8. Consumer RDNA3+ GPUs support the same-purpose V_WMMA_I32_16X16X16_IU8, but its semantics and performance characteristics are quite different. I would suggest the obvious AMD_MFMA_AVAILABLE and AMD_WMMA_AVAILABLE, with the AMD_MFMA_AVAILABLE path using V_MFMA_I32_16X16X32I8 or V_MFMA_I32_16X16X16I8 based on the presence of the CDNA3 define.
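For example, a hedged sketch of that fallback (not the PR's actual code; it assumes the clang MFMA builtins behave as documented and that the two K=16 halves are packed K-major into the low/high 32 bits of the 64-bit operands):

```cpp
#include <cstdint>

typedef int int32x4_t __attribute__((ext_vector_type(4)));

// Sketch: 16x16 int8 MFMA over K=32, using the widest instruction the
// target supports. cbsz/abid/blgp modifiers are left at 0.
__device__ int32x4_t mfma_16x16x32_i8(int64_t a, int64_t b, int32x4_t acc) {
#if defined(__gfx940__) || defined(__gfx941__) || defined(__gfx942__)
    // CDNA3: a single V_MFMA_I32_16X16X32I8.
    return __builtin_amdgcn_mfma_i32_16x16x32_i8(a, b, acc, 0, 0, 0);
#else
    // CDNA1/2: two V_MFMA_I32_16X16X16I8 cover the same K range.
    acc = __builtin_amdgcn_mfma_i32_16x16x16i8((int32_t) a,        (int32_t) b,        acc, 0, 0, 0);
    acc = __builtin_amdgcn_mfma_i32_16x16x16i8((int32_t)(a >> 32), (int32_t)(b >> 32), acc, 0, 0, 0);
    return acc;
#endif
}
```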
It seems I misremembered how the code works; the part I thought was wrong is correct after all. I'll continue with the review.
@JohannesGaessler, I've refactored the code per your suggestions in the comments above.
- Added Matrix core support (MFMA instructions) for the MMQ kernels.
- Enabled stream-K for CDNA3 to work with the MMQ kernels (see the scheduling sketch after this list).
- Removed usage of the hardcoded WARP_SIZE constant in the MMQ kernels.
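For context, a minimal illustration of the stream-K idea (scheduling arithmetic only; the names are made up and this is not the kernel's actual code): the flattened K-iteration space of all output tiles is split evenly across CTAs, so a tile whose K loop straddles two CTAs is finished by a fix-up reduction of the partial sums.

```cpp
#include <cstdio>

// Stream-K scheduling illustration: split total_tiles * iters_per_tile
// iterations evenly across CTAs, regardless of tile boundaries.
int main() {
    const int total_tiles = 7, iters_per_tile = 8, num_ctas = 4;
    const int total_iters = total_tiles * iters_per_tile;
    for (int cta = 0; cta < num_ctas; ++cta) {
        const int begin = ( cta      * total_iters) / num_ctas;
        const int end   = ((cta + 1) * total_iters) / num_ctas;
        // Tiles touched by this CTA; tiles split across a boundary yield
        // partial sums that a later fix-up pass combines.
        printf("CTA %d: iterations [%2d, %2d) -> tiles %d..%d\n",
               cta, begin, end, begin / iters_per_tile, (end - 1) / iters_per_tile);
    }
    return 0;
}
```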
NOTE: Any thoughts on removing all uses of hardcoded constants specific to NVIDIA (like WARP_SIZE) in order to support other GPUs? See the sketch below.
@JohannesGaessler @ggerganov
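As one hedged possibility (the helper name is made up, and it assumes the HIP compiler macro __AMDGCN_WAVEFRONT_SIZE__, which reports 64 on GCN/CDNA and 32 on RDNA by default):

```cpp
// Hypothetical helper replacing a hardcoded WARP_SIZE == 32:
static constexpr int physical_warp_size() {
#if defined(__AMDGCN_WAVEFRONT_SIZE__)
    return __AMDGCN_WAVEFRONT_SIZE__; // 64 on GCN/CDNA, 32 on RDNA (default)
#else
    return 32;                        // NVIDIA warps are always 32 threads wide
#endif
}
```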
P.S. I am part of an AMD team actively working on enabling the AMD feature set in llama.cpp. We would like to get on a call to discuss future PR plans for additional backends, flash attention changes, etc.
EDIT: Updated to add some performance charts for the DeepSeekV3 model.
Upstream vs ROCm Fork Development
(performance chart image)
MI300X vs H100 Throughput Test
(performance chart image)