HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 #14624
Conversation
I would be happy to get on a call with you to discuss AMD hardware support; my email address can be found on my GitHub page.
@deepsek Thanks for the contribution and for reaching out. On topics related to the CUDA backend, @JohannesGaessler is the best person to consult with. For additional backends, @slaren can provide guidelines and advice. I'll be happy to provide input on any matters as well. I am also available for a call; feel free to contact me.
Very nice to see the initiative. I assume improvements made for CDNA will also carry over to the consumer side next year when UDNA releases, so this is exciting news for the future of AMD products!
This certainly is good news.
Sorry, I wanted to ask: @IMbackK, since you've been working on AMD support, are you interested in joining the discussion?
Yes, certainly. It would help to avoid duplication of effort. I can be reached via email at uvos.xyz, user carl.
Hi @JohannesGaessler, is there any blocker to merging this PR into the main branch?
@deepsek There are a few small things, as discussed. One is better naming for this MFMA path, so that an RDNA WMMA solution can be added later without the naming being strange. Another is using two V_MFMA_I32_16X16X16I8 instructions on gfx908 and gfx90a, even if this path is not chosen for those devices, to ease maintainability. I would also like to try this myself on gfx94x somehow, and I am not sure what the state is with regard to access to AMD's cloud for maintenance of a gfx94x-specific code path; maybe @ggerganov can also comment on that. A problem here is that after CDNA2/gfx90a/MI210, AMD has not made any further CDNA devices in a PCIe add-on board form factor, so outside of acquiring an entire MI300 OAM machine, no one can simply add a CDNA3/gfx94x/MI3xx-compatible card to their system.
@IMbackK,
I tested this PR on my RTX 2060 and don't observe any regressions. @deepsek If you run the …
This is only a partial review because I'm noticing that some calculations on master are wrong in terms of the logic (but correct in terms of the values). My suggestion for this PR is that I fix the incorrect logic on master first and that we then rebase this PR.
My code in mma.cuh implicitly assumes that for int8 and FP16 the raw data layout across threads in a warp is the same. Please tell me if this assumption is wrong for AMD.
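For illustration only (a hypothetical sketch, not the actual mma.cuh types), the assumption amounts to saying that a per-lane 32-bit fragment register holds either FP16 or int8 data with the same lane-to-element mapping, so the tile loading code can be shared between the two element types:

```cpp
#include <cstdint>
#include <cuda_fp16.h>

// Hypothetical illustration of the layout assumption (not mma.cuh code):
// one 32-bit register per lane can be viewed as two FP16 values or four
// int8 values; which matrix elements lane l owns depends only on the
// register index, not on the element type.
union fragment_reg {
    half2   f16;   // 2 x FP16 elements
    int8_t  i8[4]; // 4 x int8 elements
    int32_t raw;   // the raw 32 bits that a shared load path would move
};
static_assert(sizeof(fragment_reg) == 4, "one 32-bit register per lane");
```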
If the instructions are only available on …
The MFMA instruction used, V_MFMA_I32_16X16X32I8, is only available on CDNA3; the other CDNA devices support only K=16, i.e. V_MFMA_I32_16X16X16I8. Obviously it is pretty trivial to use V_MFMA_I32_16X16X32I8 or V_MFMA_I32_16X16X16I8, whichever is available, as the instructions are otherwise the same. The other MFMA instruction used follows the same pattern, with CDNA3 having V_MFMA_I32_32X32X16I8 and CDNA1/2 having only V_MFMA_I32_32X32X8I8. Consumer RDNA3+ GPUs support the same-purpose V_WMMA_I32_16X16X16_IU8, but its semantics and performance characteristics are quite different. I would suggest the obvious AMD_MFMA_AVAILABLE and AMD_WMMA_AVAILABLE, with the AMD_MFMA_AVAILABLE path using V_MFMA_I32_16X16X32I8 or V_MFMA_I32_16X16X16I8 based on the presence of the CDNA3 define.
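For example, a hedged sketch of that fallback (not the PR's actual code; it assumes the clang MFMA builtins behave as documented and that the two K=16 halves are packed K-major into the low/high 32 bits of the 64-bit operands):

```cpp
#include <cstdint>

typedef int int32x4_t __attribute__((ext_vector_type(4)));

// Sketch: 16x16 int8 MFMA over K=32, using the widest instruction the
// target supports. cbsz/abid/blgp modifiers are left at 0.
__device__ int32x4_t mfma_16x16x32_i8(int64_t a, int64_t b, int32x4_t acc) {
#if defined(__gfx940__) || defined(__gfx941__) || defined(__gfx942__)
    // CDNA3: a single V_MFMA_I32_16X16X32I8.
    return __builtin_amdgcn_mfma_i32_16x16x32_i8(a, b, acc, 0, 0, 0);
#else
    // CDNA1/2: two V_MFMA_I32_16X16X16I8 cover the same K range.
    acc = __builtin_amdgcn_mfma_i32_16x16x16i8((int32_t) a,        (int32_t) b,        acc, 0, 0, 0);
    acc = __builtin_amdgcn_mfma_i32_16x16x16i8((int32_t)(a >> 32), (int32_t)(b >> 32), acc, 0, 0, 0);
    return acc;
#endif
}
```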
It seems I misremembered how the code works; the part I thought was wrong is correct after all. I'll continue with the review.
@JohannesGaessler, I've refactored the code per your suggestions in the comments above.
- Added Matrix core support (MFMA instructions) for the MMQ kernels.
- Enabled stream-K for CDNA3 to work with the MMQ kernels (see the scheduling sketch after this list).
- Removed usage of the hardcoded WARP_SIZE constant in the MMQ kernels.
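For context, a minimal illustration of the stream-K idea (scheduling arithmetic only; the names are made up and this is not the kernel's actual code): the flattened K-iteration space of all output tiles is split evenly across CTAs, so a tile whose K loop straddles two CTAs is finished by a fix-up reduction of the partial sums.

```cpp
#include <cstdio>

// Stream-K scheduling illustration: split total_tiles * iters_per_tile
// iterations evenly across CTAs, regardless of tile boundaries.
int main() {
    const int total_tiles = 7, iters_per_tile = 8, num_ctas = 4;
    const int total_iters = total_tiles * iters_per_tile;
    for (int cta = 0; cta < num_ctas; ++cta) {
        const int begin = ( cta      * total_iters) / num_ctas;
        const int end   = ((cta + 1) * total_iters) / num_ctas;
        // Tiles touched by this CTA; tiles split across a boundary yield
        // partial sums that a later fix-up pass combines.
        printf("CTA %d: iterations [%2d, %2d) -> tiles %d..%d\n",
               cta, begin, end, begin / iters_per_tile, (end - 1) / iters_per_tile);
    }
    return 0;
}
```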
NOTE: Any thoughts on removing all uses of hardcoded constants specific to NVIDIA (like WARP_SIZE) in order to support other GPUs? See the sketch below.
@JohannesGaessler @ggerganov
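As one hedged possibility (the helper name is made up, and it assumes the HIP compiler macro __AMDGCN_WAVEFRONT_SIZE__, which reports 64 on GCN/CDNA and 32 on RDNA by default):

```cpp
// Hypothetical helper replacing a hardcoded WARP_SIZE == 32:
static constexpr int physical_warp_size() {
#if defined(__AMDGCN_WAVEFRONT_SIZE__)
    return __AMDGCN_WAVEFRONT_SIZE__; // 64 on GCN/CDNA, 32 on RDNA (default)
#else
    return 32;                        // NVIDIA warps are always 32 threads wide
#endif
}
```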
P.S. I am part of an AMD team actively working on enabling the AMD feature set in llama.cpp. We would like to get on a call to discuss future PR plans for additional backends, flash attention changes, etc.
EDIT: Updated to add some performance charts for the DeepSeekV3 model.
Upstream vs ROCm Fork Development
(performance chart image)
MI300X vs H100 Throughput Test
(performance chart image)