add tileN = 8,16 for SM120 blockscale GEMM. #3292
|
Hi @depaulmillz, I was wondering if you could take a look at this PR, since I noticed you were the last one to add TileN = 32. Thanks! |
|
Awesome. Have you been able to try with group GEMM as well? |
|
@depaulmillz Yes. Technically, it works (this PR is also compatible with the group GEMM changes). But when testing two common cases, for example DSR1 TP = 8 and Qwen-3 MoE TP = 1, the speedup is only 3-5% at BS = 1. So it's faster (as expected), but not by much. |
|
It looks like you will need to add an assertion to prevent compiling ping-pong with MMA_N=8, which will expect a (2,2,1) layout shape for the MMA. I saw some ref check errors when testing the MR on ping-pong MMA_N=8 kernels because of this. |
|
@depaulmillz Just added a guard for it. Could you give it a try now?
It will be for use with SwapAB.
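For readers following along, the kind of compile-time guard discussed above could look roughly like the sketch below. This is not the code from the PR: `CooperativeSchedule`, `PingpongSchedule`, `is_supported_config`, and `BlockscaledMainloopSketch` are hypothetical stand-ins invented for illustration, and the only rule encoded is the one mentioned in the thread (ping-pong combined with MMA_N = 8 should fail to compile).

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical schedule tags. They are stand-ins for illustration only,
// not the actual kernel schedule types used by the library.
struct CooperativeSchedule {};
struct PingpongSchedule {};

// Returns false only for the combination flagged in the review:
// a ping-pong schedule together with MMA_N == 8.
template <class Schedule, int MmaN>
constexpr bool is_supported_config() {
  return !(std::is_same_v<Schedule, PingpongSchedule> && MmaN == 8);
}

// Hypothetical mainloop wrapper. Instantiating it with an unsupported
// combination stops compilation with a readable message instead of
// producing a kernel that fails reference checks at run time.
template <class Schedule, int MmaN>
struct BlockscaledMainloopSketch {
  static_assert(is_supported_config<Schedule, MmaN>(),
                "ping-pong schedule with MMA_N = 8 is not supported");
  static constexpr int kMmaN = MmaN;
};

// Combinations that should still compile under this sketch.
static_assert(BlockscaledMainloopSketch<CooperativeSchedule, 8>::kMmaN == 8);
static_assert(BlockscaledMainloopSketch<PingpongSchedule, 16>::kMmaN == 16);

// Uncommenting the next line would trigger the static_assert above:
// using Bad = BlockscaledMainloopSketch<PingpongSchedule, 8>;
```

Putting the check in a `static_assert` means an unsupported configuration is rejected at build time, which matches the request for an assertion that prevents compiling it at all.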