[Kernel] CUTLASS grouped gemm fp8 MoE kernel #13972

ElizaWszola · 2025-02-27T15:52:58Z

CUTLASS implementation of fp8 MoE kernel.

Tested with

from vllm import LLM

# Load the FP8 DeepSeek-Coder-V2-Lite checkpoint across two GPUs.
llm = LLM(model="nm-testing/DeepSeek-Coder-V2-Lite-Instruct-FP8",
          trust_remote_code=True,
          tensor_parallel_size=2,
)

Benchmark (DeepSeek V2 Lite, total time of 25 runs, times in ms)

Quant Matmul, 1 thread: nm-testing/deepseekv2-lite, num_experts=64, topk=6, per_act_token=False, per_out_ch=False

MKN                |  triton_moe  |  triton_moe_cuda_graphs  |  grouped_gemm_moe  |  grouped_gemm_moe_cuda_graphs
-------------------|--------------|--------------------------|--------------------|------------------------------
(1, 2048, 1408)    |      3.6     |            2.6           |         3.6        |               3.3
(4, 2048, 1408)    |      6.8     |            6.7           |         4.7        |               4.3
(8, 2048, 1408)    |     10.1     |           10.0           |         5.6        |               5.1
(16, 2048, 1408)   |     15.0     |           14.9           |         6.8        |               6.3
(32, 2048, 1408)   |     16.9     |           16.8           |         7.3        |               6.9
(64, 2048, 1408)   |     17.0     |           16.9           |         7.6        |               7.1
(128, 2048, 1408)  |      8.5     |            8.4           |         8.1        |               7.6
(256, 2048, 1408)  |      9.1     |            9.0           |         9.0        |               8.5
(512, 2048, 1408)  |     10.9     |           10.8           |        10.6        |              10.1
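To make the comparison easier to read, the relative speedup of the grouped GEMM path over the Triton path can be computed from the numbers above. The snippet below is only a convenience sketch: the values are copied from the benchmark table, and the names `ROWS` and `speedups` are made up for this illustration and are not part of the PR.

```python
# Total times in ms over 25 runs, copied from the benchmark table above.
# Columns: M, triton_moe, triton_moe_cuda_graphs,
#          grouped_gemm_moe, grouped_gemm_moe_cuda_graphs
ROWS = [
    (1, 3.6, 2.6, 3.6, 3.3),
    (4, 6.8, 6.7, 4.7, 4.3),
    (8, 10.1, 10.0, 5.6, 5.1),
    (16, 15.0, 14.9, 6.8, 6.3),
    (32, 16.9, 16.8, 7.3, 6.9),
    (64, 17.0, 16.9, 7.6, 7.1),
    (128, 8.5, 8.4, 8.1, 7.6),
    (256, 9.1, 9.0, 9.0, 8.5),
    (512, 10.9, 10.8, 10.6, 10.1),
]


def speedups(rows):
    """Return {M: (eager_speedup, cuda_graph_speedup)} of grouped GEMM over Triton."""
    return {m: (t / g, tg / gg) for m, t, tg, g, gg in rows}


if __name__ == "__main__":
    for m, (eager, graphs) in speedups(ROWS).items():
        print(f"M={m:>3}: eager {eager:.2f}x, CUDA graphs {graphs:.2f}x")
```

A ratio above 1 means the grouped GEMM kernel took less time. In the table, the only case where a Triton variant is faster is M=1 with CUDA graphs (2.6 ms against 3.3 ms).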

Signed-off-by: ElizaWszola <[email protected]>

Co-authored-by: Lucas Wilkinson <[email protected]>


github-actions · 2025-02-27T15:53:09Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify · 2025-02-27T15:53:41Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ElizaWszola.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


ElizaWszola and others added 30 commits December 6, 2024 14:36

Cutlass grouped gemm files

1825ef8

Signed-off-by: ElizaWszola <[email protected]>

runs, bad result

5fd48e5

Signed-off-by: ElizaWszola <[email protected]>

A little closer to working

d5942cf

Signed-off-by: ElizaWszola <[email protected]>

Working for identical sizes

c570c69

Signed-off-by: ElizaWszola <[email protected]>

Grouped gemm working

6ed63f2

Co-authored-by: Lucas Wilkinson <[email protected]>
Signed-off-by: ElizaWszola <[email protected]>

Small cleanup

e2b1fc0

Signed-off-by: ElizaWszola <[email protected]>

Merge branch 'main' into grouped-gemm-with-group-id

dd163f5

Signed-off-by: ElizaWszola <[email protected]>

Benchmark grouped cutlass against bfloat16 torch.mm

acfd3ef

Signed-off-by: ElizaWszola <[email protected]>

Merge branch 'main' into grouped-gemm-with-group-id

c6231b6

Signed-off-by: ElizaWszola <[email protected]>

Start working on fused moe cutlass implementation

f1a5666

Signed-off-by: ElizaWszola <[email protected]>

Working halfway

6414e31

Signed-off-by: ElizaWszola <[email protected]>

working mul test but the topk_weights are not yet included in kernel

67e2dd4

Signed-off-by: ElizaWszola <[email protected]>

cleaned up cutlass moe test, fixes

6523529

Signed-off-by: ElizaWszola <[email protected]>

benchmark fused

b302d98

Signed-off-by: ElizaWszola <[email protected]>

pass input as one tensor with an array of offsets rather than a list of tensors

342d1a4

Signed-off-by: ElizaWszola <[email protected]>
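The commit above replaces a list of per-expert tensors with a single tensor plus an array of offsets. The exact layout used in the PR is not shown in this thread, so the following is only a minimal pure-Python sketch of one common way to do this, assuming a CSR-style offsets array with `num_experts + 1` entries. The names `pack_by_expert` and `grouped_matmul` are hypothetical and do not come from the PR.

```python
def pack_by_expert(rows, expert_ids, num_experts):
    """Group token rows by expert and return (flat_rows, offsets).

    Assumed layout: expert e owns flat_rows[offsets[e]:offsets[e + 1]].
    """
    # Count how many rows are routed to each expert.
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1

    # Prefix sum of the counts gives the start of each expert's slice.
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)

    # Scatter each row into the next free slot of its expert's slice.
    cursor = offsets[:-1].copy()
    flat = [None] * len(rows)
    for row, e in zip(rows, expert_ids):
        flat[cursor[e]] = row
        cursor[e] += 1
    return flat, offsets


def grouped_matmul(flat, offsets, weights):
    """Multiply each expert's slice of rows by that expert's K x N weight matrix."""
    out = []
    for e, w in enumerate(weights):
        for row in flat[offsets[e]:offsets[e + 1]]:
            # zip(*w) iterates over the columns of the K x N matrix.
            out.append([sum(a * b for a, b in zip(row, col)) for col in zip(*w)])
    return out
```

With this layout, every expert's problem size can be read off as `offsets[e + 1] - offsets[e]`, so a single flat buffer and one small integer array describe all groups, including experts that received no rows.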

Using tensors rather than tensor lists works with test_cutlass test

7549e3d

Signed-off-by: ElizaWszola <[email protected]>

Merge branch 'main' into grouped-gemm-with-group-id

64c2a68

Signed-off-by: ElizaWszola <[email protected]>

cleanup, add import

1ea7874

Signed-off-by: ElizaWszola <[email protected]>

working fused op

d608164

Signed-off-by: ElizaWszola <[email protected]>

benchmark, create strides directly on device, small name refactor

286f6c8

Signed-off-by: ElizaWszola <[email protected]>

works with cuda graphs

b6867bb

Signed-off-by: ElizaWszola <[email protected]>

move stride tensor creation outside c++ code, cleanup

df04bc0

Signed-off-by: ElizaWszola <[email protected]>

cleanup benchmark

88c7134

Signed-off-by: ElizaWszola <[email protected]>

profile

02e1d4e

Signed-off-by: ElizaWszola <[email protected]>

tuned shapes, fix

1d9c429

Signed-off-by: ElizaWszola <[email protected]>

Merge branch 'main' into grouped-gemm-with-group-id

b824ad2

Signed-off-by: ElizaWszola <[email protected]>

Performance, add channelwise scales everywhere

ae90eee

Signed-off-by: ElizaWszola <[email protected]>

name fix

f191b35

Signed-off-by: ElizaWszola <[email protected]>

Merge branch 'main' into grouped-gemm-with-group-id

22d4f7b

perf improvements in data preparation

51941ff

Signed-off-by: ElizaWszola <[email protected]>

ElizaWszola added 6 commits February 24, 2025 15:28

Integrate with deepseek v2

d3cf1db

Signed-off-by: ElizaWszola <[email protected]>

cudagraphs fix

175ecdd

Signed-off-by: ElizaWszola <[email protected]>

Merge branch 'main' into grouped-gemm-with-group-id

3d7a487

Signed-off-by: ElizaWszola <[email protected]>

larger index type to support very large batches

ec0cb94

Signed-off-by: ElizaWszola <[email protected]>

update benchmarks

6dd6d48

Signed-off-by: ElizaWszola <[email protected]>

Faster data preparation kernels, bring back correct benchmark shapes

716d8c0

Signed-off-by: ElizaWszola <[email protected]>

ElizaWszola requested review from tlrmchlsmth, WoosukKwon, mgoin and robertgshaw2-redhat as code owners February 27, 2025 15:52

mergify bot added the ci/build label Feb 27, 2025

mergify bot added the needs-rebase label Feb 27, 2025

enable cutlass grouped gemm only on sm90

975ab5f

Signed-off-by: ElizaWszola <[email protected]>
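The commit above restricts the CUTLASS grouped GEMM path to SM90. How the PR implements that restriction (at build time, at runtime, or both) is not shown in this thread. As a rough illustration only, a runtime check on a `(major, minor)` compute capability tuple could look like the sketch below; `is_sm90` is a hypothetical helper, and in practice the tuple might come from `torch.cuda.get_device_capability()`.

```python
def is_sm90(capability):
    """Return True when a (major, minor) compute capability tuple denotes SM90."""
    major, minor = capability
    # SM numbers are conventionally written as major * 10 + minor, so 9.0 is SM90.
    return major * 10 + minor == 90
```

A caller could use such a check to fall back to another MoE implementation on devices that do not report compute capability 9.0.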
