Conversation

shifangx
Contributor

@shifangx commented Jul 30, 2025

This PR supports NVFP4 low-latency mode dispatch.
We use the following message package format while dispatching tokens.
[Image: message package format used for NVFP4 low-latency dispatch]

@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 2 times, most recently from fe83c6c to 0a7f43e on July 30, 2025 14:29
@shifangx changed the title from "spport NVFP4 for low latency mode dispatch" to "Support nvfp4 low latency mode dispatch" on Jul 30, 2025
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch from 0a7f43e to 5cd59de on August 22, 2025 10:18
@ishandhanani

@shifangx - can you explain how to build this from source?

@shifangx
Contributor Author

@shifangx - can you explain how to build this from source?

Hello, @ishandhanani, thank you for your attention to our work.
The build method for this PR is the same as that of the main branch.
Did you encounter any issues during the build process?

@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 7 times, most recently from c358fd5 to 1be895a on August 29, 2025 06:45
@DoubleClark

May I ask about the quantization method of the FP4 model? It seems you use 16 elements per group instead of 128 to reduce accuracy loss, but I still wonder about the quantization method and how its performance compares with the original FP8 model. Also, may I ask about the computation type of the following GEMM? It seems the activation and weight are FP4; on Blackwell the result might be accumulated in FP32, so how does this work with an 8-bit scale? Is it possible to make this work on Hopper (FP4 dequantized to FP8 might require some scale transform)? If you could share the FP4 GEMM application on Hopper and Blackwell, it would be a great help. Thanks.

@shifangx
Contributor Author

shifangx commented Aug 29, 2025

May I ask about the quantization method of the FP4 model? It seems you use 16 elements per group instead of 128 to reduce accuracy loss, but I still wonder about the quantization method and how its performance compares with the original FP8 model. Also, may I ask about the computation type of the following GEMM? It seems the activation and weight are FP4; on Blackwell the result might be accumulated in FP32, so how does this work with an 8-bit scale? Is it possible to make this work on Hopper (FP4 dequantized to FP8 might require some scale transform)? If you could share the FP4 GEMM application on Hopper and Blackwell, it would be a great help. Thanks.

Hi @DoubleClark, if you are interested in FP4 inference and training, perhaps these blogs can provide some help.

  • For inference:

https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/

  • For training:

https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/
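
As a rough arithmetic sketch of how an 8-bit (E4M3) block scale and a per-tensor FP32 scale combine with FP4 (E2M1) values, assuming the publicly documented NVFP4 recipe rather than this PR's kernels (the function and argument names below are illustrative only):

import torch

def nvfp4_dequant_sketch(q_e2m1: torch.Tensor,           # FP4 values held in float for illustration
                         block_scale_e4m3: torch.Tensor,  # one FP8 (E4M3) scale per 16-element group
                         global_scale: torch.Tensor       # per-tensor fp32 scale
                         ) -> torch.Tensor:
    # Reconstruction encoded by the scales: x_hat = q * s_block / s_global.
    # This is the arithmetic the scales represent, independent of where the
    # hardware applies it; the MMA itself accumulates in higher precision.
    s = block_scale_e4m3.float().repeat_interleave(16, dim=-1)
    return q_e2m1.float() * s / global_scale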

@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 4 times, most recently from dad206a to 0cfe452 on September 1, 2025 03:30
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch from cb1757a to 9d9e395 on September 3, 2025 10:01
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 6 times, most recently from bf1f716 to 5deac0f on September 6, 2025 05:30
change from x_sf_scale to x_global_scales.
change from use_ue8m0_for_sf to use_ue8m0_for_nvfp4_x_scale.
set x_scale dtype to torch::kFloat8_e4m3fn if use_ue8m0_for_nvfp4_x_scale == False, and to torch::kUInt8 if use_ue8m0_for_nvfp4_x_scale == True.
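
A minimal Python-side sketch of the dtype choice described in the note above (the helper name is hypothetical, not this PR's API):

import torch

def x_scale_dtype(use_ue8m0_for_nvfp4_x_scale: bool) -> torch.dtype:
    # E4M3 block scales can be exposed directly as float8_e4m3fn tensors; UE8M0
    # scales (power-of-two exponent bytes) have no dedicated torch dtype, so raw
    # uint8 is used instead.
    return torch.uint8 if use_ue8m0_for_nvfp4_x_scale else torch.float8_e4m3fn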
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 6 times, most recently from 63ad6b4 to 8cc65fd on September 9, 2025 15:30
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch from 8cc65fd to d89a25b on September 10, 2025 09:43
// Split the packed scale index into the block dimension and the position within the pack.
const auto dim1_offset = j / num_elems_per_pack;
const auto dim4_offset = j % num_elems_per_pack;
// Load the j-th scale from the source buffer.
auto scale = ld_nc_global(src_scales + j);
// Flatten the 5-D index into the per-expert recv_x_scales buffer (the innermost dimension is contiguous).
const auto offset = dim0_offset * dim0_stride + dim1_offset * dim1_stride + dim2_offset * dim2_stride + dim3_offset * dim3_stride + dim4_offset;
Contributor

qq: looks like the physical layout is 6D, thus curious why we only have 5 dim here

Contributor Author

Thanks for your kind review.

recv_x_scales[offset] = scale;
recv_x_scales is only for one expert, so its layout is 5D.

Contributor

@fzyzcjy Sep 10, 2025

oh i see, looks reasonable
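
To illustrate the 6-D vs. 5-D point from the exchange above, a tiny sketch with made-up dimension sizes (not the PR's actual shapes): the full scale buffer spans all local experts, and slicing out one expert leaves the 5-D view that the kernel indexes with five offsets.

import torch

num_local_experts, d0, d1, d2, d3, d4 = 2, 3, 4, 5, 6, 4   # hypothetical sizes
all_scales = torch.empty(num_local_experts, d0, d1, d2, d3, d4, dtype=torch.uint8)
recv_x_scales = all_scales[0]    # one expert's slice of the buffer
assert recv_x_scales.dim() == 5  # hence only five offsets in the kernel snippet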

@shifangx
Contributor Author

For the FP4 quantization, this PR refers to cvt_warp_fp16_to_fp4 in
https://github.com/flashinfer-ai/flashinfer/blob/88e333e038c6fce317e959261f165a0618fa4f3c/csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh#L402

For the scale layout and shape, this PR refers to test_quantize_to_fp4_grouped
in https://github.com/sgl-project/sglang/blob/b0d25e72c401f37b55d689ddbf05b8c583afe854/sgl-kernel/tests/test_fp4_quantize.py#L178
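
For readers who want a host-side picture of the referenced recipe, a hedged sketch of per-16-element NVFP4-style quantization (names and the exact rounding are illustrative; the actual implementation is the CUDA cvt_warp_fp16_to_fp4 kernel linked above):

import torch

E2M1_MAX = 6.0     # largest magnitude representable in FP4 (E2M1)
E4M3_MAX = 448.0   # largest magnitude representable in FP8 (E4M3)

def nvfp4_quant_sketch(x: torch.Tensor, group_size: int = 16):
    # Quantize a [rows, cols] float tensor into E2M1-range values with one E4M3
    # scale per 16-element group, following the commonly published NVFP4 recipe.
    rows, cols = x.shape
    x = x.float()
    # Per-tensor global scale keeps the per-group scales inside the E4M3 range.
    global_scale = (E4M3_MAX * E2M1_MAX) / x.abs().max().clamp(min=1e-12)
    groups = x.view(rows, cols // group_size, group_size)
    # One scale per group, folded with the global scale and rounded to E4M3.
    block_scale = (groups.abs().amax(dim=-1, keepdim=True) / E2M1_MAX) * global_scale
    block_scale_e4m3 = block_scale.to(torch.float8_e4m3fn)
    # Scale elements into the E2M1 range; a real kernel would now round to E2M1.
    q = (groups * global_scale / block_scale_e4m3.float().clamp(min=1e-12)).clamp(-E2M1_MAX, E2M1_MAX)
    return q.view(rows, cols), block_scale_e4m3.view(rows, cols // group_size), global_scale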

@fzyzcjy
Contributor

fzyzcjy commented Sep 12, 2025

The accuracy issue is fixed now.

@kaixih

kaixih commented Oct 2, 2025

@shifangx Anything blocking this merge?

@shifangx changed the base branch from main to hybrid-ep on October 7, 2025 03:46
@shifangx
Contributor Author

shifangx commented Oct 7, 2025

@shifangx Anything blocking this merge?

I will merge #341 into the hybrid-ep branch after my vacation.
The reason the NVFP4 dispatch PR cannot be merged into main is that the NVFP4 recipe might change in the future.
