Conversation

shifangx
Contributor

@shifangx commented Jul 30, 2025

This PR supports NVFP4 low-latency mode dispatch.
We use the following message package format while dispatching tokens.
[Image: message package format used for NVFP4 low-latency dispatch]

@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 2 times, most recently from fe83c6c to 0a7f43e on July 30, 2025 14:29
@shifangx changed the title from "spport NVFP4 for low latency mode dispatch" to "Support nvfp4 low latency mode dispatch" on Jul 30, 2025
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch from 0a7f43e to 5cd59de on August 22, 2025 10:18
@ishandhanani

@shifangx - can you explain how to build this from source?

@shifangx
Contributor Author

@shifangx - can you explain how to build this from source?

Hello, @ishandhanani, thank you for your attention to our work.
The build method for this PR is the same as that of the main branch.
Did you encounter any issues during the build process?

@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 7 times, most recently from c358fd5 to 1be895a on August 29, 2025 06:45
@DoubleClark

May I ask about the quantization method of the FP4 model? It seems you use 16 elements per group instead of 128 to reduce accuracy loss, but I still wonder about the quantization method and how its performance compares with the original FP8 model. Also, may I ask about the computation type of the following GEMM? It seems the activation and weight are FP4; on Blackwell the result might be accumulated in FP32, so how does this work with an 8-bit scale? Is it possible to make this work on Hopper (FP4 dequantized to FP8 might require some scale transform)? If you could share the FP4 GEMM application on Hopper and Blackwell, it would be a great help. Thanks.

@shifangx
Contributor Author

shifangx commented Aug 29, 2025

May I ask about the quantization method of the FP4 model? It seems you use 16 elements per group instead of 128 to reduce accuracy loss, but I still wonder about the quantization method and how its performance compares with the original FP8 model. Also, may I ask about the computation type of the following GEMM? It seems the activation and weight are FP4; on Blackwell the result might be accumulated in FP32, so how does this work with an 8-bit scale? Is it possible to make this work on Hopper (FP4 dequantized to FP8 might require some scale transform)? If you could share the FP4 GEMM application on Hopper and Blackwell, it would be a great help. Thanks.

Hi @DoubleClark, if you are interested in FP4 inference and training, perhaps these blogs can provide some help.

  • For inference:

https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/

  • For training:

https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/
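
As a rough arithmetic sketch of how an 8-bit (E4M3) block scale and a per-tensor FP32 scale combine with FP4 (E2M1) values, assuming the publicly documented NVFP4 recipe rather than this PR's kernels (the function and argument names below are illustrative only):

import torch

def nvfp4_dequant_sketch(q_e2m1: torch.Tensor,           # FP4 values held in float for illustration
                         block_scale_e4m3: torch.Tensor,  # one FP8 (E4M3) scale per 16-element group
                         global_scale: torch.Tensor       # per-tensor fp32 scale
                         ) -> torch.Tensor:
    # Reconstruction encoded by the scales: x_hat = q * s_block / s_global.
    # This is the arithmetic the scales represent, independent of where the
    # hardware applies it; the MMA itself accumulates in higher precision.
    s = block_scale_e4m3.float().repeat_interleave(16, dim=-1)
    return q_e2m1.float() * s / global_scale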

@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 4 times, most recently from dad206a to 0cfe452 on September 1, 2025 03:30
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch from cb1757a to 9d9e395 on September 3, 2025 10:01
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 6 times, most recently from bf1f716 to 5deac0f on September 6, 2025 05:30
change from x_sf_scale to x_global_scales.
change from use_ue8m0_for_sf to use_ue8m0_for_nvfp4_x_scale.
set x_scale dtype to torch::kFloat8_e4m3fn if use_ue8m0_for_nvfp4_x_scale == False, and to torch::kUInt8 if use_ue8m0_for_nvfp4_x_scale == True.
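
A minimal Python-side sketch of the dtype choice described in the note above (the helper name is hypothetical, not this PR's API):

import torch

def x_scale_dtype(use_ue8m0_for_nvfp4_x_scale: bool) -> torch.dtype:
    # E4M3 block scales can be exposed directly as float8_e4m3fn tensors; UE8M0
    # scales (power-of-two exponent bytes) have no dedicated torch dtype, so raw
    # uint8 is used instead.
    return torch.uint8 if use_ue8m0_for_nvfp4_x_scale else torch.float8_e4m3fn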
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch 6 times, most recently from 63ad6b4 to 8cc65fd on September 9, 2025 15:30
@shifangx force-pushed the shifang/ll_dispatch_nvfp4 branch from 8cc65fd to d89a25b on September 10, 2025 09:43
// Split the packed scale index into the block dimension and the position within the pack.
const auto dim1_offset = j / num_elems_per_pack;
const auto dim4_offset = j % num_elems_per_pack;
// Load the j-th scale from the source buffer.
auto scale = ld_nc_global(src_scales + j);
// Flatten the 5-D index into the per-expert recv_x_scales buffer (the innermost dimension is contiguous).
const auto offset = dim0_offset * dim0_stride + dim1_offset * dim1_stride + dim2_offset * dim2_stride + dim3_offset * dim3_stride + dim4_offset;
Contributor

qq: looks like the physical layout is 6D, thus curious why we only have 5 dim here

Contributor Author

Thanks for your kind review.

recv_x_scales[offset] = scale;
recv_x_scales is only for one expert, so its layout is 5D.

Contributor

@fzyzcjy Sep 10, 2025

oh i see, looks reasonable
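
To illustrate the 6-D vs. 5-D point from the exchange above, a tiny sketch with made-up dimension sizes (not the PR's actual shapes): the full scale buffer spans all local experts, and slicing out one expert leaves the 5-D view that the kernel indexes with five offsets.

import torch

num_local_experts, d0, d1, d2, d3, d4 = 2, 3, 4, 5, 6, 4   # hypothetical sizes
all_scales = torch.empty(num_local_experts, d0, d1, d2, d3, d4, dtype=torch.uint8)
recv_x_scales = all_scales[0]    # one expert's slice of the buffer
assert recv_x_scales.dim() == 5  # hence only five offsets in the kernel snippet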

@shifangx
Contributor Author

For the FP4 quantization, this PR refers to cvt_warp_fp16_to_fp4 in
https://github.com/flashinfer-ai/flashinfer/blob/88e333e038c6fce317e959261f165a0618fa4f3c/csrc/nv_internal/tensorrt_llm/kernels/quantization.cuh#L402

For the scale layout and shape, this PR refers to test_quantize_to_fp4_grouped
in https://github.com/sgl-project/sglang/blob/b0d25e72c401f37b55d689ddbf05b8c583afe854/sgl-kernel/tests/test_fp4_quantize.py#L178
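
For readers who want a host-side picture of the referenced recipe, a hedged sketch of per-16-element NVFP4-style quantization (names and the exact rounding are illustrative; the actual implementation is the CUDA cvt_warp_fp16_to_fp4 kernel linked above):

import torch

E2M1_MAX = 6.0     # largest magnitude representable in FP4 (E2M1)
E4M3_MAX = 448.0   # largest magnitude representable in FP8 (E4M3)

def nvfp4_quant_sketch(x: torch.Tensor, group_size: int = 16):
    # Quantize a [rows, cols] float tensor into E2M1-range values with one E4M3
    # scale per 16-element group, following the commonly published NVFP4 recipe.
    rows, cols = x.shape
    x = x.float()
    # Per-tensor global scale keeps the per-group scales inside the E4M3 range.
    global_scale = (E4M3_MAX * E2M1_MAX) / x.abs().max().clamp(min=1e-12)
    groups = x.view(rows, cols // group_size, group_size)
    # One scale per group, folded with the global scale and rounded to E4M3.
    block_scale = (groups.abs().amax(dim=-1, keepdim=True) / E2M1_MAX) * global_scale
    block_scale_e4m3 = block_scale.to(torch.float8_e4m3fn)
    # Scale elements into the E2M1 range; a real kernel would now round to E2M1.
    q = (groups * global_scale / block_scale_e4m3.float().clamp(min=1e-12)).clamp(-E2M1_MAX, E2M1_MAX)
    return q.view(rows, cols), block_scale_e4m3.view(rows, cols // group_size), global_scale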

@fzyzcjy
Contributor

fzyzcjy commented Sep 12, 2025

The accuracy issue is fixed now.

@kaixih

kaixih commented Oct 2, 2025

@shifangx Anything blocking this merge?

@shifangx changed the base branch from main to hybrid-ep on October 7, 2025 03:46
@shifangx
Contributor Author

shifangx commented Oct 7, 2025

@shifangx Anything blocking this merge?

I will merge #341 into the hybrid-ep branch after my vacation.
The reason the NVFP4 dispatch PR cannot be merged into main is that the NVFP4 recipe might change in the future.
