
Conversation

@fzyzcjy (Contributor) commented Jul 3, 2025

The code diff is not meant for merging; it demonstrates how the experiments below were done. If anyone is interested, or this direction looks acceptable to merge, I am happy to polish the code and work on it further!

The code and experiment data are taken from older experiments done for my previous #249.

Figure 1: num-sm vs performance
As can be seen, when using 9 warpgroups, i.e. few SMs, performance only slows down slightly. This makes a simple overlap between this kernel and computation feasible.


For dispatch we may need to do extra work though, since the warp specialization may be suboptimal when there are few SMs.
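
To make the intended overlap concrete, here is a minimal sketch of launching the few-SM communication kernel on its own stream next to a compute kernel. The names `buffer.low_latency_combine` and its `num_sms` argument are hypothetical placeholders, not the actual DeepEP API; only the stream structure is the point.

```python
import torch

# Hypothetical sketch: `buffer.low_latency_combine` and `num_sms` are
# placeholders for whatever the few-SM communication entry point looks like.
comm_stream = torch.cuda.Stream()

def overlapped_step(buffer, hidden_states, weight, combine_args):
    # Fork: the communication stream waits for prior work on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())

    # Launch the few-SM communication kernel on its own stream; with only
    # ~9 warpgroups resident, most SMs remain free for compute.
    with torch.cuda.stream(comm_stream):
        combined = buffer.low_latency_combine(hidden_states, *combine_args, num_sms=9)

    # Dense compute (e.g. a GEMM for another micro-batch) runs concurrently
    # on the default stream, using the SMs the communication kernel left idle.
    out = hidden_states @ weight

    # Join: downstream work must not read `combined` before the
    # communication kernel has finished.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return out, combined
```

Whether the two kernels actually overlap in practice depends on their SM occupancy and the hardware scheduler; the sketch only shows the stream structure.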

@fzyzcjy changed the title from "Use few SMs for low-latency mode with almost full speed" to "Use few SMs for low-latency mode" on Jul 3, 2025
@fzyzcjy changed the title from "Use few SMs for low-latency mode" to "Allow using few SMs for low-latency mode" on Jul 3, 2025
@rubbberrabbit commented Aug 27, 2025

As can be seen, when using 9 warpgroups, i.e. few SMs, performance only slows down slightly. This makes a simple overlap between this kernel and computation feasible.

Hi, that is a good idea, but I want to understand the mentioned “overlap” in more detail. Does it refer to the overlap between the dispatch/combine kernels and the model computation kernels, i.e. two separate streams as in prefill? But during decode a CUDA Graph is enabled; wouldn’t the CUDA Graph turn them into a single sequential execution stream and eliminate that overlap?

@alpha-baby (Contributor) commented

The DeepEP kernel uses fewer SMs, so what will use the extra SMs? For example, in the decode phase of SGLang, the communication kernel and the compute kernel run serially.

@rubbberrabbit

The DeepEP kernel uses fewer SMs, so what will use the extra SMs? For example, in the decode phase of SGLang, the communication kernel and the compute kernel run serially.

I think it may involve two-batch overlap, so the communication kernel and the compute kernel can ideally run at the same time. But as I mentioned, with a CUDA Graph it turns into serial kernels. I am very curious how to avoid that.
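
One relevant point: a CUDA Graph records a dependency DAG rather than a single linear stream, so if the communication and compute kernels are captured on two streams that only fork and join at the boundaries, they remain independent branches in the graph and can still overlap on replay. A minimal sketch, assuming PyTorch's multi-stream capture rules, with `comm_kernel` and `compute_kernel` as hypothetical placeholders for the real launches:

```python
import torch

# Side-stream work is captured as long as it forks from and joins back to
# the capture stream. `comm_kernel` / `compute_kernel` are placeholders.
comm_stream = torch.cuda.Stream()

def capture_overlapped_graph(x, w):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        cap_stream = torch.cuda.current_stream()

        # Fork: the side stream waits on the capture stream before recording.
        comm_stream.wait_stream(cap_stream)
        with torch.cuda.stream(comm_stream):
            y_comm = comm_kernel(x)       # few-SM communication branch

        y_comp = compute_kernel(x, w)     # compute branch on the capture stream

        # Join: the capture stream waits for the side stream before capture ends.
        cap_stream.wait_stream(comm_stream)

    # On replay, the two branches are independent nodes in the graph's
    # dependency DAG, so the scheduler may still run them concurrently.
    return g, y_comm, y_comp
```

Whether the replayed graph actually overlaps the branches still depends on SM availability, which is exactly what limiting the communication kernel to a few SMs is meant to help with.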
