
Conversation

@fzyzcjy (Contributor) commented Jul 3, 2025

The code diff is not meant for merging; it demonstrates how the experiments below were done. If anyone is interested, or this direction looks acceptable to merge, I am happy to polish the code and work on it further!

The code and experiment data are taken from older experiments done for my previous #249.

Figure 1: num-sm vs performance
As can be seen, when using 9 warpgroups, i.e. few SMs, performance only slows down slightly. This makes a simple overlap between this kernel and computation feasible.


For dispatch we may need to do extra work though, since the warp specialization may be suboptimal when there are few SMs.
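
To make the intended overlap concrete, here is a minimal sketch of launching the few-SM communication kernel on its own stream next to a compute kernel. The names `buffer.low_latency_combine` and its `num_sms` argument are hypothetical placeholders, not the actual DeepEP API; only the stream structure is the point.

```python
import torch

# Hypothetical sketch: `buffer.low_latency_combine` and `num_sms` are
# placeholders for whatever the few-SM communication entry point looks like.
comm_stream = torch.cuda.Stream()

def overlapped_step(buffer, hidden_states, weight, combine_args):
    # Fork: the communication stream waits for prior work on the default stream.
    comm_stream.wait_stream(torch.cuda.current_stream())

    # Launch the few-SM communication kernel on its own stream; with only
    # ~9 warpgroups resident, most SMs remain free for compute.
    with torch.cuda.stream(comm_stream):
        combined = buffer.low_latency_combine(hidden_states, *combine_args, num_sms=9)

    # Dense compute (e.g. a GEMM for another micro-batch) runs concurrently
    # on the default stream, using the SMs the communication kernel left idle.
    out = hidden_states @ weight

    # Join: downstream work must not read `combined` before the
    # communication kernel has finished.
    torch.cuda.current_stream().wait_stream(comm_stream)
    return out, combined
```

Whether the two kernels actually overlap in practice depends on their SM occupancy and the hardware scheduler; the sketch only shows the stream structure.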

@fzyzcjy changed the title from "Use few SMs for low-latency mode with almost full speed" to "Use few SMs for low-latency mode" on Jul 3, 2025
@fzyzcjy changed the title from "Use few SMs for low-latency mode" to "Allow using few SMs for low-latency mode" on Jul 3, 2025
@rubbberrabbit commented Aug 27, 2025

As can be seen, when using 9 warpgroups, i.e. few SMs, performance only slows down slightly. This makes a simple overlap between this kernel and computation feasible.

Hi, that is a good idea, but I want to understand the mentioned “overlap” in more detail. Does it refer to the overlap between the dispatch/combine kernels and the model computation kernels, i.e. two separate streams as in prefill? But during decode a CUDA Graph is enabled; wouldn’t the CUDA Graph turn them into a single sequential execution stream and eliminate that overlap?

@alpha-baby (Contributor) commented

The DeepEP kernel uses fewer SMs, so what will use the extra SMs? For example, in the decode phase of SGLang, the communication kernel and the compute kernel run serially.

@rubbberrabbit

The DeepEP kernel uses fewer SMs, so what will use the extra SMs? For example, in the decode phase of SGLang, the communication kernel and the compute kernel run serially.

I think it may involve two-batch overlap, so the communication kernel and the compute kernel can ideally run at the same time. But as I mentioned, with a CUDA Graph it turns into serial kernels. I am very curious how to avoid that.
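
One relevant point: a CUDA Graph records a dependency DAG rather than a single linear stream, so if the communication and compute kernels are captured on two streams that only fork and join at the boundaries, they remain independent branches in the graph and can still overlap on replay. A minimal sketch, assuming PyTorch's multi-stream capture rules, with `comm_kernel` and `compute_kernel` as hypothetical placeholders for the real launches:

```python
import torch

# Side-stream work is captured as long as it forks from and joins back to
# the capture stream. `comm_kernel` / `compute_kernel` are placeholders.
comm_stream = torch.cuda.Stream()

def capture_overlapped_graph(x, w):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        cap_stream = torch.cuda.current_stream()

        # Fork: the side stream waits on the capture stream before recording.
        comm_stream.wait_stream(cap_stream)
        with torch.cuda.stream(comm_stream):
            y_comm = comm_kernel(x)       # few-SM communication branch

        y_comp = compute_kernel(x, w)     # compute branch on the capture stream

        # Join: the capture stream waits for the side stream before capture ends.
        cap_stream.wait_stream(comm_stream)

    # On replay, the two branches are independent nodes in the graph's
    # dependency DAG, so the scheduler may still run them concurrently.
    return g, y_comm, y_comp
```

Whether the replayed graph actually overlaps the branches still depends on SM availability, which is exactly what limiting the communication kernel to a few SMs is meant to help with.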
