[Feat] Single Batch Overlap (SBO): Overlapping of Down GEMM with Combine Send #390
Conversation
Co-authored-by: Sulfur6 <[email protected]>
Co-authored-by: wangfakang <[email protected]>
Co-authored-by: alpha-baby <[email protected]>
a very short glance and only tiny nits
csrc/kernels/internode_ll.cu
```cuda
const auto dst_rank = responsible_expert_idx / num_local_experts;
const auto local_expert_idx = responsible_expert_idx % num_local_experts;
// Shared between warps in sms for overlap mode, where each sm only has one warp group
__shared__ int shared_vaild_signal_prefix_sum[288];
```
nit: would be great to make it a const and assert that the number of experts is smaller than this, etc.
Thanks for the comments. I have added `const kNumMaxExperts` and assert statements.
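For illustration, a minimal sketch of what that guard might look like; the constant name comes from the reply above, while the kernel and launcher names here are hypothetical, not the actual patch:

```cuda
#include <cassert>

// Compile-time upper bound for the shared prefix-sum buffer
// (288 matches the array size in the original diff).
constexpr int kNumMaxExperts = 288;

__global__ void combine_kernel(const int* valid_signals, int num_experts) {
    // Shared between warps in an SM for overlap mode, where each SM only has
    // one warp group; sized by the named constant instead of a magic number.
    __shared__ int shared_valid_signal_prefix_sum[kNumMaxExperts];
    // ... prefix-sum over valid signals and send logic elided ...
}

void launch_combine(const int* valid_signals, int num_experts) {
    // Host-side guard: never launch with more experts than the static buffer holds.
    assert(num_experts <= kNumMaxExperts);
    combine_kernel<<<1, 32>>>(valid_signals, num_experts);
}
```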
deep_ep/buffer.py
```python
        return combined_x, EventOverlap(event, tensors_to_record if async_finish else None), hook

    # noinspection PyTypeChecker
    def ll_overlap_combine(self, x: torch.Tensor, topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: tuple,
```
nit: wondering whether we should use the existing API, since this seems to only add a few flags (when using it on Blackwell in sgl-project/sglang#9870, I feel it would be great to have one single function)
Thanks for the comments. We initially added the `ll_overlap_combine` Python interface to facilitate development and testing. We will reference your implementation and switch to using the `low_latency_combine` interface directly in our refactoring.
great, looking forward to that (and no worries for now - my #9870 also temporarily uses the `ll_overlap_combine` API)
Hi, I have deleted `ll_overlap_combine` and now reuse `low_latency_combine`.
looks great
qq: is this the optimized version, or is there a faster version that is not pushed here yet? It seems a bit slow, and I may want to try optimizing its speed (on Blackwell), so I want to double-check that I will not conflict with your code.
This is already the latest version; feel free to optimize it.
Co-authored-by: Sulfur6 <[email protected]>
Co-authored-by: wangfakang <[email protected]>
Co-authored-by: alpha-baby <[email protected]>
Co-authored-by: AniZpZ <[email protected]>
1. Motivation
The optimization effect of Two-Batch Overlap (TBO) is suboptimal for the Decode phase on low-compute-power cards (e.g., the H20), for two main reasons. First, on the Hopper architecture, the WGMMA block_m is 64; consequently, when TBO is enabled with a small Decode batch size, the MLP GEMM suffers from redundant computation, and a positive throughput gain is only observed at larger batch sizes (e.g., 64 or 128). Second, at these larger batch sizes, low-compute-power cards like the H20 fail to meet the SLA guarantees for TPOT/ITL.
Therefore, it is necessary to find a solution that can improve Decode throughput even with small batch sizes. Single Batch Overlap (SBO) presents itself as a viable solution.
We implement SBO for DeepSeek-V3/R1 by modifying DeepEP and DeepGEMM, covering both the overlap of the Shared Expert with Dispatch Recv and the overlap of the Down GEMM with Combine Send.
The overlap of the Down GEMM with Combine Send is implemented by modifying SGLang, DeepEP, and DeepGEMM; the detailed implementation is available in the related PRs.
We also conduct integration and evaluation in SGLang: sgl-project/sglang#9660.
2. Overlap Design
SBO implements two overlaps for the MoE layers of DeepSeek-V3/R1: one overlaps the Shared Expert computation with the Dispatch Recv communication, and the other overlaps the Down GEMM computation with the Combine Send communication.


The interaction between the Down GEMM and Combine Send is structured as a producer-consumer model synchronized by signals. For each local expert, a signal unit is allocated for every block_m tokens. The Down GEMM computes the results for these block_m tokens and atomically increments the signal unit after completing a portion of the work. The Combine Send polls this signal unit; once the value reaches a threshold, it sends the corresponding block_m tokens.
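A minimal sketch of this handshake, assuming one `int` signal unit per (local expert, block_m tile); the function names, fencing choices, and threshold handling are illustrative assumptions, not the actual DeepEP/DeepGEMM code:

```cuda
// Producer side (Down GEMM epilogue): after a thread block finishes its
// portion of one block_m-token tile, publish the results and bump the count.
__device__ void signal_tile_done(int* signal_unit) {
    __threadfence();             // make the tile's output visible first
    atomicAdd(signal_unit, 1);   // one increment per completed portion
}

// Consumer side (Combine Send): poll the tile's signal unit; once it reaches
// the threshold (the number of producer portions), the block_m tokens are
// complete and can be sent to the destination rank.
__device__ void wait_tile_ready(const int* signal_unit, int threshold) {
    while (*reinterpret_cast<const volatile int*>(signal_unit) < threshold) {}
    __threadfence();             // acquire the produced data before sending
}
```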
3. Modifications
- Add the `ll_overlap_combine` Python interface to support overlapping the Down GEMM with Combine Send.
- In `low_latency_dispatch`, change the type of `packed_recv_src_info` from `kInt32` to `kInt64` to track the src_rank of each token (a hypothetical packing is sketched after this list).
- The `internode_ll::combine` kernel uses fewer SMs when the `overlap` parameter is enabled and integrates the overlap-mode logic, such as polling signals, sending tokens, and putting the finish flag.
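To illustrate the widened `packed_recv_src_info` field from the second item above, one plausible layout (an assumption; the PR's actual bit layout may differ) keeps the original 32-bit source index in the low bits and the src_rank in the high bits:

```cuda
#include <cstdint>

// Hypothetical packing for the kInt64 packed_recv_src_info entries.
__host__ __device__ inline int64_t pack_src_info(int src_rank, int src_idx) {
    return (static_cast<int64_t>(src_rank) << 32) |
           static_cast<uint32_t>(src_idx);  // zero-extend the low 32 bits
}

__host__ __device__ inline void unpack_src_info(int64_t info,
                                                int& src_rank, int& src_idx) {
    src_rank = static_cast<int>(info >> 32);
    src_idx  = static_cast<int>(static_cast<uint32_t>(info));
}
```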
4. Evaluation
We integrated the modified DeepEP and DeepGEMM into SGLang for performance evaluation.
4.1. Experiment Setup
4.2. Performance Evaluation
4.3. Accuracy Tests
4.4. Repro Script
Please refer to sgl-project/sglang#9660.