@Zqy11 Zqy11 commented Sep 2, 2025

1. Motivation

Two-Batch Overlap (TBO) delivers limited benefit in the Decode phase on low-compute-power cards (e.g., H20), for two reasons. First, on the Hopper architecture the WGMMA block_m is 64, so when TBO is enabled with a small Decode batch size, the MLP GEMM performs redundant computation. A positive throughput gain is only observed at larger batch sizes (e.g., 64, 128). Second, at these larger batch sizes, low-compute-power cards like the H20 fail to meet the SLA targets for TPOT/ITL.
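The redundancy argument can be illustrated with a small calculation. This is only a sketch under stated assumptions: `m` is a hypothetical number of rows reaching one expert's GEMM, rows are padded up to whole block_m tiles, and TBO is modeled as an even split into two micro-batches. How the Decode batch size maps to per-expert rows depends on routing and the EP layout and is not modeled here.

```python
import math

BLOCK_M = 64  # WGMMA block_m on Hopper, as stated above


def padded_rows(m: int, block_m: int = BLOCK_M) -> int:
    """Rows the GEMM computes when m rows are padded to whole tiles."""
    return math.ceil(m / block_m) * block_m


def tbo_padded_rows(m: int, block_m: int = BLOCK_M) -> int:
    """Rows computed when m rows are split into two micro-batches.

    The even split is an assumption made for illustration.
    """
    first = m // 2
    second = m - first
    return padded_rows(first, block_m) + padded_rows(second, block_m)


for m in (32, 128, 256):
    print(m, padded_rows(m), tbo_padded_rows(m))
```

In this model, 32 rows cost 64 computed rows without the split and 128 with it, while 128 or 256 rows cost the same either way, which is the shape of the effect described above.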

Therefore, it is necessary to find a solution that can improve Decode throughput even with small batch sizes. Single Batch Overlap (SBO) presents itself as a viable solution.

We implement SBO for DeepSeek-V3/R1 by modifying DeepEP and DeepGEMM. It covers two overlaps: Shared Expert with Dispatch Recv, and Down GEMM with Combine Send.

The overlap of Down GEMM with Combine Send is implemented by modifying SGLang, DeepEP and DeepGEMM, with the detailed implementation available in the PRs below:

We also conduct integration and evaluation in SGLang: sgl-project/sglang#9660.

2. Overlap Design

SBO implements two overlaps for the MoE layers of DeepSeek-V3/R1. The first overlaps the Shared Expert computation with the Dispatch Recv communication, and the second overlaps the Down GEMM computation with the Combine Send communication.
image
The interaction between Down GEMM and Combine Send follows a producer-consumer model synchronized by signals. For each local expert, one signal unit is allocated per block_m tokens. Down GEMM computes the results for those block_m tokens and atomically increments the signal unit each time it completes a portion of the work. Combine Send polls the signal unit and, once its value reaches a threshold, sends the corresponding block_m tokens.
image
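The signaling scheme can be mimicked on the CPU with two threads. This is a minimal sketch, not the kernel code: `run_overlap`, `parts_per_block`, and the choice of threshold (all portions of a block written) are assumptions made for illustration, and a lock stands in for the GPU atomic increment.

```python
import threading
import time

BLOCK_M = 64  # tokens covered by one signal unit (block_m in the description)


def run_overlap(num_tokens: int, parts_per_block: int = 4) -> list:
    """Simulate Down GEMM (producer) and Combine Send (consumer) for one expert."""
    num_blocks = (num_tokens + BLOCK_M - 1) // BLOCK_M
    signals = [0] * num_blocks  # one signal unit per block_m tokens
    lock = threading.Lock()     # stands in for the atomic increment
    sent = []                   # (start, end) token ranges in send order

    def down_gemm() -> None:
        # Each block is produced in several portions; after each portion
        # the producer increments that block's signal unit.
        for _ in range(parts_per_block):
            for block in range(num_blocks):
                with lock:
                    signals[block] += 1

    def combine_send() -> None:
        for block in range(num_blocks):
            # Poll until the signal unit reaches the (assumed) threshold.
            while True:
                with lock:
                    if signals[block] >= parts_per_block:
                        break
                time.sleep(0)  # yield to the producer thread
            start = block * BLOCK_M
            sent.append((start, min(start + BLOCK_M, num_tokens)))

    threads = [
        threading.Thread(target=combine_send),
        threading.Thread(target=down_gemm),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent


print(run_overlap(150))  # [(0, 64), (64, 128), (128, 150)]
```

The consumer never sends a block before its signal unit reaches the threshold, so sending can start as soon as the first block is complete instead of waiting for the whole GEMM.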

3. Modifications

  • Add the ll_overlap_combine Python interface to support overlapping Down GEMM with Combine Send.
  • In low_latency_dispatch, change the type of packed_recv_src_info from kInt32 to kInt64 to track the src_rank for each token.
  • The internode_ll::combine kernel uses fewer SMs when the overlap parameter is enabled and adds overlap-mode logic: polling signals, sending tokens, and putting the finish flag.
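Widening packed_recv_src_info from kInt32 to kInt64 leaves room for a second 32-bit field per token. The layout below is purely hypothetical and is not taken from the PR: it assumes src_rank sits in the upper 32 bits and the previously stored 32-bit value in the lower 32 bits, and the helper names are invented for illustration.

```python
MASK_32 = 0xFFFFFFFF


def pack_src_info(src_rank: int, src_idx: int) -> int:
    """Pack two non-negative 32-bit fields into one 64-bit value (assumed layout)."""
    assert 0 <= src_rank <= MASK_32 and 0 <= src_idx <= MASK_32
    return (src_rank << 32) | src_idx


def unpack_src_info(packed: int) -> tuple:
    """Recover (src_rank, src_idx) from the packed value."""
    return (packed >> 32) & MASK_32, packed & MASK_32


packed = pack_src_info(src_rank=5, src_idx=1234)
print(unpack_src_info(packed))  # (5, 1234)
```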

4. Evaluation

We integrated the modified DeepEP and DeepGEMM into SGLang for performance evaluation.

4.1. Experiment Setup

  • 5 nodes with 8 × H20 GPUs each: 3 prefill nodes, each using TP8, and 2 decode nodes using DP_Attn 16 + EP 16.
  • Input length 4096, output length 1536.

4.2. Performance Evaluation

  • bs 32, origin
image
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.8
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2359.16
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15672509
Request throughput (req/s):              4.34
Input token throughput (tok/s):          17778.82
Output token throughput (tok/s):         6667.06
Total token throughput (tok/s):          24445.88
Concurrency:                             490.01
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   112892.31
Median E2E Latency (ms):                 113847.19
---------------Time to First Token----------------
Mean TTFT (ms):                          640.62
Median TTFT (ms):                        545.06
P99 TTFT (ms):                           1543.37
---------------Inter-Token Latency----------------
Mean ITL (ms):                           73.11
Median ITL (ms):                         71.81
P95 ITL (ms):                            86.02
P99 ITL (ms):                            155.32
Max ITL (ms):                            1543.26
==================================================
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2357.80
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15673361
Request throughput (req/s):              4.34
Input token throughput (tok/s):          17789.05
Output token throughput (tok/s):         6670.89
Total token throughput (tok/s):          24459.95
Concurrency:                             490.83
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   113015.97
Median E2E Latency (ms):                 113951.58
---------------Time to First Token----------------
Mean TTFT (ms):                          724.98
Median TTFT (ms):                        624.73
P99 TTFT (ms):                           1693.64
---------------Inter-Token Latency----------------
Mean ITL (ms):                           73.13
Median ITL (ms):                         71.84
P95 ITL (ms):                            86.57
P99 ITL (ms):                            155.21
Max ITL (ms):                            1081.95
==================================================
  • bs 32, sbo
image
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    4.8
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2211.76
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15673456
Request throughput (req/s):              4.63
Input token throughput (tok/s):          18963.67
Output token throughput (tok/s):         7111.38
Total token throughput (tok/s):          26075.05
Concurrency:                             481.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104017.64
Median E2E Latency (ms):                 105363.65
---------------Time to First Token----------------
Mean TTFT (ms):                          606.28
Median TTFT (ms):                        508.61
P99 TTFT (ms):                           1475.44
---------------Inter-Token Latency----------------
Mean ITL (ms):                           67.35
Median ITL (ms):                         66.10
P95 ITL (ms):                            81.58
P99 ITL (ms):                            141.96
Max ITL (ms):                            1422.74
==================================================
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    5.0
Max request concurrency:                 512
Successful requests:                     10240
Benchmark duration (s):                  2194.12
Total input tokens:                      41943040
Total generated tokens:                  15728640
Total generated tokens (retokenized):    15672577
Request throughput (req/s):              4.67
Input token throughput (tok/s):          19116.14
Output token throughput (tok/s):         7168.55
Total token throughput (tok/s):          26284.70
Concurrency:                             487.92
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   104545.42
Median E2E Latency (ms):                 105483.50
---------------Time to First Token----------------
Mean TTFT (ms):                          619.03
Median TTFT (ms):                        511.23
P99 TTFT (ms):                           1504.27
---------------Inter-Token Latency----------------
Mean ITL (ms):                           67.68
Median ITL (ms):                         66.44
P95 ITL (ms):                            82.13
P99 ITL (ms):                            142.48
Max ITL (ms):                            1024.85
==================================================
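The four runs above can be compared directly. The snippet below only restates the output token throughput and mean ITL figures from the logs and computes the relative change of SBO against origin at each request rate; the dictionary layout and the helper name are illustrative.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# Figures copied from the benchmark logs above, keyed by traffic request rate.
results = {
    4.8: {
        "origin": {"out_tps": 6667.06, "mean_itl": 73.11},
        "sbo": {"out_tps": 7111.38, "mean_itl": 67.35},
    },
    5.0: {
        "origin": {"out_tps": 6670.89, "mean_itl": 73.13},
        "sbo": {"out_tps": 7168.55, "mean_itl": 67.68},
    },
}

for rate, run in results.items():
    tps = relative_change(run["origin"]["out_tps"], run["sbo"]["out_tps"])
    itl = relative_change(run["origin"]["mean_itl"], run["sbo"]["mean_itl"])
    print(f"rate {rate}: output throughput {tps:+.2f}%, mean ITL {itl:+.2f}%")
```

On these numbers, SBO raises output token throughput by about 6.7% at request rate 4.8 and about 7.5% at 5.0, and lowers mean ITL by about 7.9% and 7.5% respectively.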

4.3. Accuracy Tests

  • bs 32, origin
#python -m benchmark.gsm8k.bench_sglang --port 8000 --num-questions 1000
100%|█████████████████████████████████████████████████████████████| 1000/1000 [01:20<00:00, 12.41it/s]
Accuracy: 0.951
Invalid: 0.000
Latency: 80.802 s
Output throughput: 1183.468 token/s
  • bs 32, sbo
#python -m benchmark.gsm8k.bench_sglang --port 8000 --num-questions 1000
100%|█████████████████████████████████████████████████████████████| 1000/1000 [01:17<00:00, 12.87it/s]
Accuracy: 0.950
Invalid: 0.000
Latency: 78.056 s
Output throughput: 1217.443 token/s

4.4. Repro Script

Please refer to sgl-project/sglang#9660.

Contributor

@fzyzcjy fzyzcjy left a comment

a very short glance and only tiny nits

const auto dst_rank = responsible_expert_idx / num_local_experts;
const auto local_expert_idx = responsible_expert_idx % num_local_experts;
// Shared between warps in sms for overlap mode, where each sm only has one warp group
__shared__ int shared_vaild_signal_prefix_sum[288];
Contributor

nit: would be great to make it a const and assert num experts smaller than this etc

Author

Thanks for the comments. I have added const kNumMaxExperts and assert statements.

return combined_x, EventOverlap(event, tensors_to_record if async_finish else None), hook

# noinspection PyTypeChecker
def ll_overlap_combine(self, x: torch.Tensor, topk_idx: torch.Tensor, topk_weights: torch.Tensor, handle: tuple,
Contributor

@fzyzcjy fzyzcjy Sep 2, 2025

nit: wondering whether we should use the existing api since this seems to only add a few flags

(when using it on Blackwell in sgl-project/sglang#9870, I felt it would be great to have one single function)

Author

Thanks for the comments. We initially added the ll_overlap_combine Python interface to facilitate development and testing. We will reference your implementation and switch to using the low_latency_combine interface directly in our refactoring.

Contributor

@fzyzcjy fzyzcjy Sep 2, 2025

great, looking forward to that (and no worries for now - my 9870 also temporarily uses the ll_overlap_combine api)

Author

Hi, I have deleted ll_overlap_combine and now reuse low_latency_combine.

Contributor

looks great

Contributor

@fzyzcjy fzyzcjy left a comment

qq: is this the optimized version, or is there a faster version that has not been pushed here yet? I ask b/c it seems a bit slow and I may try to optimize its speed (on Blackwell), so I want to double check that my changes will not conflict with yours

Author

Zqy11 commented Sep 7, 2025

This is already the latest version, feel free to optimize it.

@Zqy11 Zqy11 marked this pull request as ready for review September 8, 2025 02:42
Co-authored-by: Sulfur6 <[email protected]>
Co-authored-by: wangfakang <[email protected]>
Co-authored-by: alpha-baby <[email protected]>
Co-authored-by: AniZpZ <[email protected]>