Conversation

@ZhiyiHu1999 (Contributor) commented Jul 31, 2025

1. Motivation

In MoE training and the prefilling phase of inference, the current ring-based RDMA buffer implementation for normal kernels wastes a significant amount of SM resources. Because the ring buffer is small and is reused frequently for token transmission, SMs are often stalled, continuously polling for RDMA buffer availability instead of performing useful computation. This inefficient resource usage severely limits overall system throughput.

To address this issue, this PR implements an SM-friendly buffer design that frees SMs from buffer-polling duties. The design is inspired by the large RDMA buffer approach discussed in #39. By allocating a larger RDMA buffer in HBM, tokens can be moved from SMs to the RDMA buffer in one go. While the NIC handles transmission asynchronously in the background, SMs can immediately resume computation; once the data transfer completes, SMs can process the received tokens without blocking.

2. Design

2.1. Feature

SM Free mode decouples the execution phases of the native mode: when the user first launches an internode dispatch/combine, only the send phase is executed and a recv hook is returned. The user can wait for the network transmission to complete and then launch the receive phase of the internode dispatch/combine via the recv hook, as sketched below.
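A minimal usage sketch of hook mode, for illustration only: `return_recv_hook` and `get_normal_hook_rdma_size_hint()` are the names given in this PR, but the other argument names, return values, and placeholders below are assumptions and may differ from the actual API.

```python
import torch.distributed as dist
import deep_ep

# Placeholders: the process group and inputs are set up as in the existing
# DeepEP examples; `x` holds token activations, `topk_idx` the selected experts.
group = dist.new_group(list(range(dist.get_world_size())))
x = ...
topk_idx = ...

# Estimate the enlarged RDMA buffer size with the helper added by this PR
# (argument names here are illustrative).
rdma_bytes = deep_ep.Buffer.get_normal_hook_rdma_size_hint(
    num_max_dispatch_tokens_per_rank=4096, hidden=7168, num_ranks=group.size())
buffer = deep_ep.Buffer(group, num_rdma_bytes=rdma_bytes)

# Send phase: with return_recv_hook=True the kernel issues the RDMA sends and
# returns a callable hook instead of blocking until tokens arrive.
recv_x, handle, event, hook = buffer.dispatch(
    x, topk_idx=topk_idx, return_recv_hook=True)

# Overlap window: SMs run other compute (e.g. a GEMM) while the NIC and the
# network move the tokens.
# ... user computation here ...

# Receive phase: the hook launches the receive part of the kernel; afterwards
# recv_x holds the dispatched tokens.
hook()
```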
2.2. Implementation

2.2.1. Principles

  • Full Compatibility: Seamlessly integrates with existing native communication modes, requiring minimal changes to existing codebases

2.2.2. Highlights

  • Mode Control via return_recv_hook: Introduces a user-controllable argument return_recv_hook to switch between native mode and hook mode.
  • RDMA Buffer Management:
    • Retains the native buffer structure for ease of integration.
    • Enlarges token capacity per RDMA recv chunk to fit all tokens in one batch.
    • Provides a utility function get_normal_hook_rdma_size_hint() to help users estimate the minimum required RDMA buffer size.
  • Improved Compute Stream Utilization:
    • In hook mode, the kernel runs on the compute stream, enabling more efficient offloading of data transfer tasks to the NIC and the network while freeing up SMs.
    • Compared to native mode (which allocates 2 SMs per channel), hook mode maps one channel per SM, improving scalability and SM utilization.

3. Performance Evaluation

3.1. Experiment Setup

  • 2/4 nodes, with 8 × H20 GPUs per node.
  • 4096 tokens per batch, 7168 hidden, top-8 experts.

3.2. Effect

3.2.1. Estimated Performance (native mode → hook mode)

  • A GEMM of shape [4096, 4096] is used to overlap with the network phase. It typically takes 4-5 μs on my system, which is sufficient to cover the token transmission.
  • For hook mode, the kernel execution time is the sum of the send kernel time and the receive kernel time, and the RDMA bandwidth is calculated as:
    RDMA Bandwidth = RDMA Bytes / Kernel Execution Time

Dispatch Kernel Execution Time & Bandwidth

| #EP | FP8 Dispatch Kernel Execution Time | FP8 Dispatch RDMA Bandwidth | BF16 Dispatch Kernel Execution Time | BF16 Dispatch RDMA Bandwidth |
| --- | --- | --- | --- | --- |
| 16 | 1358 → 765 μs | 44 → 79 GB/s | 2530 → 1381 μs | 46 → 85 GB/s |
| 32 | 3535 → 884 μs | 30 → 124 GB/s | 6780 → 2468 μs | 32 → 86 GB/s |

Combine Kernel Execution Time & Bandwidth

| #EP | BF16 Combine Kernel Execution Time | BF16 Combine RDMA Bandwidth |
| --- | --- | --- |
| 16 | 2590 → 1425 μs | 45 → 82 GB/s |
| 32 | 6850 → 2250 μs | 31 → 81 GB/s |
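
As a sanity check on the bandwidth formula above (my arithmetic, not from the PR): reading the EP = 16 BF16 dispatch row backwards, both modes imply roughly the same per-rank RDMA traffic,

    85 GB/s × 1381 μs ≈ 117 MB (hook)   vs.   46 GB/s × 2530 μs ≈ 116 MB (native)

so the hook-mode gains show up as higher effective bandwidth over the same byte count rather than as less data moved.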

3.3. Cost

3.3.1. HBM Cost

The main HBM cost in hook mode comes from two sources:

  • RDMA Buffer:
    rdma_buffer_size = num_max_dispatch_tokens_per_rank × hidden_size × size_of(element) × num_nodes × 2
    The HBM cost of the large RDMA buffer grows with num_max_dispatch_tokens_per_rank and num_nodes; for our experiment setup, the RDMA buffer on each rank takes about 270 MB (a rough estimate follows this list).

  • NVLink Buffer:
    nvl_buffer_size = num_max_nvl_chunked_recv_tokens × hidden_size × size_of(element) × num_nvl_peers × num_channels
    The HBM used for the NVLink buffer grows with the number of SMs (channels) in hook mode, but each channel is assigned fewer tokens, so the NVLink recv chunk can be shrunk and the overall HBM cost for this part changes only slightly.
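
As a rough cross-check of the RDMA buffer formula (my arithmetic, assuming BF16 elements, i.e. size_of(element) = 2 bytes, and the 2-node setup):

    rdma_buffer_size ≈ 4096 × 7168 × 2 B × 2 × 2 ≈ 235 MB

which is on the same order as the ~270 MB reported above; the exact figure presumably also accounts for per-token metadata such as scales and source indices.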

3.3.2. Bandwidth Cost

In SM Free mode, the data movement in the recv phase of dispatch and in the send phase of combine is inter-GPU traffic over NVLink, so the corresponding kernel execution times are also bounded by NVLink bandwidth.


4. Roadmap

  • Code Refactor
  • Optimize HBM Usage
  • Implement Dynamic Resizing in RDMA Buffer
  • More TMA Optimization in Normal Hook mode

@ZhiyiHu1999 ZhiyiHu1999 marked this pull request as draft July 31, 2025 07:05
@ZhiyiHu1999 ZhiyiHu1999 marked this pull request as ready for review August 1, 2025 12:30
@polarstormx (Contributor) commented Aug 4, 2025

> nvl_buffer_size = num_max_net_channel_recv_tokens × hidden_size × size_of(element) × num_gpus_per_rank × num_channels

Maybe you mean "num_gpus_per_node"?

@sphish (Collaborator) commented Aug 13, 2025

The original implementation allows NVLink and RDMA transfers to be pipelined, enabling us to utilize both NVLink and RDMA bandwidth simultaneously. I think it is worthwhile to dedicate some SMs for this purpose.
In the next refactoring, we will try to minimize the usage of SMs.

@ZhiyiHu1999 ZhiyiHu1999 force-pushed the feature/sm_free_normal_kernel branch from afcd1aa to 3676f96 Compare August 29, 2025 02:10
@ZhiyiHu1999 ZhiyiHu1999 force-pushed the feature/sm_free_normal_kernel branch from 3676f96 to a35f83b Compare September 11, 2025 09:42