Add HybridEP solution for normal mode intranode dispatch/combine #420
Hybrid Expert Parallel (Hybrid-EP) Implementation
Overview
This PR adds Hybrid Expert Parallel (Hybrid-EP), an NVIDIA-developed implementation of all-to-all communication for large-scale Mixture of Experts (MoE) models, to the DeepEP library. It is designed around NVIDIA GPU hardware capabilities, with the goal of using fewer Streaming Multiprocessors (SMs) for communication while increasing communication bandwidth and overall throughput.
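For readers new to MoE all-to-all, the dispatch/combine pattern named in the PR title can be modeled in a few lines of plain Python. This is only a conceptual sketch and not the Hybrid-EP or DeepEP API: the function names, the contiguous expert-to-rank layout (`experts_per_rank`), and the unweighted sum in `combine` are all assumptions made for illustration.

```python
from collections import defaultdict


def dispatch(tokens, routing, experts_per_rank):
    """Toy dispatch: route each token to the rank hosting each selected expert.

    Assumes (hypothetically) that experts are laid out contiguously, so
    expert e lives on rank e // experts_per_rank.
    """
    recv = defaultdict(list)  # rank -> [(source token index, expert id, token)]
    for idx, (tok, experts) in enumerate(zip(tokens, routing)):
        for expert in experts:
            rank = expert // experts_per_rank
            recv[rank].append((idx, expert, tok))
    return recv


def combine(recv, num_tokens, hidden, expert_fn):
    """Toy combine: sum every expert output back into its source token slot."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for items in recv.values():
        for idx, expert, tok in items:
            result = expert_fn(expert, tok)
            for j in range(hidden):
                out[idx][j] += result[j]
    return out


# Two tokens, four experts spread over two ranks (two experts per rank).
tokens = [[1.0, 2.0], [3.0, 4.0]]
routing = [[0, 3], [1]]  # token 0 -> experts 0 and 3, token 1 -> expert 1
recv = dispatch(tokens, routing, experts_per_rank=2)
# Stand-in "expert": scale the token by (expert id + 1).
out = combine(recv, len(tokens), 2, lambda e, x: [(e + 1) * v for v in x])
```

In this model, dispatch is the scatter step that moves token data toward the experts, and combine is the matching gather-and-reduce step that returns expert outputs to the originating token.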
🎯 Design Goals
🏗️ Core Architecture
Communication Operators
Hierarchical Communication Design
*Note: RDMA functionality will be available in upcoming releases following comprehensive testing.
🔧 Implementation Features
Hardware Optimizations
Supported Data Types
CUDA Graph Integration
*RDMA features are currently under final testing and will be released shortly.
📊 Performance Results
B200 Platform
Test Configuration:
Performance Comparison (Bandwidth in GB/s):
Key Performance Improvements (at 16 SM):
GB200 Platform
Test Configuration:
Note: All bandwidth values represent algorithm bandwidth.
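As a reminder of what algorithm bandwidth usually means, the sketch below divides the payload size of one operation by its elapsed time. The helper names, the decimal GB convention (1 GB = 1e9 bytes), and the example sizes are assumptions for illustration; they are not taken from this PR's benchmark code, which may count bytes differently.

```python
def payload_bytes(num_tokens: int, hidden_dim: int, bytes_per_element: int) -> int:
    """Bytes in one token batch (hypothetical way of sizing the payload)."""
    return num_tokens * hidden_dim * bytes_per_element


def algorithm_bandwidth_gbps(num_bytes: int, elapsed_s: float) -> float:
    """Algorithm bandwidth: payload size divided by elapsed time, in GB/s."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return num_bytes / elapsed_s / 1e9


# Illustrative numbers only: 4096 tokens, hidden size 7168, 2-byte elements,
# and a made-up kernel time of 100 microseconds.
size = payload_bytes(4096, 7168, 2)
bandwidth = algorithm_bandwidth_gbps(size, 100e-6)
```

Algorithm bandwidth does not correct for how many links or hops the data actually crosses, so it is best compared only between runs that use the same configuration.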
HybridEP Performance Results (Bandwidth in GB/s):
DeepEP Performance Results (Bandwidth in GB/s):
GB200 Performance Highlights:
🏛️ Code Structure
New Files
Build Instructions
Follow the same build process as the main branch. No additional dependencies are required.
🚀 Usage Guide
Quick Start
Refer to tests/test_mnnvlink_hybridep.py for comprehensive usage examples.
Important Configuration Note
Current Limitation: Due to template-based optimization, parameters in the Python test file must match those defined in csrc/kernels/hybrid_ep_backend_configs.hpp. After modifying the header file, recompilation and reinstallation are required.
Future Enhancement: We plan to implement Just-In-Time (JIT) compilation to eliminate this manual configuration requirement and improve developer experience.
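Until JIT compilation lands, one way to catch a mismatch between the Python test parameters and the compiled header is a small pre-flight check. The sketch below is hypothetical: it assumes the header declares its settings as simple integer `#define` or `constexpr` lines, and the constant names (`HIDDEN_DIM`, `NUM_OF_EXPERTS`) are invented for illustration. Adjust the patterns and names to whatever the real header uses.

```python
import re

# Assumed declaration styles; the real header may be laid out differently.
_CONSTEXPR = re.compile(r"constexpr\s+\w+\s+(\w+)\s*=\s*(\d+)\s*;")
_DEFINE = re.compile(r"^\s*#define\s+(\w+)\s+(\d+)\s*$", re.MULTILINE)


def parse_header_constants(header_text: str) -> dict:
    """Collect integer constants declared via #define or constexpr."""
    constants = {}
    for pattern in (_CONSTEXPR, _DEFINE):
        for name, value in pattern.findall(header_text):
            constants[name] = int(value)
    return constants


def find_mismatches(python_params: dict, header_constants: dict) -> dict:
    """Return {name: (python value, header value)} for every disagreement.

    A header value of None means the name was not found in the header.
    """
    mismatches = {}
    for name, value in python_params.items():
        compiled = header_constants.get(name)
        if compiled != value:
            mismatches[name] = (value, compiled)
    return mismatches


# Hypothetical header contents and test parameters.
header = "#define HIDDEN_DIM 7168\nconstexpr int NUM_OF_EXPERTS = 64;\n"
consts = parse_header_constants(header)
problems = find_mismatches({"HIDDEN_DIM": 7168, "NUM_OF_EXPERTS": 32}, consts)
```

In practice the header text would be read from csrc/kernels/hybrid_ep_backend_configs.hpp, and a non-empty result would signal that the header needs editing, followed by a rebuild and reinstall, before the test is run.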
📋 Implementation Status & Roadmap
✅ Current Features
🚧 Upcoming Features
🎯 Migration Notes
This implementation maintains full backward compatibility with DeepEP. Users can seamlessly integrate Hybrid-EP into existing workflows without code modifications.