Commit 0641c08

add permute/unpermute, add sync free, refactor
1 parent c5a3e9b commit 0641c08

14 files changed: +1903 -681 lines changed

Hybrid-EP_Intra-node_Implementation.md

Lines changed: 24 additions & 34 deletions
@@ -118,19 +118,25 @@ This document introduces the Hybrid Expert Parallel (Hybrid-EP) implementation t

 ### New Files
 ```
-csrc/
-├── hybrid_ep.cu # Main CUDA implementation
-├── hybrid_ep.cuh # Header definitions
-└── kernels/
-    ├── hybrid_ep_backend.cuh # Backend core implementation
-    └── hybrid_ep_backend_configs.hpp # Configuration parameters
+csrc/hybrid_ep/
+├── hybrid_ep.cu # Main CUDA implementation
+├── hybrid_ep.cuh # Header definitions
+├── pybind_hybrid_ep.cu # PyBind bindings
+├── config.cuh # Config definitions required by hybrid-EP kernels
+├── utils.cuh # Utility helpers and macros
+├── allocator/ # Allocator for memory accessible by remote ranks
+├── backend/ # Core Hybrid-EP kernel implementations
+│   └── hybrid_ep_backend.cuh
+├── executor/ # Kernel runner
+├── extension/ # Useful extensions
+└── jit/ # JIT compiler

 deep_ep/
-├── hybrid_ep_buffer.py # Python interface
-└── buffer.py # Buffer management
+├── hybrid_ep_buffer.py # Python interface
+└── buffer.py # Buffer management

 tests/
-└── test_mnnvlink_hybridep.py # Multi-node NVLink testing and Intra-node testing
+└── test_hybrid_ep.py # Hybrid-EP tests
 ```

 ### Build Instructions
@@ -139,33 +145,29 @@ Follow the same build process as the main branch. No additional dependencies req
 ## 🚀 Usage Guide

 ### Quick Start
-Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples including:
-- Multi-node NVLink configuration
+Refer to `tests/test_hybrid_ep.py` for comprehensive usage examples including:
+- Multi-node configuration
 - Intra-node testing scenarios
+- Inter-node testing will come soon
 - Performance benchmarking setups

 ### Important Configuration Note
-**Current Limitation**: Due to template-based optimization, parameters in the Python test file must match those defined in `csrc/kernels/hybrid_ep_backend_configs.hpp`. After modifying the header file, recompilation and reinstallation are required. Here are Important parameter settings in `hybrid_ep_backend_configs.hpp`:
+Here are important parameter settings in `csrc/hybrid_ep/config.cuh`. You can modify these parameters via `HybridEpBuffer.init_config()` or by setting proper environment variables (see `deep_ep/hybrid_ep_buffer.py`) to achieve better performance/usability:

 - HIDDEN_DIM
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`HIDDEN_DIM`).
+Hidden size (must match model hidden dimension).

-- MAX_NUM_OF_TOKENS_PER_RANK
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`MAX_NUM_OF_TOKENS_PER_RANK`).
+- MAX_NUM_OF_TOKENS_PER_RANK
+The largest sequence length for the input of the dispatch kernel.

 - NUM_OF_EXPERTS_PER_RANK
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`NUM_LOCAL_EXPERTS`).
+Number of experts hosted by each rank.

 - NUM_OF_NODES
-Number of NVLink domains**, not the number of OS nodes / containers.
+**Number of NVLink domains**, not the number of OS nodes / containers.

 - NUM_OF_RANKS_PER_NODE
 Number of ranks within one NVLink domain.
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`NUM_OF_RANKS_PER_NODE` & `args.num_processes`).
-
-- USE_MNNVLINK
-`false` → single-node (pure NVLink inside one box)
-`true` → multi-node NVLink domain (MNNVLINK enabled).

 - NUM_THREADS_PER_BLOCK_PREPROCESSING_API
 Thread-block width for the preprocessing kernel.
@@ -181,9 +183,6 @@ Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples incl
 - NUM_OF_BLOCKS_DISPATCH_API
 Number of CTAs to launch for dispatch; controls how many SMs are used.

-- FORWARD_DISPATCH_API
-Set to `true` for forward dispatch path, `false` for backward.
-
 - NUM_OF_STAGES_G2S_COMBINE_API
 Pipeline depth for global-to-shared (G2S) in combine.
 Same shared-memory trade-off as dispatch.
@@ -195,12 +194,6 @@ Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples incl
 - NUM_OF_BLOCKS_COMBINE_API
 Number of CTAs for combine kernels.
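The topology parameters listed in this diff imply a few derived quantities. The following is a minimal sketch, assuming the world size equals `NUM_OF_NODES` times `NUM_OF_RANKS_PER_NODE` and that ranks are numbered contiguously within each NVLink domain. `HybridEpTopology` and its members are hypothetical names used only for illustration; they are not part of `deep_ep`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridEpTopology:
    """Hypothetical helper; fields mirror the parameter names in the README."""

    num_of_nodes: int             # NVLink domains, not OS nodes / containers
    num_of_ranks_per_node: int    # ranks within one NVLink domain
    num_of_experts_per_rank: int  # experts hosted by each rank

    def __post_init__(self) -> None:
        # Reject non-positive values early instead of failing inside a kernel.
        for name, value in vars(self).items():
            if value < 1:
                raise ValueError(f"{name} must be >= 1, got {value}")

    @property
    def world_size(self) -> int:
        # Assumption: every NVLink domain holds the same number of ranks.
        return self.num_of_nodes * self.num_of_ranks_per_node

    @property
    def total_experts(self) -> int:
        return self.world_size * self.num_of_experts_per_rank

    def domain_of(self, rank: int) -> int:
        # Assumption: ranks are numbered contiguously per NVLink domain.
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} outside [0, {self.world_size})")
        return rank // self.num_of_ranks_per_node


# Example: a single NVLink domain with 8 ranks and 4 experts per rank.
topo = HybridEpTopology(num_of_nodes=1, num_of_ranks_per_node=8,
                        num_of_experts_per_rank=4)
print(topo.world_size, topo.total_experts, topo.domain_of(5))
```

Checking these values on the Python side before creating a buffer is one way to catch a mismatch between the launch configuration and the expert layout.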

-- BACKWARD_COMBINE_API
-Set to `true` for backward combine path, `false` for forward.
-
-**Future Enhancement**: We plan to implement Just-In-Time (JIT) compilation to eliminate this manual configuration requirement and improve developer experience.
-
-
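The updated note says parameters can be changed through `HybridEpBuffer.init_config()` or through environment variables. The sketch below shows one generic way an environment override can be layered over defaults. The `HYBRID_EP_` prefix, the `resolve_int_params` helper, and the placeholder default values are all assumptions made for illustration; the real variable names are defined in `deep_ep/hybrid_ep_buffer.py` and are not reproduced here.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_int_params(
    defaults: Mapping[str, int],
    env: Optional[Mapping[str, str]] = None,
    prefix: str = "HYBRID_EP_",  # hypothetical prefix, not the library's
) -> Dict[str, int]:
    """Return defaults overridden by integer-valued environment variables."""
    env = os.environ if env is None else env
    resolved: Dict[str, int] = {}
    for name, default in defaults.items():
        raw = env.get(prefix + name)
        if raw is None:
            resolved[name] = default  # no override present
            continue
        try:
            value = int(raw)
        except ValueError:
            raise ValueError(f"{prefix + name}={raw!r} is not an integer") from None
        if value < 1:
            raise ValueError(f"{prefix + name} must be >= 1, got {value}")
        resolved[name] = value
    return resolved


# Placeholder defaults, chosen for the example only.
defaults = {"NUM_OF_BLOCKS_DISPATCH_API": 32, "NUM_OF_BLOCKS_COMBINE_API": 32}
fake_env = {"HYBRID_EP_NUM_OF_BLOCKS_DISPATCH_API": "16"}
print(resolve_int_params(defaults, env=fake_env))
```

Passing the environment as an argument keeps the helper easy to test without mutating `os.environ`.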
 ---

 ## 📋 Implementation Status & Roadmap
@@ -215,12 +208,9 @@ Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples incl
 ### 🚧 Upcoming Features
 - **Low Latency Mode**: Enhanced performance for latency-critical workloads
 - **RDMA Integration**: High-performance inter-node communication
-- **JIT Compilation**: Dynamic parameter configuration without recompilation

 ### ⚠️ Current Limitations
-- Template-based implementation requires recompilation for parameter changes
 - RDMA functionality not yet available (under final testing)
-- Configuration parameters must be manually synchronized between Python and C++ files

 ### 🎯 Migration Notes
 This implementation maintains full backward compatibility with DeepEP. Users can seamlessly integrate Hybrid-EP into existing workflows without code modifications.
