-└── test_mnnvlink_hybridep.py # Multi-node NVLink testing and Intra-node testing
+└── test_hybrid_ep.py # Hybrid-EP tests
```
### Build Instructions
@@ -139,33 +145,29 @@ Follow the same build process as the main branch. No additional dependencies req
## 🚀 Usage Guide
### Quick Start
-Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples including:
-- Multi-node NVLink configuration
+Refer to `tests/test_hybrid_ep.py` for comprehensive usage examples including:
+- Multi-node configuration
 - Intra-node testing scenarios
+- Inter-node testing will come soon
 - Performance benchmarking setups

### Important Configuration Note
-**Current Limitation**: Due to template-based optimization, parameters in the Python test file must match those defined in `csrc/kernels/hybrid_ep_backend_configs.hpp`. After modifying the header file, recompilation and reinstallation are required. Here are Important parameter settings in `hybrid_ep_backend_configs.hpp`:
+Here are important parameter settings in `csrc/hybrid_ep/config.cuh`. You can modify these parameters via `HybridEpBuffer.init_config()` or by setting proper environment variables (see `deep_ep/hybrid_ep_buffer.py`) to achieve better performance/usability:

 - HIDDEN_DIM
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`HIDDEN_DIM`).
+Hidden size (must match model hidden dimension).

-- MAX_NUM_OF_TOKENS_PER_RANK
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`MAX_NUM_OF_TOKENS_PER_RANK`).
+- MAX_NUM_OF_TOKENS_PER_RANK
+The largest sequence length for the input of the dispatch kernel.

 - NUM_OF_EXPERTS_PER_RANK
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`NUM_LOCAL_EXPERTS`).
+Number of experts hosted by each rank.

 - NUM_OF_NODES
-Number of NVLink domains**, not the number of OS nodes / containers.
+**Number of NVLink domains**, not the number of OS nodes / containers.

 - NUM_OF_RANKS_PER_NODE
 Number of ranks within one NVLink domain.
-Must match the Python side: see `tests/test_mnnvlink_hybridep.py` (`NUM_OF_RANKS_PER_NODE` & `args.num_processes`).
-
-- USE_MNNVLINK
-`false` → single-node (pure NVLink inside one box)
@@ -181,9 +183,6 @@ Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples incl
 - NUM_OF_BLOCKS_DISPATCH_API
 Number of CTAs to launch for dispatch; controls how many SMs are used.

-- FORWARD_DISPATCH_API
-Set to `true` for forward dispatch path, `false` for backward.
-
 - NUM_OF_STAGES_G2S_COMBINE_API
 Pipeline depth for global-to-shared (G2S) in combine.
 Same shared-memory trade-off as dispatch.
@@ -195,12 +194,6 @@ Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples incl
 - NUM_OF_BLOCKS_COMBINE_API
 Number of CTAs for combine kernels.

-- BACKWARD_COMBINE_API
-Set to `true` for backward combine path, `false` for forward.
-
-**Future Enhancement**: We plan to implement Just-In-Time (JIT) compilation to eliminate this manual configuration requirement and improve developer experience.
-
-

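To make the relationship between the topology parameters above concrete, here is a minimal Python sketch. It is not part of the Hybrid-EP code base: the `HybridEpTopology` class and its methods are hypothetical names introduced only for illustration, and it assumes ranks are numbered consecutively within each NVLink domain and experts are placed contiguously on ranks. It also assumes 2-byte (BF16) elements when estimating the dispatch input size. None of these assumptions are confirmed by the README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridEpTopology:
    """Hypothetical helper; field names mirror the README parameters."""

    hidden_dim: int
    max_num_of_tokens_per_rank: int
    num_of_experts_per_rank: int
    num_of_nodes: int           # number of NVLink domains, not OS nodes
    num_of_ranks_per_node: int  # ranks inside one NVLink domain

    @property
    def world_size(self) -> int:
        # Total ranks = NVLink domains x ranks per domain.
        return self.num_of_nodes * self.num_of_ranks_per_node

    @property
    def total_experts(self) -> int:
        # Every rank hosts the same number of experts.
        return self.num_of_experts_per_rank * self.world_size

    def nvlink_domain_of(self, rank: int) -> int:
        # Assumption: ranks are numbered consecutively within a domain.
        if not 0 <= rank < self.world_size:
            raise ValueError(f"rank {rank} outside [0, {self.world_size})")
        return rank // self.num_of_ranks_per_node

    def owner_rank_of(self, expert_id: int) -> int:
        # Assumption: experts are laid out contiguously across ranks.
        if not 0 <= expert_id < self.total_experts:
            raise ValueError(f"expert {expert_id} outside [0, {self.total_experts})")
        return expert_id // self.num_of_experts_per_rank

    def dispatch_input_bytes(self, bytes_per_element: int = 2) -> int:
        # Upper bound on one rank's dispatch input: tokens x hidden x element size.
        return self.max_num_of_tokens_per_rank * self.hidden_dim * bytes_per_element


# Example values chosen for illustration only.
topo = HybridEpTopology(
    hidden_dim=7168,
    max_num_of_tokens_per_rank=4096,
    num_of_experts_per_rank=8,
    num_of_nodes=2,
    num_of_ranks_per_node=8,
)
print(topo.world_size, topo.total_experts)
print(topo.nvlink_domain_of(11), topo.owner_rank_of(37))
print(topo.dispatch_input_bytes())
```

A check like this can help catch a common mistake: setting `NUM_OF_NODES` to the number of containers rather than the number of NVLink domains changes `world_size` and every value derived from it.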
---
## 📋 Implementation Status & Roadmap
@@ -215,12 +208,9 @@ Refer to `tests/test_mnnvlink_hybridep.py` for comprehensive usage examples incl
### 🚧 Upcoming Features
 -**Low Latency Mode**: Enhanced performance for latency-critical workloads
 -**RDMA Integration**: High-performance inter-node communication
--**JIT Compilation**: Dynamic parameter configuration without recompilation

### ⚠️ Current Limitations
-- Template-based implementation requires recompilation for parameter changes
 - RDMA functionality not yet available (under final testing)
-- Configuration parameters must be manually synchronized between Python and C++ files

### 🎯 Migration Notes
This implementation maintains full backward compatibility with DeepEP. Users can seamlessly integrate Hybrid-EP into existing workflows without code modifications.