[Unified MoE Layer]: Add MoE Layer with DeepEP EP Support; Add Qwen3MoE EP #2702
Conversation
Co-authored-by: llbdyiu66 <[email protected]>
Codecov Report
❌ Patch coverage is 26.51%, which is below the target coverage of 80.00%. The patch status check has failed; you can increase the patch coverage or adjust the target coverage.

```
@@            Coverage Diff             @@
##           develop    #2702    +/-   ##
==========================================
  Coverage         ?   28.74%
==========================================
  Files            ?      343
  Lines            ?    57154
  Branches         ?        0
==========================================
  Hits             ?    16427
  Misses           ?    40727
  Partials         ?        0
```
Pull Request Overview
This PR adds a modular MoE (Mixture of Experts) implementation to PaddleFormers, integrating it with the Qwen3-MoE model. The changes introduce a flexible, extensible MoE framework with customizable gates, experts, communication strategies, and loss functions.
- Adds a new modular MoE layer system supporting Expert Parallel (EP) mode
- Integrates the new MoE implementation with Qwen3-MoE model architecture
- Implements configurable loss functions and combiners for MoE training (a brief usage sketch follows this list)
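As a rough usage sketch of the framework described above (only QuickAccessMoEFactory.create_from_model_name appears in the diff; the forward-call signature and the shape of the inputs are assumptions):

```python
from paddleformers.nn.moe_deepep.moe_factory import QuickAccessMoEFactory

def build_and_run_moe(config, hidden_states):
    # config: the model's pretrained config (e.g. Qwen3-MoE);
    # hidden_states: [num_tokens, hidden_size] activations entering the FFN slot.
    moe_layer = QuickAccessMoEFactory.create_from_model_name(config)
    # gate/route -> dispatch (EP communication) -> experts -> combine
    return moe_layer(hidden_states)
```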
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 54 comments.
| File | Description |
|---|---|
| paddleformers/transformers/qwen3_moe/modeling.py | Integrates QuickAccessMoEFactory and updates tensor parallel mappings to support fused attention and EP mode |
| paddleformers/nn/moe_deepep/moe_loss_instance.py | Defines global loss registry instance and custom loss functions |
| paddleformers/nn/moe_deepep/moe_loss.py | Implements flexible loss system with multiple loss types and combiners |
| paddleformers/nn/moe_deepep/moe_gate.py | Implements standard and flexible MoE gate mechanisms with routing strategies |
| paddleformers/nn/moe_deepep/moe_factory.py | Factory pattern for creating MoE layers from model configs |
| paddleformers/nn/moe_deepep/moe_expert.py | Expert network implementations for MoE layers |
| paddleformers/nn/moe_deepep/moe_config.py | Configuration dictionary for different MoE model types |
| paddleformers/nn/moe_deepep/moe_communication.py | Communication strategies for Expert Parallel training |
| paddleformers/nn/moe_deepep/modular_moe_layer.py | Main modular MoE layer implementation |
| paddleformers/nn/moe_deepep/__init__.py | Module initialization and lazy imports |
```python
from ...nn.linear import Linear as GeneralLinear
from ...nn.lm_head import LMHead as GeneralLMHead
from ...nn.mlp import MLP
from ...nn.moe_deepep.moe_factory import QuickAccessMoEFactory
```
```python
def _probs_drop_policy(
    self,
    scores: torch.Tensor,
    capacity: int,
) -> torch.Tensor:
```
Copilot AI · Nov 1, 2025
Incorrect type annotation. Should use paddle.Tensor instead of torch.Tensor to match the PaddlePaddle framework being used.
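The corrected annotations would simply use the Paddle types; a sketch of the signature only, with the body elided as in the diff:

```python
import paddle

# Method of the gate class, shown standalone; only the annotations change.
def _probs_drop_policy(
    self,
    scores: paddle.Tensor,   # [num_tokens, num_total_experts]
    capacity: int,
) -> paddle.Tensor:
    ...
```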
```python
    2. Its score for that expert is among the top 'capacity' scores for that expert.
    Args:
        scores (torch.Tensor): [num_tokens, num_total_experts].
```
Copilot AI · Nov 1, 2025
Documentation refers to torch.Tensor but should reference paddle.Tensor to match the framework being used.
```python
        (Not strictly used here, but good practice to include).
    Returns:
        torch.Tensor: [num_tokens, num_total_experts] boolean mask (converted to float).
```
Copilot AI · Nov 1, 2025
Return type documentation refers to torch.Tensor but should reference paddle.Tensor to match the framework being used.
```python
# --- Step 1: Find the 'capacity' best tokens for *each* expert ---

# Use torch.topk along dim=0 (the token dimension) to find the indices
```
Copilot AI · Nov 1, 2025
Comment incorrectly mentions torch.topk but the actual implementation uses paddle.topk.
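For reference, a capacity-style drop policy built on paddle.topk over the token axis could look like the following sketch. It assumes scores is [num_tokens, num_total_experts] and that only each expert's top 'capacity' tokens are kept; it is not the PR's actual implementation:

```python
import paddle

def probs_drop_policy(scores: paddle.Tensor, capacity: int) -> paddle.Tensor:
    """Return a float mask keeping, per expert, only its top-`capacity` tokens."""
    num_tokens = scores.shape[0]
    k = min(capacity, num_tokens)
    # paddle.topk along axis=0 (the token dimension): for each expert column,
    # the indices of its k best tokens, shape [k, num_total_experts].
    _, top_token_idx = paddle.topk(scores, k=k, axis=0)
    # Scatter 1.0 into the selected (token, expert) positions.
    ones = paddle.ones(top_token_idx.shape, dtype=scores.dtype)
    mask = paddle.put_along_axis(paddle.zeros_like(scores), top_token_idx, ones, axis=0)
    return mask
```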
```python
def __call__(self, losses: Dict[str, paddle.Tensor], configs: Dict[str, LossConfig]) -> paddle.Tensor:
    """Combine multiple loss functions."""
    ...
```
Copilot AI · Nov 1, 2025
This statement has no effect.
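For illustration, a combiner that actually combines something could be a weighted sum over the registered losses. This is a sketch, and the assumption that LossConfig carries a weight attribute is not confirmed by the diff:

```python
from typing import Dict

import paddle

def combine_losses(losses: Dict[str, paddle.Tensor], configs: Dict) -> paddle.Tensor:
    """Weighted sum of the named auxiliary losses (e.g. load-balancing loss, z-loss)."""
    total = None
    for name, loss in losses.items():
        weight = getattr(configs[name], "weight", 1.0)  # assumed LossConfig attribute
        term = weight * loss
        total = term if total is None else total + term
    return total if total is not None else paddle.zeros([1])
```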
```diff
     config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0
 ):
-    self.mlp = Qwen3MoeSparseMoeBlock(config)
+    self.mlp = QuickAccessMoEFactory.create_from_model_name(config)
```
Don't delete the original Qwen3MoeSparseMoeBlock; choose which class to use based on whether EP is enabled.
Done: added a check so the new layer is only used when `expert_parallel_degree > 1`.
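The agreed-upon behaviour would look roughly like the following inside the decoder layer's constructor. This is a sketch; how expert_parallel_degree is obtained (config field versus distributed topology) is an assumption:

```python
# Inside the Qwen3-MoE decoder layer's __init__ (names taken from the diff):
if config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0:
    if expert_parallel_degree > 1:
        # EP enabled: use the unified modular MoE layer
        self.mlp = QuickAccessMoEFactory.create_from_model_name(config)
    else:
        # EP disabled: keep the original sparse block
        self.mlp = Qwen3MoeSparseMoeBlock(config)
```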
```python
LAYER_ROWWISE = ["self_attn.o_proj.weight"]

FUSE_LAYER_COLWISE = [
```
Why add this self_attn.qkv_proj.weight?
The related changes have all been removed.
| "gate_proj.weight", | ||
| ] | ||
|
|
||
| FUSE_EXPERT_LAYER_COLWISE = [ |
Same as above.
| "self_attn.v_proj.bias", | ||
| ] | ||
|
|
||
| FUSE_BIAS_KEYS = [ |
Same as above.
```python
if expert_parallel_degree <= 1:
    # # if disable_ffn_model_parallel is True, disable expert layer tp plan
    # if not config.disable_ffn_model_parallel:
    if not config.fuse_attention_ffn:
```
There is no fuse_attention_ffn switch, so why add these?
The related changes have all been removed.
```python
elif self.expert_parallel_degree > 1 and self.tensor_parallel_degree >= 1:
    routed_expert_pretrained_config.tensor_parallel_degree = 1

# self.experts = nn.LayerList(
```
If the code is not needed, delete it outright.
```python
self.experts = nn.LayerList(
    [
        MLP(config=routed_expert_pretrained_config, intermediate_size=pretrained_config.moe_intermediate_size)
```
Not every model's MLP can be covered by this shared one; it would be more flexible to pass the model's MLP class in from the model at initialization.
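One way to implement this suggestion is to inject the expert class at construction time instead of hard-coding the shared MLP. The class name ModularMoELayer is taken from the file list above, but this constructor signature is an assumption, not the PR's actual API:

```python
from paddle import nn
from paddleformers.nn.mlp import MLP  # default expert, overridable per model

class ModularMoELayer(nn.Layer):
    def __init__(self, pretrained_config, expert_cls=MLP, expert_kwargs=None):
        super().__init__()
        expert_kwargs = expert_kwargs or {}
        # Each model passes its own expert/MLP class, so a change to the
        # model's FFN definition never requires touching the MoE module.
        self.experts = nn.LayerList(
            [
                expert_cls(
                    config=pretrained_config,
                    intermediate_size=pretrained_config.moe_intermediate_size,
                    **expert_kwargs,
                )
                for _ in range(pretrained_config.num_experts)
            ]
        )
```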
```python
else:
    self.communication = DeepEPMoECommunication()

# self.is_dummy_moe = False if self.expert_parallel_degree > 1 else True
```
Delete the commented-out code.
Please apply this throughout the whole PR.
```python
from paddle import nn
from paddle.incubate.nn.functional import swiglu as fused_swiglu

from ...nn.mlp import MLP
```
When a single-card model is implemented, its expert is implemented along with it, so it is better to reuse the expert from the model network directly to cut development effort and later maintenance cost (for example, if the network changes, you shouldn't have to update its expert there and then update the expert here a second time). Keeping one standard implementation is fine, but the recommended usage should still be the MLP the model network itself uses.
These have all been changed to reuse the expert from the model network now.
```python
MOE_CONFIG = {
    "qwen3_moe": {
        "gate_activation": "softmax",
```
Modified as requested.
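For context, moe_config.py keeps a plain per-model dict. A fuller illustrative entry might be shaped as follows; only gate_activation appears in the diff, and every other key is an assumption about what the registry exposes:

```python
# Illustrative sketch only: keys other than "gate_activation" are assumptions.
MOE_CONFIG = {
    "qwen3_moe": {
        "gate_activation": "softmax",    # from the diff
        "top_k": 8,                      # assumed: experts routed per token
        "norm_topk_prob": True,          # assumed: renormalize routing weights
        "aux_loss": "load_balancing",    # assumed: auxiliary loss to register
    },
}
```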
A complete document introducing the FlashMoe module is needed.
Co-authored-by: Copilot <[email protected]>
```python
if self.custom_communication is not None:
    self.communication = self.custom_communication
else:
    if os.getenv("USE_DEEPEP", "1") == "0":
```
Don't use an environment variable to select the communication strategy; control it through the config (or something similar) instead.
Added an `ep_communication_type` yaml config option.
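Wired into the communication setup shown earlier, config-driven selection could look roughly like this sketch; the accepted option values and the non-DeepEP fallback class are assumptions:

```python
# Sketch: choose the EP communication strategy from config instead of an env var.
if self.custom_communication is not None:
    self.communication = self.custom_communication
else:
    comm_type = getattr(config, "ep_communication_type", "deepep")  # from the yaml config
    if comm_type == "deepep":
        self.communication = DeepEPMoECommunication()
    else:
        self.communication = AllToAllMoECommunication()  # assumed alternative strategy
```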
Pushed an updated version; re-tested Qwen3MoE EP and it passes.
LGTM

PR types
New features
PR changes
Models
Description
This PR adds a generic MoE Layer module with a modular design.
The MoE layer in each model can be replaced by this generic module; all that is required is to configure, in moe_config.py, the information that model uses, such as its activation function and its TopK routing method.
Testing
Verified SFT of the standard Qwen3MoE model with the following configurations:
Remaining work