Why is the initialization of the router and experts different in the MoE part?
The router's weight parameters are initialized in FP32, while the expert weights are initialized in param_dtype (BF16, assuming mixed-precision training).
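To make the difference concrete, here is a minimal pure-Python sketch of what storing the same initial values in FP32 versus BF16 does to them. It is only an emulation of the two formats, not the framework's initialization code: the helper names `to_fp32` and `to_bf16`, the normal init, and `std = 0.02` are all assumptions made for illustration.

```python
import random
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a Python float to the nearest BF16 value (round-to-nearest-even)."""
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # BF16 keeps the upper 16 bits of the FP32 encoding. Adding 0x7FFF plus
    # the lowest kept bit rounds the dropped 16 bits to nearest, ties to even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


rng = random.Random(0)
std = 0.02  # hypothetical init standard deviation, not taken from the issue
samples = [rng.gauss(0.0, std) for _ in range(1000)]

router_weights = [to_fp32(w) for w in samples]  # router: stored in FP32
expert_weights = [to_bf16(w) for w in samples]  # experts: stored in BF16

# Largest relative gap between the FP32 and BF16 copy of the same sample.
max_rel_err = max(
    abs(e - r) / abs(r)
    for r, e in zip(router_weights, expert_weights)
    if r != 0.0
)
print(f"max relative difference FP32 vs BF16: {max_rel_err:.6f}")
```

BF16 keeps 8 significant bits against 24 for FP32, so the relative gap printed here is bounded by 2^-8 for normal values. Whether that gap is the reason the router is kept in FP32 is exactly what this issue is asking.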
Reference link: