[QUESTION] Why is the initialization of the router and experts different in the MoE part? #1302

mxymxy77 · 2024-11-27T01:31:16Z

Why is the initialization of the router and experts different in the MoE part?
The router's weights are initialized in FP32, while the expert weights are initialized in param_dtype (BF16 under mixed-precision training).
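To make the precision gap in the question concrete, here is a minimal pure-Python sketch of what storing freshly initialized weights in BF16 does compared with FP32. It is not Megatron-LM code: `to_bf16`, `to_fp32`, `init_weights`, the standard deviation of 0.02, and the tensor size are all hypothetical and chosen only for illustration. BF16 is emulated by rounding the FP32 bit pattern to its top 16 bits. The sketch only shows the size of the rounding error; it does not by itself explain the design choice being asked about.

```python
import random
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Emulate BF16 storage: keep the top 16 bits of the FP32 pattern,
    using round-to-nearest-even on the discarded lower 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding_bias) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def init_weights(n: int, std: float, seed: int, cast) -> list:
    """Draw n values from N(0, std^2) and store them through `cast`.
    Hypothetical helper, assumed here only for illustration."""
    rng = random.Random(seed)
    return [cast(rng.gauss(0.0, std)) for _ in range(n)]


# Same random draws, stored at two different precisions.
router_fp32 = init_weights(1024, 0.02, seed=0, cast=to_fp32)
expert_bf16 = init_weights(1024, 0.02, seed=0, cast=to_bf16)

# Relative error introduced purely by BF16 storage.
max_rel_err = max(
    abs(a - b) / abs(a) for a, b in zip(router_fp32, expert_bf16) if a != 0.0
)
print(f"max relative rounding error from BF16 storage: {max_rel_err:.5f}")
```

BF16 carries 8 significant bits against FP32's 24, so the relative rounding error of each stored value is bounded by 2^-8 (about 0.4%), which is what the printed maximum stays under.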
Reference link:

lk137095576 · 2024-12-12T02:53:14Z

Same question.

