[QUESTION] Why does BF16 training keep param gradients in FP32 from the start, while FP16 training keeps them in FP16 and only converts them to FP32 at the parameter update?
#1335 · Open · renyinCheng001 opened this issue on Dec 24, 2024 · 0 comments
I saw that Megatron-LM supports both FP16 and BF16 mixed-precision training, and I found that these two configurations use different parameter and gradient data types:
|          | FP16 Training | BF16 Training |
|----------|---------------|---------------|
| Weight   | FP16          | BF16          |
| Gradient | FP16          | FP32          |
- When training with FP16, an FP32 copy of the gradient has to be made before the parameters are updated.
- When training with BF16, no additional copy is required because the gradient is already FP32 (see the sketch below).
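To make sure I am describing the two paths correctly, here is a minimal sketch of what I think happens at the optimizer step in each mode. This is my own illustration, not Megatron-LM's actual code; the function names (`fp16_step`, `bf16_step`) and the buffer arguments are hypothetical.

```python
import torch

def fp16_step(model_fp16, master_params_fp32, optimizer, loss_scale: float):
    """FP16 path: backward produces FP16 grads; they are unscaled and copied
    into FP32 right before the optimizer updates the FP32 master weights."""
    for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
        if p16.grad is not None:
            # FP16 grad -> FP32 copy (the extra copy mentioned above)
            p32.grad = p16.grad.detach().float() / loss_scale
    optimizer.step()  # the update itself runs entirely in FP32
    with torch.no_grad():
        for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
            p16.copy_(p32)  # write updated weights back as FP16

def bf16_step(model_bf16, fp32_grad_buffers, master_params_fp32, optimizer):
    """BF16 path: grads are accumulated directly into preallocated FP32
    buffers during backward, so no cast/copy is needed at update time."""
    for p32, g32 in zip(master_params_fp32, fp32_grad_buffers):
        p32.grad = g32  # gradients are already FP32
    optimizer.step()
    with torch.no_grad():
        for pbf, p32 in zip(model_bf16.parameters(), master_params_fp32):
            pbf.copy_(p32)  # write updated weights back as BF16
```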
So why does BF16 keep the param gradients in FP32 from the start, while FP16 keeps them in FP16 and only switches to FP32 when the params are updated?
Any reply would be helpful to me, thanks!