
[QUESTION] Why does BF16 training use FP32 for parameter gradients from the start, while FP16 training keeps FP16 gradients and only converts them to FP32 at the parameter update? #1335

Open
renyinCheng001 opened this issue Dec 24, 2024 · 0 comments

Comments

@renyinCheng001

Hi all,

I saw that Megatron-LM supports both FP16 and BF16 mixed-precision training, and I noticed that the two configurations use different dtypes for the parameters and the gradients:

|          | FP16 Training | BF16 Training |
| -------- | ------------- | ------------- |
| Weight   | FP16          | BF16          |
| Gradient | FP16          | FP32          |
  • When training with FP16, an FP32 copy of the gradients must be made before the parameters are updated
  • When training with BF16, no additional copy is needed because the gradients are already accumulated in FP32 (see the sketch below)
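
To make sure I understand the difference, here is a minimal sketch of the two update paths in plain PyTorch. This is not Megatron-LM's actual optimizer code; names such as `fp32_main_param`, `fp32_grad_buffer`, and `loss_scale` are my own illustrative assumptions.

```python
import torch

def fp16_style_step(lr=1e-3, loss_scale=1024.0):
    # FP16 path: weight and gradient are both FP16 during forward/backward.
    # An FP32 master weight is kept, and the FP16 gradient is copied/unscaled
    # to FP32 only at update time.
    fp16_param = torch.randn(4, dtype=torch.float16, requires_grad=True)
    fp32_main_param = fp16_param.detach().float()         # FP32 master copy of the weight

    loss = (fp16_param.float() ** 2).sum() * loss_scale   # loss scaling to avoid FP16 gradient underflow
    loss.backward()                                        # fp16_param.grad is FP16

    fp32_grad = fp16_param.grad.float() / loss_scale       # the extra copy: FP16 grad -> unscaled FP32 grad
    fp32_main_param -= lr * fp32_grad                      # optimizer step in FP32
    with torch.no_grad():
        fp16_param.copy_(fp32_main_param.half())           # cast updated weight back to FP16
    return fp16_param


def bf16_style_step(lr=1e-3):
    # BF16 path: the weight is BF16, but gradients are accumulated into an FP32
    # buffer (in Megatron-LM this accumulation happens during the backward pass,
    # as I understand it), so the optimizer already sees FP32 gradients and no
    # extra copy/unscale step is needed.
    bf16_param = torch.randn(4, dtype=torch.bfloat16, requires_grad=True)
    fp32_main_param = bf16_param.detach().float()          # FP32 master copy of the weight
    fp32_grad_buffer = torch.zeros_like(fp32_main_param)   # FP32 gradient accumulator

    loss = (bf16_param.float() ** 2).sum()                 # no loss scaling: BF16 shares FP32's exponent range
    loss.backward()
    fp32_grad_buffer += bf16_param.grad.float()            # accumulate the gradient in FP32

    fp32_main_param -= lr * fp32_grad_buffer               # optimizer step in FP32
    with torch.no_grad():
        bf16_param.copy_(fp32_main_param.bfloat16())       # cast updated weight back to BF16
    return bf16_param
```

My current guess is that BF16 keeps FP32's exponent range but has fewer mantissa bits than FP16, so summing many gradient contributions directly in BF16 would lose precision, and accumulating in FP32 avoids that while also removing the need for loss scaling. But I would like to confirm the actual reasoning.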

So, why does BF16 training keep parameter gradients in FP32 from the start, while FP16 training keeps them in FP16 and only converts them to FP32 at the parameter update?

Any reply would be helpful to me, thanks!
