[QUESTION] Why does BF16 training keep param gradients in FP32 from the start, while FP16 training keeps them in FP16 and only converts them to FP32 at the parameter update?
#1335 · Open · renyinCheng001 opened this issue on Dec 24, 2024 · 0 comments
I saw that Megatron-LM supports both FP16 and BF16 mixed-precision training, and I found that these two configurations use different parameter and gradient data types:
|          | FP16 Training | BF16 Training |
|----------|---------------|---------------|
| Weight   | FP16          | BF16          |
| Gradient | FP16          | FP32          |
- When training with FP16, an FP32 copy of the gradient has to be made before the parameters are updated.
- When training with BF16, no additional copy is required because the gradient is already FP32 (see the sketch below).
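To make sure I am describing the two paths correctly, here is a minimal sketch of what I think happens at the optimizer step in each mode. This is my own illustration, not Megatron-LM's actual code; the function names (`fp16_step`, `bf16_step`) and the buffer arguments are hypothetical.

```python
import torch

def fp16_step(model_fp16, master_params_fp32, optimizer, loss_scale: float):
    """FP16 path: backward produces FP16 grads; they are unscaled and copied
    into FP32 right before the optimizer updates the FP32 master weights."""
    for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
        if p16.grad is not None:
            # FP16 grad -> FP32 copy (the extra copy mentioned above)
            p32.grad = p16.grad.detach().float() / loss_scale
    optimizer.step()  # the update itself runs entirely in FP32
    with torch.no_grad():
        for p16, p32 in zip(model_fp16.parameters(), master_params_fp32):
            p16.copy_(p32)  # write updated weights back as FP16

def bf16_step(model_bf16, fp32_grad_buffers, master_params_fp32, optimizer):
    """BF16 path: grads are accumulated directly into preallocated FP32
    buffers during backward, so no cast/copy is needed at update time."""
    for p32, g32 in zip(master_params_fp32, fp32_grad_buffers):
        p32.grad = g32  # gradients are already FP32
    optimizer.step()
    with torch.no_grad():
        for pbf, p32 in zip(model_bf16.parameters(), master_params_fp32):
            pbf.copy_(p32)  # write updated weights back as BF16
```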
So why does BF16 keep the param gradients in FP32 from the start, while FP16 keeps them in FP16 and only switches to FP32 when the params are updated?
Any reply would be helpful to me, thanks!