Using zero3 on multiple nodes is slow #6889
I have multiple nodes, each with 8 40GB A100 GPUs, and I want to train a 72B model.

When using ZeRO-3, the 72B model is partitioned across all GPUs of all nodes. Even with NVLink, the communication latency is still very high, so training is slow, much slower than ZeRO-3 with offloading on a single node. The problem is that the more nodes there are, the slower training gets; it is better to use only a single node.

Is there a way to make ZeRO-3 allocate model parameters only within a node, so that each node stores a complete model and only gradients are synchronized between nodes, to speed up training?

Comments
@HelloWorld506, you can try the hpz feature of ZeRO++ if it fits your scenario.
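For readers unfamiliar with hpz: it is the hierarchical partitioning feature of ZeRO++, which keeps a secondary copy of the parameter shards inside each node so that the forward/backward all-gathers stay on intra-node links, while gradients are still reduced across nodes. Below is a minimal sketch of what the corresponding DeepSpeed config might look like, assuming 8 GPUs per node; everything outside `zero_optimization` is a placeholder, not taken from this thread.

```python
# Sketch only: a DeepSpeed config dict enabling ZeRO-3 with ZeRO++ hierarchical
# partitioning (hpz). Values outside zero_optimization are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,     # placeholder
    "bf16": {"enabled": True},               # placeholder precision choice
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # hpz: size of the secondary parameter partition. Setting it to the
        # number of GPUs per node (8 in this thread) keeps the parameter
        # all-gather for forward/backward inside each node.
        "zero_hpz_partition_size": 8,
        # Other ZeRO++ features (quantized weights/gradients) are left off
        # here because the thread only discusses hpz.
        "zero_quantized_weights": False,
        "zero_quantized_gradients": False,
    },
}

# The dict can be passed to deepspeed.initialize(model=..., config=ds_config)
# or written out as JSON and referenced from the launcher.
```

This is roughly the setup the issue asks for (a full parameter copy per node, with gradient synchronization across nodes), at the cost of extra GPU memory for the secondary partition.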
It works, thank you!

@tjruwase Hello, using ZeRO++ did indeed speed up my training, but during training the loss stayed at 11.9321 and the grad_norm stayed at 0, so training failed. What is the reason for this, and how can I resolve it?

@HelloWorld506, do you get the error if you don't use offloading?

@tjruwase However, if my 72B model does not use offloading on a single node, it will go OOM.
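For context, the offloading discussed above is ZeRO-3 CPU offloading of optimizer state and parameters. A minimal sketch of what that part of the config typically looks like; all values are placeholders, not taken from this thread.

```python
# Sketch only: ZeRO-3 with CPU offloading, the single-node setup discussed above.
ds_offload_config = {
    "train_micro_batch_size_per_gpu": 1,     # placeholder
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",    # keep optimizer state in host memory
            "pin_memory": True,
        },
        "offload_param": {
            "device": "cpu",    # keep partitioned parameters in host memory
            "pin_memory": True,
        },
    },
}
```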
Got it. The reason for my question is to determine whether the problem is caused by the interaction of ZeRO++ and offloading; I don't think that combination is well tested. Is it possible for you to use a smaller model, e.g. 10B, to investigate this issue? My thinking is to compare the loss curves of the smaller model with and without that combination.
@tjruwase I tried it, but it did not work: nothing changed, the loss still stayed at 11.9321 and the grad_norm remained at 0.