Using zero3 on multiple nodes is slow #6889

Open
HelloWorld506 opened this issue Dec 18, 2024 · 7 comments
Labels
bug Something isn't working training

Comments

@HelloWorld506

I have multiple nodes, each with 8 A100 40G GPUs, and I want to train a 72B model.
With zero3, the 72B model is partitioned across all GPUs on all nodes. Even with NVLink, the communication latency is very high, so training is much slower than zero3 + offloading on a single node. The more nodes I add, the slower training gets, so using a single node ends up being faster.
Is there a way to make zero3 partition model parameters only within a node, so that each node holds a complete copy of the model and only gradients are synchronized between nodes?

@HelloWorld506 HelloWorld506 added bug Something isn't working training labels Dec 18, 2024
@tjruwase
Contributor

@HelloWorld506, you can try the hpz feature of ZeRO++ if it fits your scenario.
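
For readers following along, here is a minimal sketch of what enabling hpZ might look like when the DeepSpeed config is built as a Python dict. The key `zero_hpz_partition_size` is the one that appears in the config posted later in this thread, and setting it to the number of GPUs per node is the usual recommendation. The helper name `with_hpz` and the tiny base config are hypothetical, for illustration only, and are not part of DeepSpeed.

```python
import copy
import json


def with_hpz(config: dict, gpus_per_node: int) -> dict:
    """Return a copy of a ZeRO-3 config with the hpZ secondary partition enabled.

    `with_hpz` is a hypothetical helper for illustration, not a DeepSpeed API.
    """
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be >= 1")
    new_config = copy.deepcopy(config)  # leave the caller's config untouched
    zero = new_config.setdefault("zero_optimization", {})
    if zero.get("stage", 3) != 3:
        raise ValueError("hpZ builds on ZeRO stage 3")
    zero["stage"] = 3
    # Size of the secondary partition group, typically the GPUs in one node.
    zero["zero_hpz_partition_size"] = gpus_per_node
    return new_config


# Illustrative base config; real configs carry many more fields.
base = {"zero_optimization": {"stage": 3, "overlap_comm": True}}
hpz_config = with_hpz(base, gpus_per_node=8)
print(json.dumps(hpz_config, indent=2))
```

The resulting dict can be dumped to a JSON file and passed to the launcher in the same way as any other DeepSpeed config.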

@HelloWorld506
Author

it works, thank you!

@HelloWorld506
Author

@tjruwase Hello, using zero++ did speed up my training, but throughout training the loss stays at 11.9321 and grad_norm stays at 0, so training fails. What causes this, and how can I resolve it?
My DeepSpeed configuration file is as follows:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "zero_hpz_partition_size": 8,
    "zero_quantized_weights": false,
    "zero_quantized_gradients": false,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
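
One quick sanity check on the constant loss reported above: a cross-entropy loss that never moves is sometimes exactly ln(V) for a vocabulary of V tokens, which is what a model produces when it assigns equal probability to every token (for example, if the weights it computes with are all zeros). The sketch below works out the vocabulary size implied by 11.9321, which comes to roughly 152,000. Whether that matches the actual vocabulary of the model being trained is an assumption to verify; the thread does not state it.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform prediction over vocab_size classes."""
    return math.log(vocab_size)


def implied_vocab_size(loss: float) -> float:
    """Vocabulary size at which a uniform prediction would give this loss."""
    return math.exp(loss)


reported_loss = 11.9321  # value quoted in the comment above
print(f"implied vocabulary size: {implied_vocab_size(reported_loss):,.0f}")
```

If the implied size lines up with the tokenizer or embedding size, that would point toward parameters not reaching the forward pass correctly rather than toward a learning-rate or data problem. This is a diagnostic hint, not a confirmed cause.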

@HelloWorld506 HelloWorld506 reopened this Jan 6, 2025
@tjruwase
Contributor

@HelloWorld506, do you get the error if you don't use offloading?

@HelloWorld506
Author

> @HelloWorld506, do you get the error if you don't use offloading?

@tjruwase However, without offloading my 72B model runs out of memory (OOM) on a single node.
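
A rough estimate shows why an OOM is expected here. The sketch assumes mixed-precision Adam at about 16 bytes per parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for fp32 master weights and optimizer state), ignores activations and temporary buffers, and treats 40G as 40e9 bytes. These are assumptions for a back-of-the-envelope check, not measurements from this setup.

```python
import math

# Assumed per-parameter cost for mixed-precision Adam (see lead-in).
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params: float) -> float:
    """Total model-state memory in GB (1 GB = 1e9 bytes), excluding activations."""
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus_without_offload(num_params: float, gpu_gb: float) -> int:
    """Fewest GPUs whose combined memory could hold the model states alone."""
    return math.ceil(model_state_gb(num_params) / gpu_gb)


total_gb = model_state_gb(72e9)
per_gpu_gb = total_gb / 8  # ZeRO-3 partitioning over one 8-GPU node
needed = min_gpus_without_offload(72e9, gpu_gb=40)
print(f"total: {total_gb:.0f} GB, per GPU on one node: {per_gpu_gb:.0f} GB")
print(f"GPUs needed for model states alone: {needed}")
```

Under these assumptions the per-GPU share on one node is well above 40 GB, so running without offloading would need several nodes even before activations are counted.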

@tjruwase
Contributor

Got it. The reason for my question is to determine whether the problem is caused by the interaction of ZeRO++ and offloading. I don't think that combination is well tested.

Is it possible for you to use a smaller model, e.g., 10B to investigate this issue? My thinking is to compare the loss curves of

  1. ZeRO++
  2. ZeRO++ + Offload
  3. ZeRO3 + Offload
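
The three runs listed above could be generated from one shared base config so that nothing else differs between them. The following is a minimal sketch: `make_variants`, the variant names, and the base config are hypothetical, while the keys `zero_hpz_partition_size`, `offload_optimizer`, and `offload_param` are taken from the config posted earlier in this thread.

```python
import copy

HPZ_KEY = "zero_hpz_partition_size"
OFFLOAD_KEYS = ("offload_optimizer", "offload_param")


def make_variants(base: dict, gpus_per_node: int = 8) -> dict:
    """Build the three comparison configs from one base (hypothetical helper)."""
    offload = {key: {"device": "cpu", "pin_memory": True} for key in OFFLOAD_KEYS}

    def build(use_hpz: bool, use_offload: bool) -> dict:
        cfg = copy.deepcopy(base)
        zero = cfg.setdefault("zero_optimization", {})
        zero["stage"] = 3
        # Start from a clean slate so each variant differs only in these keys.
        for key in (HPZ_KEY, *OFFLOAD_KEYS):
            zero.pop(key, None)
        if use_hpz:
            zero[HPZ_KEY] = gpus_per_node
        if use_offload:
            zero.update(copy.deepcopy(offload))
        return cfg

    return {
        "zeropp": build(use_hpz=True, use_offload=False),
        "zeropp_offload": build(use_hpz=True, use_offload=True),
        "zero3_offload": build(use_hpz=False, use_offload=True),
    }


# Illustrative base; a real base would carry the batch-size and precision fields.
variants = make_variants({"bf16": {"enabled": True}})
for name, cfg in variants.items():
    print(name, sorted(cfg["zero_optimization"]))
```

Running the same data, seed, and smaller model with each variant and plotting the loss curves together should make it clear whether the flat loss appears only when hpZ is enabled, only with hpZ plus offloading, or in neither.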

@HelloWorld506
Author

> @HelloWorld506, do you get the error if you don't use offloading?

@tjruwase I tried it, but nothing changed: the loss still stayed at 11.9321 and grad_norm stayed at 0.
Only ZeRO3 + Offload works.

2 participants