Using zero3 on multiple nodes is slow #6889
I have multiple nodes, each with 8 40GB A100 GPUs, and I want to train a 72B model.

When using ZeRO-3, the 72B model is partitioned across all GPUs of all nodes. Even with NVLink, the communication latency is still very high, so training is slow, much slower than ZeRO-3 with offloading on a single node. The problem is that the more nodes there are, the slower training gets; it is better to use only a single node.

Is there a way to make ZeRO-3 allocate model parameters only within a node, so that each node stores a complete model and only gradients are synchronized between nodes, to speed up training?

Comments
@HelloWorld506, you can try the hpz feature of ZeRO++ if it fits your scenario.
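For readers unfamiliar with hpz: it is the hierarchical partitioning feature of ZeRO++, which keeps a secondary copy of the parameter shards inside each node so that the forward/backward all-gathers stay on intra-node links, while gradients are still reduced across nodes. Below is a minimal sketch of what the corresponding DeepSpeed config might look like, assuming 8 GPUs per node; everything outside `zero_optimization` is a placeholder, not taken from this thread.

```python
# Sketch only: a DeepSpeed config dict enabling ZeRO-3 with ZeRO++ hierarchical
# partitioning (hpz). Values outside zero_optimization are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,     # placeholder
    "bf16": {"enabled": True},               # placeholder precision choice
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # hpz: size of the secondary parameter partition. Setting it to the
        # number of GPUs per node (8 in this thread) keeps the parameter
        # all-gather for forward/backward inside each node.
        "zero_hpz_partition_size": 8,
        # Other ZeRO++ features (quantized weights/gradients) are left off
        # here because the thread only discusses hpz.
        "zero_quantized_weights": False,
        "zero_quantized_gradients": False,
    },
}

# The dict can be passed to deepspeed.initialize(model=..., config=ds_config)
# or written out as JSON and referenced from the launcher.
```

This is roughly the setup the issue asks for (a full parameter copy per node, with gradient synchronization across nodes), at the cost of extra GPU memory for the secondary partition.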
It works, thank you!

@tjruwase Hello, using ZeRO++ did indeed speed up my training, but during training the loss stayed at 11.9321 and the grad_norm stayed at 0, so training failed. What is the reason for this, and how can I resolve it?

@HelloWorld506, do you get the error if you don't use offloading?

@tjruwase However, if my 72B model does not use offloading on a single node, it will go OOM.
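For context, the offloading discussed above is ZeRO-3 CPU offloading of optimizer state and parameters. A minimal sketch of what that part of the config typically looks like; all values are placeholders, not taken from this thread.

```python
# Sketch only: ZeRO-3 with CPU offloading, the single-node setup discussed above.
ds_offload_config = {
    "train_micro_batch_size_per_gpu": 1,     # placeholder
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",    # keep optimizer state in host memory
            "pin_memory": True,
        },
        "offload_param": {
            "device": "cpu",    # keep partitioned parameters in host memory
            "pin_memory": True,
        },
    },
}
```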
Got it. The reason for my question is to determine whether the problem is caused by the interaction of ZeRO++ and offloading; I don't think that combination is well tested. Is it possible for you to use a smaller model, e.g. 10B, to investigate this issue? My thinking is to compare the loss curves of the smaller model with and without that combination.
@tjruwase I tried it, but it did not work: nothing changed, the loss still stayed at 11.9321 and the grad_norm remained at 0.