OOM when saving torch_dist checkpoint #436
Comments
@Cppowboy At which stage did the OOM happen: at the start of training, during training, or when saving after training?
I encountered OOM issues at different stages of training. Initially, OOM occurred when untarring the .nemo file at startup. To work around this, I extracted the .nemo file manually and loaded the model state directly from the checkpoint directory. While this resolved the initial OOM issue, I then faced OOM errors during checkpoint saving. The fix was to change the checkpoint format by adding model.dist_ckpt_format=zarr to the config.yaml file. This suggests there might be memory management inefficiencies in the torch_dist checkpoint format.
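For reference, a minimal sketch of the config change described above, assuming the training config.yaml follows the usual NeMo layout with a top-level model section (the exact nesting is an assumption; the key itself, model.dist_ckpt_format, is the one named in the comment):

```yaml
# Sketch of the reported workaround: switch the distributed checkpoint format
# from torch_dist (the format that OOMs in this issue) to zarr.
# The "model:" nesting is assumed to match the NeMo config layout.
model:
  dist_ckpt_format: zarr   # torch_dist is the format that OOMed during save
```

The same setting can presumably also be passed as a Hydra-style override on the training command line (model.dist_ckpt_format=zarr) instead of editing config.yaml.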
I ran into a similar problem. May I ask about your hardware configuration?
@lethean1 Multiple A800 80GB cards, training a 7B reward model.
A similar issue happened to me when training a 12B Mistral reward model: an NCCL timeout on rank 6 (an 8xA100 80GB machine) when it tried to save the torch_dist checkpoint.
Describe the bug
OOM occurs when saving a torch_dist checkpoint; with the zarr dist_ckpt_format, saving works fine.
Steps/Code to reproduce bug
The OOM happens when training a Qwen2.5 7B reward model.
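For context, a hypothetical fragment of the reward-model training config that reflects this setup; the pretrained_checkpoint section and the placeholder path are assumptions based on the usual NeMo-Aligner config layout and are not taken from this issue:

```yaml
# Hypothetical repro config fragment (key names assumed, path is a placeholder).
pretrained_checkpoint:
  restore_from_path: /path/to/qwen2.5-7b-base.nemo   # base model to start reward-model training from
model:
  dist_ckpt_format: torch_dist   # format that triggers the OOM during checkpoint save
  # dist_ckpt_format: zarr       # reported workaround: saving completes without OOM
```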
Environment details
If an NVIDIA Docker image is used, you don't need to specify these.
Using the NeMo Docker container 24.09.
Nemo-Aligner version: 0.5.0
NeMo version: 2.0.0
Megatron-LM version: 0.9.0