
OOM when saving torch_dist checkpoint #436

Open
Cppowboy opened this issue Dec 7, 2024 · 5 comments
Labels
bug Something isn't working

Comments

Cppowboy commented Dec 7, 2024

Describe the bug

OOM occurs when saving a torch_dist checkpoint; with the zarr dist_ckpt_format, saving works fine.

Steps/Code to reproduce bug

The OOM happens when training a Qwen2.5 7B reward model with the script below.

set -x
export NCCL_DEBUG=INFO
export NCCL_SOCKET_NTHREADS=8
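# expandable_segments lets the CUDA caching allocator grow existing segments instead of fragmenting memory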
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=0.9999
export TORCH_NCCL_ASYNC_ERROR_HANDLING=3
# setup environment
# export PIP_CONFIG_FILE=examples/nlp/minicpm/pip.conf
# pip install transformers -U
# pip install datamodel_code_generator
pip install tokenizers==0.20.3 -U -i https://pypi.hs1.paratera.com/root/pypi/+simple

export GPFS="/local/apps/NeMo-Aligner"

# use minicpm megatron-lm
export PYTHONPATH=.:path_to_megatron_and_nemo
export GPUS_PER_NODE=8
export CUDA_DEVICE_MAX_CONNECTIONS=1

TRAIN_DATA_PATH="***l"
VALID_DATA_PATH="***"
export CKPT_PATH="mcore_qwen2.5_7b_instruct.nemo"
export HYDRA_FULL_ERROR=1
# export TORCH_CPP_LOG_LEVEL=INFO

LOGDIR=/data/logs/
mkdir -p ${LOGDIR}

# PyTorch Lightning init needs this environment variable
export NODE_RANK=$RANK
unset RANK
# tensor parallel speedup, recommended
export CUDA_DEVICE_MAX_CONNECTIONS=1
mkdir -p /data/checkpoints/${JOB_UID}

python -u ${GPFS}/examples/nlp/qwen/train_qwen2.5_7b_reward.py \
   trainer.num_nodes=${WORLD_SIZE} \
   trainer.devices=${GPUS_PER_NODE} \
   ++trainer.rm.max_epochs=2 \
   ++trainer.rm.save_interval=50 \
   ++trainer.rm.log_task_loss_interval=10 \
   ++model.optim.lr=1e-5 \
   ++model.optim.weight_decay=0.001 \
   ++model.optim.sched.warmup_steps=100 \
   ++model.optim.sched.min_lr=0 \
   ++model.optim.sched.constant_steps=0 \
   ++model.micro_batch_size=1 \
   ++model.global_batch_size=128 \
   ++model.encoder_seq_length=8192 \
   ++model.seq_length=8192 \
   ++model.max_position_embeddings=8192 \
   ++model.tensor_model_parallel_size=1 \
   ++model.pipeline_model_parallel_size=1 \
   ++model.rotary_base=10000 \
   ++model.use_longrope=true \
   ++model.use_flash_attention=true \
   ++model.transformer_engine=true \
   ++model.activations_checkpoint_granularity=full \
   ++model.activations_checkpoint_method=uniform \
   ++model.activations_checkpoint_num_layers=1 \
   pretrained_checkpoint.restore_from_path=$CKPT_PATH \
   ++model.tokenizer.type=/Qwen2.5-7B-Instruct \
   ++model.tokenizer.model=/Qwen2.5-7B-Instruct/tokenizer.model \
   "++model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
   "++model.data.train_valid_test_num_samples=[88000, 100, 100]" \
   ++model.data.append_eod=true \
   exp_manager.explicit_log_dir=/data/logs/tensorboard \
   +trainer.enable_progress_bar=false \
   +trainer.log_every_n_steps=10 \
   ++exp_manager.create_tensorboard_logger=true \
   ++exp_manager.create_checkpoint_callback=true \
   ++exp_manager.checkpoint_callback_params.dirpath=/data/checkpoints/${JOB_UID} \
   +exp_manager.log_step_timing=true \
   exp_manager.create_wandb_logger=false \
   exp_manager.wandb_logger_kwargs.project=reward_training \
   exp_manager.wandb_logger_kwargs.name=reward_training

Environment details

Using the NeMo docker image 24.09.

NeMo-Aligner version: 0.5.0
NeMo version: 2.0.0
Megatron-LM version: 0.9.0

Cppowboy added the bug label on Dec 7, 2024
@better629

@Cppowboy at which stage does the OOM happen: at startup, during training, or when saving?
I started training a 7B reward model and hit OOM right at Training steps: 0%|, inside optimizer.step. I tried adjusting the batch and sequence-length parameters, but it still OOMs.

Cppowboy (Author) commented Dec 10, 2024

@Cppowboy at which stage does the OOM happen: at startup, during training, or when saving? I started training a 7B reward model and hit OOM right at Training steps: 0%|, inside optimizer.step. I tried adjusting the batch and sequence-length parameters, but it still OOMs.

I encountered OOM at different stages of training. Initially, the OOM occurred while untarring the .nemo file at startup. To work around this, I extracted the .nemo file manually and loaded the model state directly from the checkpoint directory.
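
Roughly, that extraction step looks like the following (paths are illustrative, and exactly how the extracted directory is wired back into pretrained_checkpoint.restore_from_path depends on the NeMo version):

# .nemo files are tar archives, so they can be unpacked ahead of time instead of at startup
EXTRACTED_DIR=/data/mcore_qwen2.5_7b_instruct_extracted   # illustrative path
mkdir -p ${EXTRACTED_DIR}
tar -xf mcore_qwen2.5_7b_instruct.nemo -C ${EXTRACTED_DIR}
# point the launch script at the extracted directory instead of the .nemo archive
export CKPT_PATH=${EXTRACTED_DIR}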

While this resolved the initial OOM, I then hit OOM errors during checkpoint saving. The fix was to switch the checkpoint format by adding model.dist_ckpt_format=zarr to the config.yaml file.
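
For reference, the same setting can also be passed as a Hydra override appended to the launch command above (a sketch based on the key named here; the exact key path may differ across NeMo versions):

# equivalent to adding "dist_ckpt_format: zarr" under model: in config.yaml
++model.dist_ckpt_format=zarr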

This suggests there might be memory management inefficiencies with the torch_dist checkpoint format.

@lethean1

I ran into a similar problem. May I ask about your hardware configuration?

@better629

I ran into a similar problem. May I ask about your hardware configuration?

@lethean1 Multiple A800 80GB cards, training a 7B reward model.

@Liang-Qiu

A similar issue happened to me when training a 12B Mistral reward model: an NCCL timeout on rank 6 (8xA100 80G machine) when it tried to save the torch_dist checkpoint.
