
OOM when saving torch_dist checkpoint #436

Open
Cppowboy opened this issue Dec 7, 2024 · 5 comments
Labels
bug Something isn't working

Comments

Cppowboy commented Dec 7, 2024

Describe the bug

OOM occurs when saving a torch_dist checkpoint; with the zarr dist_ckpt_format, saving works fine.

Steps/Code to reproduce bug

The OOM happens when training a Qwen2.5 7B reward model with the script below.

set -x
export NCCL_DEBUG=INFO
export NCCL_SOCKET_NTHREADS=8
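# expandable_segments lets the CUDA caching allocator grow existing segments instead of fragmenting memory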
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_MANAGED_FORCE_DEVICE_ALLOC=0.9999
export TORCH_NCCL_ASYNC_ERROR_HANDLING=3
# setup environment
# export PIP_CONFIG_FILE=examples/nlp/minicpm/pip.conf
# pip install transformers -U
# pip install datamodel_code_generator
pip install tokenizers==0.20.3 -U -i https://pypi.hs1.paratera.com/root/pypi/+simple

export GPFS="/local/apps/NeMo-Aligner"

# use minicpm megatron-lm
export PYTHONPATH=.:path_to_megatron_and_nemo
export GPUS_PER_NODE=8
export CUDA_DEVICE_MAX_CONNECTIONS=1

TRAIN_DATA_PATH="***l"
VALID_DATA_PATH="***"
export CKPT_PATH="mcore_qwen2.5_7b_instruct.nemo"
export HYDRA_FULL_ERROR=1
# export TORCH_CPP_LOG_LEVEL=INFO

LOGDIR=/data/logs/
mkdir -p ${LOGDIR}

# PyTorch Lightning init needs this environment variable
export NODE_RANK=$RANK
unset RANK
# tensor parallel speedup, recommended
export CUDA_DEVICE_MAX_CONNECTIONS=1
mkdir -p /data/checkpoints/${JOB_UID}

python -u ${GPFS}/examples/nlp/qwen/train_qwen2.5_7b_reward.py \
   trainer.num_nodes=${WORLD_SIZE} \
   trainer.devices=${GPUS_PER_NODE} \
   ++trainer.rm.max_epochs=2 \
   ++trainer.rm.save_interval=50 \
   ++trainer.rm.log_task_loss_interval=10 \
   ++model.optim.lr=1e-5 \
   ++model.optim.weight_decay=0.001 \
   ++model.optim.sched.warmup_steps=100 \
   ++model.optim.sched.min_lr=0 \
   ++model.optim.sched.constant_steps=0 \
   ++model.micro_batch_size=1 \
   ++model.global_batch_size=128 \
   ++model.encoder_seq_length=8192 \
   ++model.seq_length=8192 \
   ++model.max_position_embeddings=8192 \
   ++model.tensor_model_parallel_size=1 \
   ++model.pipeline_model_parallel_size=1 \
   ++model.rotary_base=10000 \
   ++model.use_longrope=true \
   ++model.use_flash_attention=true \
   ++model.transformer_engine=true \
   ++model.activations_checkpoint_granularity=full \
   ++model.activations_checkpoint_method=uniform \
   ++model.activations_checkpoint_num_layers=1 \
   pretrained_checkpoint.restore_from_path=$CKPT_PATH \
   ++model.tokenizer.type=/Qwen2.5-7B-Instruct \
   ++model.tokenizer.model=/Qwen2.5-7B-Instruct/tokenizer.model \
   "++model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${VALID_DATA_PATH}]}" \
   "++model.data.train_valid_test_num_samples=[88000, 100, 100]" \
   ++model.data.append_eod=true \
   exp_manager.explicit_log_dir=/data/logs/tensorboard \
   +trainer.enable_progress_bar=false \
   +trainer.log_every_n_steps=10 \
   ++exp_manager.create_tensorboard_logger=true \
   ++exp_manager.create_checkpoint_callback=true \
   ++exp_manager.checkpoint_callback_params.dirpath=/data/checkpoints/${JOB_UID} \
   +exp_manager.log_step_timing=true \
   exp_manager.create_wandb_logger=false \
   exp_manager.wandb_logger_kwargs.project=reward_training \
   exp_manager.wandb_logger_kwargs.name=reward_training

Environment details

Using the NeMo docker image 24.09.

NeMo-Aligner version: 0.5.0
NeMo version: 2.0.0
Megatron-LM version: 0.9.0

Cppowboy added the bug label on Dec 7, 2024
@better629

@Cppowboy at which stage does the OOM happen: at startup, during training, or when saving?
I started training a 7B reward model and hit OOM right at Training steps: 0%|, inside optimizer.step. I tried adjusting the batch and sequence-length parameters, but it still OOMs.

Cppowboy (Author) commented Dec 10, 2024

@Cppowboy at which stage does the OOM happen: at startup, during training, or when saving? I started training a 7B reward model and hit OOM right at Training steps: 0%|, inside optimizer.step. I tried adjusting the batch and sequence-length parameters, but it still OOMs.

I encountered OOM at different stages of training. Initially, the OOM occurred while untarring the .nemo file at startup. To work around this, I extracted the .nemo file manually and loaded the model state directly from the checkpoint directory.
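
Roughly, that extraction step looks like the following (paths are illustrative, and exactly how the extracted directory is wired back into pretrained_checkpoint.restore_from_path depends on the NeMo version):

# .nemo files are tar archives, so they can be unpacked ahead of time instead of at startup
EXTRACTED_DIR=/data/mcore_qwen2.5_7b_instruct_extracted   # illustrative path
mkdir -p ${EXTRACTED_DIR}
tar -xf mcore_qwen2.5_7b_instruct.nemo -C ${EXTRACTED_DIR}
# point the launch script at the extracted directory instead of the .nemo archive
export CKPT_PATH=${EXTRACTED_DIR}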

While this resolved the initial OOM, I then hit OOM errors during checkpoint saving. The fix was to switch the checkpoint format by adding model.dist_ckpt_format=zarr to the config.yaml file.
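
For reference, the same setting can also be passed as a Hydra override appended to the launch command above (a sketch based on the key named here; the exact key path may differ across NeMo versions):

# equivalent to adding "dist_ckpt_format: zarr" under model: in config.yaml
++model.dist_ckpt_format=zarr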

This suggests there might be memory management inefficiencies with the torch_dist checkpoint format.

@lethean1

I ran into a similar problem. May I ask about your hardware configuration?

@better629

I ran into a similar problem. May I ask about your hardware configuration?

@lethean1 Multiple A800 80GB cards, training a 7B reward model.

@Liang-Qiu

A similar issue happened to me when training a 12B Mistral reward model: an NCCL timeout on rank 6 (8xA100 80G machine) when it tried to save the torch_dist checkpoint.
