PPO Training Hangs at Step 0 when use_remove_padding #387

maksimstw · 2025-02-26T03:06:48Z

When training the Qwen 2.5 7B Math model using the example script below, the training process consistently hangs at step 0, with GPU utilization dropping to 0%. This issue occurs when using 8 A100 GPUs on a single node. However, if use_remove_padding is set to False, the training proceeds without any problems. What might be the issue?

set -x

gsm8k_train_path=$HOME/data/gsm8k/train.parquet
gsm8k_test_path=$HOME/data/gsm8k/test.parquet
math_train_path=$HOME/data/math/train.parquet
math_test_path=$HOME/data/math/test.parquet

train_files="['$gsm8k_train_path', '$math_train_path']"
test_files="['$gsm8k_test_path', '$math_test_path']"

python3 -m verl.trainer.main_ppo \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.train_batch_size=1024 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-32B-Instruct \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=Qwen/Qwen2.5-32B-Instruct \
    critic.model.enable_gradient_checkpointing=False \
    critic.ppo_micro_batch_size_per_gpu=8 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.kl_ctrl.kl_coef=0.0001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='verl_example' \
    trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=-1 \
    trainer.test_freq=10 \
    trainer.total_epochs=15 $@


eric-haibin-lin · 2025-02-27T21:40:45Z

Could you check with py-spy dump --pid xxx, or run with a breakpoint and ray debug (see the FAQ page), to see where the program hangs?
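To make the py-spy suggestion concrete, here is a minimal sketch that dumps the Python stack of each PID it is given. The helper names (find_pids, dump_stacks) and the PY_SPY_BIN variable are hypothetical, introduced only for illustration. It assumes py-spy is installed and that the current user is allowed to attach to the target processes (on Linux this often needs root or ptrace permission).

```shell
#!/usr/bin/env bash
# Sketch only: helper names and PY_SPY_BIN are illustrative, not part of verl or Ray.

# Path to the py-spy executable; override if it is installed elsewhere.
PY_SPY_BIN="${PY_SPY_BIN:-py-spy}"

# Print the PIDs whose full command line matches the pattern, one per line.
# Prints nothing (and still succeeds) when no process matches.
find_pids() {
    pgrep -f "$1" || true
}

# Run `py-spy dump --pid <pid>` for every PID passed as an argument,
# printing a separator before each dump so the output stays readable.
dump_stacks() {
    local pid
    for pid in "$@"; do
        echo "===== PID ${pid} ====="
        "$PY_SPY_BIN" dump --pid "$pid" || echo "could not dump PID ${pid}" >&2
    done
}
```

With these functions sourced, running something like dump_stacks $(find_pids 'ray::') on the affected node while the job is stuck should print one stack per worker. The 'ray::' pattern is an assumption about how the worker processes are named and may need adjusting. Comparing the dumps across ranks usually shows whether every worker is waiting in the same call or one rank has diverged.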
