grpo with multiple GPUs got stuck #478

Open
SuMeng123 opened this issue Mar 5, 2025 · 1 comment
Comments


SuMeng123 commented Mar 5, 2025

GRPO with 8 GPUs: the first epoch completes successfully, but training gets stuck during the second epoch.

[rank0]:[E306 00:30:11.701177518 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
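A minimal sketch of the workaround suggested by the watchdog message above. The environment variable names come from the log itself; the timeout value is only an illustrative example, and these must be set before torch.distributed / the trainer initializes NCCL (e.g. at the very top of the training script or in the launcher environment):

```python
import os

# Workaround sketch based on the NCCL watchdog message above (values are illustrative).
# Set these before any process group / trainer initialization.
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"  # raise the 480 s heartbeat timeout
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"       # or disable the heartbeat monitor entirely
```

Note that this only silences or delays the watchdog; if the hang is a real collective deadlock during the second epoch, the underlying cause still needs to be debugged.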

@tastelikefeet

You can try SWIFT, which is based on the awesome work of TRL and already supports multi-GPU training and tensor parallelism.
