grpo with multiple GPUs got stuck #478

Open
SuMeng123 opened this issue Mar 5, 2025 · 1 comment
Comments


SuMeng123 commented Mar 5, 2025

GRPO with 8 GPUs: the first epoch completes successfully, but training gets stuck during the second epoch.

[rank0]:[E306 00:30:11.701177518 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL's watchdog got stuck for 480 seconds without making progress in monitoring enqueued collectives. This typically indicates a NCCL/CUDA API (e.g., CudaEventDestroy) hang blocking the watchdog, and could be triggered by another thread holding the GIL inside a CUDA api (for example, CudaEventDestroy), or other deadlock-prone behaviors.If you suspect the watchdog is not actually stuck and a longer timeout would help, you can either increase the timeout (TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC) to a larger value or disable the heartbeat monitor (TORCH_NCCL_ENABLE_MONITORING=0).If either of aforementioned helps, feel free to file an issue to PyTorch about the short timeout or false positive abort; otherwise, please attempt to debug the hang.
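A minimal sketch of the workaround suggested by the watchdog message above. The environment variable names come from the log itself; the timeout value is only an illustrative example, and these must be set before torch.distributed / the trainer initializes NCCL (e.g. at the very top of the training script or in the launcher environment):

```python
import os

# Workaround sketch based on the NCCL watchdog message above (values are illustrative).
# Set these before any process group / trainer initialization.
os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"] = "1800"  # raise the 480 s heartbeat timeout
# os.environ["TORCH_NCCL_ENABLE_MONITORING"] = "0"       # or disable the heartbeat monitor entirely
```

Note that this only silences or delays the watchdog; if the hang is a real collective deadlock during the second epoch, the underlying cause still needs to be debugged.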

@tastelikefeet

You can try SWIFT, which is based on the awesome work of TRL and already supports multi-GPU training and tensor parallelism.
