
Distributed Training Hangs at WandbPredictionProgressCallback with NCCL Timeout Error #6

LiuZH-19 opened this issue May 15, 2024 · 0 comments


Hello,
I'm experiencing an issue with distributed training. Specifically, the training process stalls in `WandbPredictionProgressCallback` during the `on_train_begin` event and then fails with an NCCL timeout. When I remove `WandbPredictionProgressCallback`, distributed training proceeds without any issues. Is the codebase designed to support distributed training, and could you please advise on how to resolve this? Thank you for your assistance!
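
For context, here is a minimal sketch of the kind of rank-aware guard I have been experimenting with. It is adapted from the wandb docs' prediction-logging callback pattern, not from this repo; `sample_dataset` and the decoding step are placeholders from my own setup. The idea is that every rank still calls `trainer.predict()` (which issues NCCL collectives), while only the main process logs to wandb:

```python
from transformers import TrainerCallback
import wandb


class GuardedPredictionCallback(TrainerCallback):
    """Sketch of a rank-aware variant of a prediction-logging callback."""

    def __init__(self, trainer, tokenizer, sample_dataset):
        self.trainer = trainer
        self.tokenizer = tokenizer
        self.sample_dataset = sample_dataset  # placeholder: a small eval subset

    def on_train_begin(self, args, state, control, **kwargs):
        # trainer.predict() runs collective ops (gathers/all-reduces), so every
        # rank must reach this call; skipping it on non-zero ranks desynchronizes
        # the process group and eventually trips the NCCL watchdog.
        predictions = self.trainer.predict(self.sample_dataset)

        # Only the main process touches wandb, to avoid duplicate logging.
        if state.is_world_process_zero:
            # Placeholder decoding: assumes predict_with_generate=True, so
            # predictions.predictions already holds generated token ids.
            decoded = self.tokenizer.batch_decode(
                predictions.predictions, skip_special_tokens=True
            )
            table = wandb.Table(columns=["prediction"])
            for text in decoded:
                table.add_data(text)
            wandb.log({"sample_predictions": table})
```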

Error Message:

```
***** Running Prediction *****
  Num examples = 4
  Batch size = 1
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800230 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800032 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 4, last enqueued NCCL work: 4, last completed NCCL work: 3.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f212d3f7897 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f212e6d21b2 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f212e6d6fd0 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f212e6d831c in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f2183bbcbf4 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7dd5 (0x7f219352edd5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2192b4eead in /lib64/libc.so.6)

[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 4, last enqueued NCCL work: 4, last completed NCCL work: 3.
[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f638b503897 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f638c7de1b2 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
```
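
For what it's worth, raising the collective timeout only delays the failure rather than fixing it; this is roughly how I set it while isolating the hang (a trimmed-down sketch of my own TrainingArguments, not the repo's config; `ddp_timeout` is in seconds, and the default of 1800 matches the Timeout(ms)=1800000 in the log above):

```python
from transformers import TrainingArguments

# Trimmed-down arguments from my own run; only ddp_timeout is relevant here.
training_args = TrainingArguments(
    output_dir="output",
    ddp_timeout=7200,  # default is 1800 s, i.e. the 30-minute watchdog timeout above
)
```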