
Distributed Training Hangs at WandbPredictionProgressCallback with NCCL Timeout Error #6

LiuZH-19 opened this issue May 15, 2024 · 0 comments


Hello,
I'm experiencing an issue with distributed training. Specifically, the training process stalls in `WandbPredictionProgressCallback` during the `on_train_begin` event and then fails with an NCCL timeout. When I remove `WandbPredictionProgressCallback`, distributed training proceeds without any issues. Is the codebase designed to support distributed training, and could you please advise on how to resolve this? Thank you for your assistance!
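
For context, here is a minimal sketch of the kind of rank-aware guard I have been experimenting with. It is adapted from the wandb docs' prediction-logging callback pattern, not from this repo; `sample_dataset` and the decoding step are placeholders from my own setup. The idea is that every rank still calls `trainer.predict()` (which issues NCCL collectives), while only the main process logs to wandb:

```python
from transformers import TrainerCallback
import wandb


class GuardedPredictionCallback(TrainerCallback):
    """Sketch of a rank-aware variant of a prediction-logging callback."""

    def __init__(self, trainer, tokenizer, sample_dataset):
        self.trainer = trainer
        self.tokenizer = tokenizer
        self.sample_dataset = sample_dataset  # placeholder: a small eval subset

    def on_train_begin(self, args, state, control, **kwargs):
        # trainer.predict() runs collective ops (gathers/all-reduces), so every
        # rank must reach this call; skipping it on non-zero ranks desynchronizes
        # the process group and eventually trips the NCCL watchdog.
        predictions = self.trainer.predict(self.sample_dataset)

        # Only the main process touches wandb, to avoid duplicate logging.
        if state.is_world_process_zero:
            # Placeholder decoding: assumes predict_with_generate=True, so
            # predictions.predictions already holds generated token ids.
            decoded = self.tokenizer.batch_decode(
                predictions.predictions, skip_special_tokens=True
            )
            table = wandb.Table(columns=["prediction"])
            for text in decoded:
                table.add_data(text)
            wandb.log({"sample_predictions": table})
```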

Error Message:

```
***** Running Prediction *****
  Num examples = 4
  Batch size = 1
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800230 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800032 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 4, last enqueued NCCL work: 4, last completed NCCL work: 3.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f212d3f7897 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f212e6d21b2 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f212e6d6fd0 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f212e6d831c in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f2183bbcbf4 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7dd5 (0x7f219352edd5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2192b4eead in /lib64/libc.so.6)

[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 4, last enqueued NCCL work: 4, last completed NCCL work: 3.
[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f638b503897 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f638c7de1b2 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
```
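
For what it's worth, raising the collective timeout only delays the failure rather than fixing it; this is roughly how I set it while isolating the hang (a trimmed-down sketch of my own TrainingArguments, not the repo's config; `ddp_timeout` is in seconds, and the default of 1800 matches the Timeout(ms)=1800000 in the log above):

```python
from transformers import TrainingArguments

# Trimmed-down arguments from my own run; only ddp_timeout is relevant here.
training_args = TrainingArguments(
    output_dir="output",
    ddp_timeout=7200,  # default is 1800 s, i.e. the 30-minute watchdog timeout above
)
```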