Hello,
I'm experiencing an issue with distributed training. Specifically, the training process hangs in `WandbPredictionProgressCallback` during the `on_train_begin` event and then fails with an NCCL collective timeout. When I remove the `WandbPredictionProgressCallback`, distributed training runs without any issues. Is your codebase designed to support distributed training? Could you please advise on how to resolve this issue? Thank you for your assistance!
Error Message:
***** Running Prediction *****
  Num examples = 4
  Batch size = 1
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
[rank1]:[E ProcessGroupNCCL.cpp:563] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:563] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:563] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800230 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:563] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800032 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 4, last enqueued NCCL work: 4, last completed NCCL work: 3.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800045 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f212d3f7897 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f212e6d21b2 in /mnt/petrelfs/liuzihan/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7f212e6d6fd0 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f212e6d831c in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdbbf4 (0x7f2183bbcbf4 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x7dd5 (0x7f219352edd5 in /lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2192b4eead in /lib64/libc.so.6)
[rank6]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 6] Timeout at NCCL work: 4, last enqueued NCCL work: 4, last completed NCCL work: 3.
[rank6]:[E ProcessGroupNCCL.cpp:577] [Rank 6] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:583] [Rank 6] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 6] Process group watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800091 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f638b503897 in /mnt/petrelfs/xxx/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f638c7de1b2 in /mnt/petrelfs/liuzihan/anaconda3/envs/dreambooth/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
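For reference, below is a minimal sketch of a rank-aware variant of such a prediction-logging callback that may avoid the hang. It assumes the callback subclasses transformers.TrainerCallback and keeps a reference to the Trainer; the class name and dataset argument are illustrative, not code from this repository. The key point is that trainer.predict() issues NCCL collectives (gather/all-reduce), so every rank must call it, while the wandb.log call itself should be restricted to the main process.

```python
import wandb
from transformers import TrainerCallback


class GuardedPredictionCallback(TrainerCallback):
    """Illustrative sketch: run prediction on all ranks, log to W&B on rank 0 only."""

    def __init__(self, trainer, sample_dataset):
        self.trainer = trainer
        self.sample_dataset = sample_dataset

    def on_train_begin(self, args, state, control, **kwargs):
        # trainer.predict() performs NCCL collectives under the hood, so every
        # rank has to reach this call; if only rank 0 runs the prediction, the
        # remaining ranks block until the 30-minute NCCL watchdog fires.
        output = self.trainer.predict(self.sample_dataset)
        # Only the main process should talk to W&B.
        if state.is_world_process_zero:
            wandb.log({f"predict/{k}": v for k, v in output.metrics.items()})
```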