RuntimeError: NCCL Error 2: unhandled system error #198

waduhekx · 2021-08-23T07:49:43Z

When I use two GPUs to run main.py to train a model on the sthv2 dataset, I get the error below:

Traceback (most recent call last):
  File "main.py", line 378, in <module>
    main()
  File "main.py", line 194, in main
    train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
  File "main.py", line 244, in train
    output = model(input_var)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error

waduhekx · 2021-08-23T07:50:27Z

How can I solve this problem, please?

Luffy03 · 2022-02-10T09:08:04Z

Have you solved the problem? Would you please share your solution? Thanks.
