When I use two GPUs to run main.py to train a model on the sthv2 dataset, I get the error below:
Traceback (most recent call last):
File "main.py", line 378, in <module>
main()
File "main.py", line 194, in main
train(train_loader, model, criterion, optimizer, epoch, log_training, tf_writer)
File "main.py", line 244, in train
output = model(input_var)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 88, in replicate
param_copies = _broadcast_coalesced_reshape(params, devices, detach)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
tensor_copies = Broadcast.apply(devices, *tensors)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 21, in forward
outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
File "/home/simon/anaconda3/envs/tsm/lib/python3.8/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
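The traceback shows the failure inside the parameter broadcast that `DataParallel` performs when replicating the model, so it may be reproducible outside main.py. The following is a minimal sketch, not a fix: it sets the `NCCL_DEBUG` environment variable (a documented NCCL setting) so NCCL prints more detail about the underlying system error, then attempts one small broadcast from GPU 0 to GPUs 0 and 1. The helper names `enable_nccl_debug` and `try_broadcast` are hypothetical, and it assumes the two GPUs are visible as devices 0 and 1.

```python
import os


def enable_nccl_debug(env=os.environ):
    """Ask NCCL for verbose logs. Call this before the first NCCL operation."""
    # setdefault keeps any value already exported in the shell.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


def try_broadcast():
    """Attempt a small GPU-to-GPU broadcast, similar to what replicate() does."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available() or torch.cuda.device_count() < 2:
        return "fewer than two CUDA devices visible"
    import torch.cuda.comm as comm

    src = torch.ones(4, device="cuda:0")
    # Assumption: devices 0 and 1 are the two GPUs used for training.
    copies = comm.broadcast(src, devices=[0, 1])
    return [f"{c.device.type}:{c.device.index}" for c in copies]


if __name__ == "__main__":
    enable_nccl_debug()
    print(try_broadcast())
```

If this small script fails with the same error, the problem is likely in the NCCL or system setup rather than in main.py, and the extra log lines should indicate which system call failed.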