Fixes in Imagenet training script (#1224)
Fixes
* `ngpus_per_node` count corrected (now 0 when CUDA is unavailable)
* assertion added on `args.dist_backend` to catch unsupported nccl configurations
Jaiaid authored Jan 30, 2024
1 parent 76cd9d0 commit a848347
Showing 1 changed file with 5 additions and 1 deletion.
6 changes: 5 additions & 1 deletion imagenet/main.py
imagenet/main.py

@@ -106,8 +106,12 @@ def main():

     if torch.cuda.is_available():
         ngpus_per_node = torch.cuda.device_count()
+        assert not (ngpus_per_node == 1 and args.dist_backend == "nccl"),\
+        "nccl backend requires GPU count>1, see https://github.com/NVIDIA/nccl/issues/103 perhaps use 'gloo'"
     else:
-        ngpus_per_node = 1
+        ngpus_per_node = 0
+        assert args.dist_backend != "nccl",\
+        "nccl backend does not work without GPU, see https://pytorch.org/docs/stable/distributed.html"
     if args.multiprocessing_distributed:
         # Since we have ngpus_per_node processes per node, the total world_size
         # needs to be adjusted accordingly
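The check the commit adds can be sketched as a standalone function, so the two failure modes are easy to see: nccl with exactly one GPU (which can hang, per the linked NCCL issue) and nccl with no GPU at all. The helper name `check_dist_backend` and its parameters are illustrative, not part of the script; in `main.py` the same logic runs inline against `torch.cuda` and `args`.

```python
def check_dist_backend(cuda_available: bool, device_count: int, dist_backend: str) -> int:
    """Mirror the backend sanity checks added to imagenet/main.py.

    Returns ngpus_per_node, raising AssertionError for the two
    unsupported combinations the commit guards against.
    """
    if cuda_available:
        ngpus_per_node = device_count
        # Single-GPU nccl can hang; see https://github.com/NVIDIA/nccl/issues/103
        assert not (ngpus_per_node == 1 and dist_backend == "nccl"), \
            "nccl backend requires GPU count>1, perhaps use 'gloo'"
    else:
        # No CUDA devices: count is 0, and nccl cannot work at all
        ngpus_per_node = 0
        assert dist_backend != "nccl", \
            "nccl backend does not work without GPU"
    return ngpus_per_node
```

Note the second fix also changes `ngpus_per_node` from 1 to 0 in the CPU-only branch, so downstream `world_size` arithmetic no longer pretends a GPU exists.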
