ADLR/megatron-lm!2091 - Don't pass device_id to torch.distributed.init_process_group, it causes hangs

Co-authored-by: Szymon Migacz <[email protected]>
2 people authored and ko3n1g committed Sep 12, 2024
1 parent 8fc7553 commit dcc6634
Showing 1 changed file with 0 additions and 2 deletions.
2 changes: 0 additions & 2 deletions megatron/training/initialize.py
@@ -254,8 +254,6 @@ def _initialize_distributed(get_embedding_ranks, get_position_embedding_ranks):
             'rank': args.rank,
             'timeout': timedelta(minutes=args.distributed_timeout_minutes),
         }
-        if packaging.version.Version(torch.__version__) >= packaging.version.Version("2.3.0"):
-            init_process_group_kwargs['device_id'] = device_id
 
         torch.distributed.init_process_group(**init_process_group_kwargs)
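
For illustration only, the effect of this change can be sketched as a small helper that builds the keyword arguments and leaves out `device_id` unless the caller explicitly opts in. This is not code from the repository: the helper name `build_init_process_group_kwargs` and the `pass_device_id` flag are hypothetical, and the sketch avoids importing torch so it runs anywhere.

```python
from datetime import timedelta


def build_init_process_group_kwargs(backend, world_size, rank, timeout_minutes,
                                    device_id=None, pass_device_id=False):
    """Hypothetical helper: assemble kwargs for torch.distributed.init_process_group.

    After this commit, device_id is never passed. The opt-in flag below is an
    assumption added for illustration, not part of the commit.
    """
    kwargs = {
        'backend': backend,
        'world_size': world_size,
        'rank': rank,
        'timeout': timedelta(minutes=timeout_minutes),
    }
    # Only forward device_id when explicitly requested; the commit message
    # reports that passing it causes hangs.
    if pass_device_id and device_id is not None:
        kwargs['device_id'] = device_id
    return kwargs


# Default behaviour mirrors the commit: no device_id in the kwargs.
default_kwargs = build_init_process_group_kwargs('nccl', world_size=8, rank=0,
                                                 timeout_minutes=10)
# In real training code this dict would be unpacked as
# torch.distributed.init_process_group(**default_kwargs).
```

Keeping the decision in one place like this would make it easy to re-enable `device_id` later if the hang is resolved upstream, without touching the call site.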
