Commit

Merge branch 'no_dist_device_id' into 'main'
Don't pass device_id to torch.distributed.init_process_group; it causes hangs

See merge request ADLR/megatron-lm!2091
ko3n1g committed Sep 12, 2024
2 parents 028b777 + dcc6634 commit 76f9f48
Showing 1 changed file with 0 additions and 2 deletions.
2 changes: 0 additions & 2 deletions megatron/training/initialize.py
@@ -254,8 +254,6 @@ def _initialize_distributed(get_embedding_ranks, get_position_embedding_ranks):
             'rank': args.rank,
             'timeout': timedelta(minutes=args.distributed_timeout_minutes),
         }
-        if packaging.version.Version(torch.__version__) >= packaging.version.Version("2.3.0"):
-            init_process_group_kwargs['device_id'] = device_id

         torch.distributed.init_process_group(**init_process_group_kwargs)
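
For readers who want to see the post-change behavior in isolation, here is a minimal sketch of building the keyword arguments without `device_id`. The helper name `build_init_process_group_kwargs` and the sample values are hypothetical and not part of Megatron-LM; the `backend` and `world_size` keys are assumed, since the hunk above shows only `rank` and `timeout`.

```python
from datetime import timedelta


def build_init_process_group_kwargs(backend, world_size, rank, timeout_minutes):
    """Hypothetical helper (not in Megatron-LM): collect keyword arguments
    for torch.distributed.init_process_group, deliberately omitting device_id."""
    return {
        'backend': backend,        # assumed key, not visible in the hunk
        'world_size': world_size,  # assumed key, not visible in the hunk
        'rank': rank,
        'timeout': timedelta(minutes=timeout_minutes),
    }


# Sample values chosen only for illustration.
kwargs = build_init_process_group_kwargs(
    'nccl', world_size=8, rank=0, timeout_minutes=10
)

# In a real launch (with the rendezvous environment configured), one would
# then call:
#   import torch
#   torch.distributed.init_process_group(**kwargs)
```

Because the dictionary never gains a `device_id` entry, the call behaves the same on every PyTorch version, which matches the intent stated in the commit message.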
