Open
Description
To reduce inter-node communication volume, we would like the ranks within each Pipeline Parallel (PP) process group to be the ones placed on different nodes, so that the traffic crossing node boundaries is PP traffic. Perhaps self.grid in ProcessGroupManager should be:
self.grid = torch.arange(self.world_size).view(pp_size, dp_size, cp_size, tp_size) # PP * DP * CP * TP grid
instead of
# https://github.com/huggingface/picotron/blob/df3ae8a5f0cce213816b6b287b7febc75ab98a53/picotron/process_group_manager.py#L13
self.grid = torch.arange(self.world_size).view(dp_size, pp_size, cp_size, tp_size) # DP * PP * CP * TP grid
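To make the difference between the two layouts concrete, here is a minimal sketch in plain Python that lists the PP groups each grid order produces. It mimics the row-major indexing of torch.arange(world_size).view(...) without importing torch. The helper names (pp_groups, nodes_spanned), the example sizes, and the value of GPUS_PER_NODE are hypothetical. It also assumes that global ranks are assigned to nodes contiguously (node = rank // GPUS_PER_NODE), which may not match every launcher.

```python
from itertools import product


def pp_groups(order, sizes):
    """List the PP groups (lists of global ranks) of a row-major rank grid.

    order: axis names from outermost to innermost, e.g. ("dp", "pp", "cp", "tp")
    sizes: dict mapping axis name -> size
    """
    dims = [sizes[axis] for axis in order]

    # Row-major strides, as in torch.arange(n).view(*dims).
    strides = []
    stride = 1
    for dim in reversed(dims):
        strides.append(stride)
        stride *= dim
    strides.reverse()

    pp_axis = order.index("pp")
    other_axes = [i for i in range(len(dims)) if i != pp_axis]

    groups = []
    # Fix every non-PP coordinate, then walk along the PP axis.
    for coords in product(*(range(dims[i]) for i in other_axes)):
        base = sum(c * strides[i] for c, i in zip(coords, other_axes))
        groups.append([base + p * strides[pp_axis] for p in range(dims[pp_axis])])
    return groups


def nodes_spanned(group, gpus_per_node):
    """Count distinct nodes in a group, assuming node = rank // gpus_per_node."""
    return len({rank // gpus_per_node for rank in group})


# Hypothetical example: 16 ranks on 2 nodes of 8 GPUs each.
sizes = {"dp": 2, "pp": 2, "cp": 1, "tp": 4}
GPUS_PER_NODE = 8  # assumption, not taken from the issue

current = pp_groups(("dp", "pp", "cp", "tp"), sizes)   # DP * PP * CP * TP grid
proposed = pp_groups(("pp", "dp", "cp", "tp"), sizes)  # PP * DP * CP * TP grid

print("current PP groups: ", current)
print("proposed PP groups:", proposed)
print("nodes per PP group (current): ", [nodes_spanned(g, GPUS_PER_NODE) for g in current])
print("nodes per PP group (proposed):", [nodes_spanned(g, GPUS_PER_NODE) for g in proposed])
```

Under these assumptions, the DP * PP * CP * TP order keeps each PP group inside a single node, which leaves the DP groups to cross nodes. The PP * DP * CP * TP order gives the PP axis the largest stride, so each PP group spans both nodes and the DP groups stay within one.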