Add docker image for BERT e2e training task #454

Merged · 19 commits · Aug 1, 2024

Conversation

@mattcjo (Contributor) commented Jun 26, 2024

Issue #, if available:

Description of changes:
A distributed training script (e2e2/test/images/bert-training/train.py) has been added, along with its dependencies (e2e2/test/images/bert-training/requirements.txt), via a new Dockerfile (e2e2/test/images/bert-training/Dockerfile.bert-training). Building the Dockerfile produces an image that runs a distributed BERT training job.

The docker image was tested on a p3.16xlarge instance using the AMI ami-05e885690ca33b527. The image starts one training process per GPU, each isolated on its own device, and the processes then communicate with one another to consolidate the weights from each process. The test runs for a single epoch.
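
For context, a minimal sketch of the pattern the image implements — one process per GPU joining an NCCL process group, with DDP keeping the replicas in sync. Names and the rank-to-GPU mapping here are illustrative assumptions; see train.py for the actual implementation:

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# each mpirun-launched process reads its rank from Open MPI's env vars,
# claims one GPU, and joins the process group (MASTER_ADDR/MASTER_PORT
# are picked up from the environment by the default init method)
rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
world_size = int(os.environ.get("OMPI_COMM_WORLD_SIZE", "1"))
torch.cuda.set_device(rank % torch.cuda.device_count())
dist.init_process_group("nccl", rank=rank, world_size=world_size)
# DDP all-reduces gradients across processes, consolidating the weights
model = DDP(torch.nn.Linear(128, 2).cuda())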

The test results show that the docker image starts up and executes the distributed BERT training job as expected:

export MASTER_ADDR='localhost'
export MASTER_PORT='12355'

docker run --gpus all --rm -e MASTER_ADDR -e MASTER_PORT aws-bert-mpi-training:latest mpirun -np 8 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x MASTER_ADDR -x MASTER_PORT --allow-run-as-root python train.py 

==========
== CUDA ==
==========

CUDA Version 12.5.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:12355 (errno: 99 - Cannot assign requested address).
[W socket.cpp:697] [c10d] The client socket has failed to connect to [localhost]:12355 (errno: 99 - Cannot assign requested address).
Process 2 initialized, using GPU 2
Process 5 initialized, using GPU 5
Process 6 initialized, using GPU 6
Process 3 initialized, using GPU 3
Process 4 initialized, using GPU 4
Process 0 initialized, using GPU 0
Process 7 initialized, using GPU 7
Process 1 initialized, using GPU 1
b391fedc46b4:32:32 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:32:32 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:32:32 [0] NCCL INFO cudaDriverVersion 12050
NCCL version 2.20.5+cuda12.4
b391fedc46b4:36:36 [4] NCCL INFO cudaDriverVersion 12050
b391fedc46b4:36:36 [4] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:38:38 [6] NCCL INFO cudaDriverVersion 12050
b391fedc46b4:36:36 [4] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:35:35 [3] NCCL INFO cudaDriverVersion 12050
b391fedc46b4:38:38 [6] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:37:37 [5] NCCL INFO cudaDriverVersion 12050
b391fedc46b4:38:38 [6] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:37:37 [5] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:35:35 [3] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:35:35 [3] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:37:37 [5] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:34:34 [2] NCCL INFO cudaDriverVersion 12050
b391fedc46b4:34:34 [2] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:34:34 [2] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:39:39 [7] NCCL INFO cudaDriverVersion 12050
b391fedc46b4:39:39 [7] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:39:39 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:32:75 [0] NCCL INFO NET/IB : No device found.
b391fedc46b4:32:75 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:32:75 [0] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:32:75 [0] NCCL INFO Using network Socket
b391fedc46b4:33:33 [1] NCCL INFO cudaDriverVersion 12050
b391fedc46b4:33:33 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
b391fedc46b4:33:33 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
b391fedc46b4:36:76 [4] NCCL INFO NET/IB : No device found.
b391fedc46b4:36:76 [4] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:36:76 [4] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:36:76 [4] NCCL INFO Using network Socket
b391fedc46b4:35:78 [3] NCCL INFO NET/IB : No device found.
b391fedc46b4:35:78 [3] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:35:78 [3] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:35:78 [3] NCCL INFO Using network Socket
b391fedc46b4:38:77 [6] NCCL INFO NET/IB : No device found.
b391fedc46b4:38:77 [6] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:38:77 [6] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:38:77 [6] NCCL INFO Using network Socket
b391fedc46b4:37:79 [5] NCCL INFO NET/IB : No device found.
b391fedc46b4:37:79 [5] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:37:79 [5] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:37:79 [5] NCCL INFO Using network Socket
b391fedc46b4:34:80 [2] NCCL INFO NET/IB : No device found.
b391fedc46b4:34:80 [2] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:34:80 [2] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:34:80 [2] NCCL INFO Using network Socket
b391fedc46b4:39:81 [7] NCCL INFO NET/IB : No device found.
b391fedc46b4:39:81 [7] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:39:81 [7] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:39:81 [7] NCCL INFO Using network Socket
b391fedc46b4:33:82 [1] NCCL INFO NET/IB : No device found.
b391fedc46b4:33:82 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
b391fedc46b4:33:82 [1] NCCL INFO Using non-device net plugin version 0
b391fedc46b4:33:82 [1] NCCL INFO Using network Socket
b391fedc46b4:34:80 [2] NCCL INFO comm 0x8a3b540 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 190 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:37:79 [5] NCCL INFO comm 0x9c86040 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 1c0 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:36:76 [4] NCCL INFO comm 0x8c08d40 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 1b0 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:39:81 [7] NCCL INFO comm 0x147271f0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId 1e0 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:38:77 [6] NCCL INFO comm 0x95cdea0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId 1d0 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:35:78 [3] NCCL INFO comm 0x9574d80 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 1a0 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:32:75 [0] NCCL INFO comm 0x1ddf1700 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 170 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:33:82 [1] NCCL INFO comm 0xa0b0200 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 180 commId 0xdd45beef26d12c93 - Init START
b391fedc46b4:32:75 [0] NCCL INFO NVLS multicast support is not available on dev 0
b391fedc46b4:37:79 [5] NCCL INFO NVLS multicast support is not available on dev 5
b391fedc46b4:36:76 [4] NCCL INFO NVLS multicast support is not available on dev 4
b391fedc46b4:33:82 [1] NCCL INFO NVLS multicast support is not available on dev 1
b391fedc46b4:39:81 [7] NCCL INFO NVLS multicast support is not available on dev 7
b391fedc46b4:38:77 [6] NCCL INFO NVLS multicast support is not available on dev 6
b391fedc46b4:35:78 [3] NCCL INFO NVLS multicast support is not available on dev 3
b391fedc46b4:34:80 [2] NCCL INFO NVLS multicast support is not available on dev 2
b391fedc46b4:32:75 [0] NCCL INFO comm 0x1ddf1700 rank 0 nRanks 8 nNodes 1 localRanks 8 localRank 0 MNNVL 0
b391fedc46b4:32:75 [0] NCCL INFO Channel 00/12 :    0   3   2   1   5   6   7   4
b391fedc46b4:32:75 [0] NCCL INFO Channel 01/12 :    0   3   2   1   5   6   7   4
b391fedc46b4:32:75 [0] NCCL INFO Channel 02/12 :    0   4   7   6   5   1   2   3
b391fedc46b4:32:75 [0] NCCL INFO Channel 03/12 :    0   4   7   6   5   1   2   3
b391fedc46b4:32:75 [0] NCCL INFO Channel 04/12 :    0   1   3   7   5   4   6   2
b391fedc46b4:37:79 [5] NCCL INFO comm 0x9c86040 rank 5 nRanks 8 nNodes 1 localRanks 8 localRank 5 MNNVL 0
b391fedc46b4:37:79 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1 [2] 1/-1/-1->5->6 [3] 1/-1/-1->5->6 [4] 4/-1/-1->5->7 [5] 7/-1/-1->5->4 [6] 6/-1/-1->5->1 [7] 6/-1/-1->5->1 [8] 1/-1/-1->5->6 [9] 1/-1/-1->5->6 [10] 4/-1/-1->5->7 [11] 7/-1/-1->5->4
b391fedc46b4:37:79 [5] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:33:82 [1] NCCL INFO comm 0xa0b0200 rank 1 nRanks 8 nNodes 1 localRanks 8 localRank 1 MNNVL 0
b391fedc46b4:33:82 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2 [2] 2/-1/-1->1->5 [3] 2/-1/-1->1->5 [4] 3/-1/-1->1->0 [5] -1/-1/-1->1->3 [6] 5/-1/-1->1->2 [7] 5/-1/-1->1->2 [8] 2/-1/-1->1->5 [9] 2/-1/-1->1->5 [10] 3/-1/-1->1->0 [11] -1/-1/-1->1->3
b391fedc46b4:33:82 [1] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:39:81 [7] NCCL INFO comm 0x147271f0 rank 7 nRanks 8 nNodes 1 localRanks 8 localRank 7 MNNVL 0
b391fedc46b4:39:81 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6 [2] 6/-1/-1->7->4 [3] 6/-1/-1->7->4 [4] 5/-1/-1->7->3 [5] 3/-1/-1->7->5 [6] 4/-1/-1->7->6 [7] 4/-1/-1->7->6 [8] 6/-1/-1->7->4 [9] 6/-1/-1->7->4 [10] 5/-1/-1->7->3 [11] 3/-1/-1->7->5
b391fedc46b4:39:81 [7] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:38:77 [6] NCCL INFO comm 0x95cdea0 rank 6 nRanks 8 nNodes 1 localRanks 8 localRank 6 MNNVL 0
b391fedc46b4:38:77 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 5/-1/-1->6->7 [3] 5/-1/-1->6->7 [4] 2/-1/-1->6->4 [5] 4/-1/-1->6->2 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 5/-1/-1->6->7 [9] 5/-1/-1->6->7 [10] 2/-1/-1->6->4 [11] 4/-1/-1->6->2
b391fedc46b4:38:77 [6] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:34:80 [2] NCCL INFO comm 0x8a3b540 rank 2 nRanks 8 nNodes 1 localRanks 8 localRank 2 MNNVL 0
b391fedc46b4:34:80 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] -1/-1/-1->2->6 [5] 6/-1/-1->2->0 [6] 1/-1/-1->2->3 [7] 1/-1/-1->2->3 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] -1/-1/-1->2->6 [11] 6/-1/-1->2->0
b391fedc46b4:34:80 [2] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:36:76 [4] NCCL INFO comm 0x8c08d40 rank 4 nRanks 8 nNodes 1 localRanks 8 localRank 4 MNNVL 0
b391fedc46b4:36:76 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7 [2] 7/-1/-1->4->0 [3] 7/-1/-1->4->0 [4] 6/-1/-1->4->5 [5] 5/-1/-1->4->6 [6] -1/-1/-1->4->7 [7] -1/-1/-1->4->7 [8] 7/-1/-1->4->0 [9] 7/-1/-1->4->0 [10] 6/-1/-1->4->5 [11] 5/-1/-1->4->6
b391fedc46b4:36:76 [4] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:35:78 [3] NCCL INFO comm 0x9574d80 rank 3 nRanks 8 nNodes 1 localRanks 8 localRank 3 MNNVL 0
b391fedc46b4:35:78 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] 7/-1/-1->3->1 [5] 1/-1/-1->3->7 [6] 2/-1/-1->3->0 [7] 2/-1/-1->3->0 [8] -1/-1/-1->3->2 [9] -1/-1/-1->3->2 [10] 7/-1/-1->3->1 [11] 1/-1/-1->3->7
b391fedc46b4:35:78 [3] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:32:75 [0] NCCL INFO Channel 05/12 :    0   2   6   4   5   7   3   1
b391fedc46b4:32:75 [0] NCCL INFO Channel 06/12 :    0   3   2   1   5   6   7   4
b391fedc46b4:32:75 [0] NCCL INFO Channel 07/12 :    0   3   2   1   5   6   7   4
b391fedc46b4:32:75 [0] NCCL INFO Channel 08/12 :    0   4   7   6   5   1   2   3
b391fedc46b4:32:75 [0] NCCL INFO Channel 09/12 :    0   4   7   6   5   1   2   3
b391fedc46b4:32:75 [0] NCCL INFO Channel 10/12 :    0   1   3   7   5   4   6   2
b391fedc46b4:32:75 [0] NCCL INFO Channel 11/12 :    0   2   6   4   5   7   3   1
b391fedc46b4:32:75 [0] NCCL INFO Trees [0] 3/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 4/-1/-1->0->-1 [3] 4/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 2/-1/-1->0->-1 [6] 3/-1/-1->0->-1 [7] 3/-1/-1->0->-1 [8] 4/-1/-1->0->-1 [9] 4/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 2/-1/-1->0->-1
b391fedc46b4:32:75 [0] NCCL INFO P2P Chunksize set to 524288
b391fedc46b4:36:76 [4] NCCL INFO Channel 05/0 : 4[4] -> 5[5] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 11/0 : 4[4] -> 5[5] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 06/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 08/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 07/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 09/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 06/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 02/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 03/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 04/0 : 4[4] -> 6[6] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 05/0 : 5[5] -> 7[7] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 08/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 10/0 : 4[4] -> 6[6] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 11/0 : 5[5] -> 7[7] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 09/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 02/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 05/0 : 0[0] -> 2[2] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 04/0 : 1[1] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 03/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 11/0 : 0[0] -> 2[2] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 10/0 : 1[1] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 08/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 04/0 : 6[6] -> 2[2] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 09/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 10/0 : 6[6] -> 2[2] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 01/0 : 0[0] -> 3[3] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 02/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 06/0 : 0[0] -> 3[3] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 05/0 : 2[2] -> 6[6] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 03/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 07/0 : 0[0] -> 3[3] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 11/0 : 2[2] -> 6[6] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 08/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 05/0 : 7[7] -> 3[3] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 00/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 05/0 : 6[6] -> 4[4] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 04/0 : 2[2] -> 0[0] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 09/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 11/0 : 7[7] -> 3[3] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 01/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 11/0 : 6[6] -> 4[4] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 10/0 : 2[2] -> 0[0] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 06/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 00/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 04/0 : 3[3] -> 7[7] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 07/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 01/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 02/0 : 0[0] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 10/0 : 3[3] -> 7[7] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 06/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 03/0 : 0[0] -> 4[4] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 00/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 02/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 07/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 08/0 : 0[0] -> 4[4] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 01/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 03/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Channel 09/0 : 0[0] -> 4[4] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 06/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 08/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 07/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 09/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 04/0 : 7[7] -> 5[5] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 05/0 : 3[3] -> 1[1] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 10/0 : 7[7] -> 5[5] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 11/0 : 3[3] -> 1[1] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 02/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 03/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 08/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 06/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 09/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 07/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 04/0 : 5[5] -> 4[4] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 02/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 10/0 : 5[5] -> 4[4] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 03/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 11/0 : 1[1] -> 0[0] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 08/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 06/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 09/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 07/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Connected all rings
b391fedc46b4:36:76 [4] NCCL INFO Channel 04/0 : 4[4] -> 5[5] via P2P/CUMEM
b391fedc46b4:32:75 [0] NCCL INFO Connected all rings
b391fedc46b4:39:81 [7] NCCL INFO Connected all rings
b391fedc46b4:33:82 [1] NCCL INFO Connected all rings
b391fedc46b4:33:82 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Connected all rings
b391fedc46b4:35:78 [3] NCCL INFO Connected all rings
b391fedc46b4:37:79 [5] NCCL INFO Connected all rings
b391fedc46b4:38:77 [6] NCCL INFO Connected all rings
b391fedc46b4:36:76 [4] NCCL INFO Channel 10/0 : 4[4] -> 5[5] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 02/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 03/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 02/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 08/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 03/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 09/0 : 5[5] -> 6[6] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 06/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 08/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 05/0 : 4[4] -> 6[6] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 07/0 : 1[1] -> 2[2] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 09/0 : 6[6] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 11/0 : 4[4] -> 6[6] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 04/0 : 5[5] -> 7[7] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 05/0 : 1[1] -> 3[3] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 10/0 : 5[5] -> 7[7] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 11/0 : 1[1] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 00/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 04/0 : 2[2] -> 6[6] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 01/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 10/0 : 2[2] -> 6[6] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 05/0 : 6[6] -> 2[2] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 06/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 05/0 : 3[3] -> 7[7] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 11/0 : 6[6] -> 2[2] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 07/0 : 4[4] -> 7[7] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 11/0 : 3[3] -> 7[7] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 04/0 : 6[6] -> 4[4] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 05/0 : 2[2] -> 0[0] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 02/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 00/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 02/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 10/0 : 6[6] -> 4[4] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 11/0 : 2[2] -> 0[0] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 03/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 01/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 03/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 04/0 : 7[7] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 08/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 06/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 08/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 10/0 : 7[7] -> 3[3] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Channel 09/0 : 4[4] -> 0[0] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 07/0 : 5[5] -> 1[1] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 09/0 : 1[1] -> 5[5] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 02/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 03/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 06/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 08/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 07/0 : 3[3] -> 0[0] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 09/0 : 7[7] -> 4[4] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 05/0 : 7[7] -> 5[5] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 04/0 : 3[3] -> 1[1] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 11/0 : 7[7] -> 5[5] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 10/0 : 3[3] -> 1[1] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 02/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 03/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 06/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:39:81 [7] NCCL INFO Channel 07/0 : 7[7] -> 6[6] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 05/0 : 5[5] -> 4[4] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
b391fedc46b4:37:79 [5] NCCL INFO Channel 11/0 : 5[5] -> 4[4] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 08/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:33:82 [1] NCCL INFO Channel 10/0 : 1[1] -> 0[0] via P2P/CUMEM
b391fedc46b4:35:78 [3] NCCL INFO Channel 09/0 : 3[3] -> 2[2] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 06/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:38:77 [6] NCCL INFO Channel 07/0 : 6[6] -> 5[5] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 02/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 03/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 08/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:34:80 [2] NCCL INFO Channel 09/0 : 2[2] -> 1[1] via P2P/CUMEM
b391fedc46b4:36:76 [4] NCCL INFO Connected all trees
b391fedc46b4:36:76 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:36:76 [4] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:39:81 [7] NCCL INFO Connected all trees
b391fedc46b4:39:81 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:39:81 [7] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:39:81 [7] NCCL INFO Channel 08/1 : 7[7] -> 0[0] via P2P/indirect/4[4]
b391fedc46b4:38:77 [6] NCCL INFO Connected all trees
b391fedc46b4:38:77 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:38:77 [6] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:38:77 [6] NCCL INFO Channel 04/1 : 6[6] -> 0[0] via P2P/indirect/4[4]
b391fedc46b4:37:79 [5] NCCL INFO Connected all trees
b391fedc46b4:37:79 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:37:79 [5] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:39:81 [7] NCCL INFO Channel 09/1 : 7[7] -> 0[0] via P2P/indirect/4[4]
b391fedc46b4:32:75 [0] NCCL INFO Connected all trees
b391fedc46b4:33:82 [1] NCCL INFO Connected all trees
b391fedc46b4:33:82 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:33:82 [1] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:35:78 [3] NCCL INFO Connected all trees
b391fedc46b4:35:78 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:35:78 [3] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:32:75 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:32:75 [0] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:34:80 [2] NCCL INFO Connected all trees
b391fedc46b4:35:78 [3] NCCL INFO Channel 08/1 : 3[3] -> 4[4] via P2P/indirect/0[0]
b391fedc46b4:34:80 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
b391fedc46b4:34:80 [2] NCCL INFO 12 coll channels, 0 collnet channels, 0 nvls channels, 16 p2p channels, 2 p2p channels per peer
b391fedc46b4:34:80 [2] NCCL INFO Channel 04/1 : 2[2] -> 4[4] via P2P/indirect/0[0]
b391fedc46b4:38:77 [6] NCCL INFO Channel 05/1 : 6[6] -> 0[0] via P2P/indirect/4[4]
b391fedc46b4:35:78 [3] NCCL INFO Channel 09/1 : 3[3] -> 4[4] via P2P/indirect/0[0]
b391fedc46b4:34:80 [2] NCCL INFO Channel 05/1 : 2[2] -> 4[4] via P2P/indirect/0[0]
b391fedc46b4:35:78 [3] NCCL INFO Channel 04/1 : 3[3] -> 5[5] via P2P/indirect/1[1]
b391fedc46b4:39:81 [7] NCCL INFO Channel 04/1 : 7[7] -> 1[1] via P2P/indirect/5[5]
b391fedc46b4:35:78 [3] NCCL INFO Channel 05/1 : 3[3] -> 5[5] via P2P/indirect/1[1]
b391fedc46b4:39:81 [7] NCCL INFO Channel 05/1 : 7[7] -> 1[1] via P2P/indirect/5[5]
b391fedc46b4:35:78 [3] NCCL INFO Channel 12/1 : 3[3] -> 6[6] via P2P/indirect/7[7]
b391fedc46b4:39:81 [7] NCCL INFO Channel 12/1 : 7[7] -> 2[2] via P2P/indirect/3[3]
b391fedc46b4:35:78 [3] NCCL INFO Channel 13/1 : 3[3] -> 6[6] via P2P/indirect/7[7]
b391fedc46b4:39:81 [7] NCCL INFO Channel 13/1 : 7[7] -> 2[2] via P2P/indirect/3[3]
b391fedc46b4:34:80 [2] NCCL INFO Channel 12/1 : 2[2] -> 5[5] via P2P/indirect/1[1]
b391fedc46b4:37:79 [5] NCCL INFO Channel 12/1 : 5[5] -> 0[0] via P2P/indirect/4[4]
b391fedc46b4:38:77 [6] NCCL INFO Channel 12/1 : 6[6] -> 1[1] via P2P/indirect/5[5]
b391fedc46b4:33:82 [1] NCCL INFO Channel 12/1 : 1[1] -> 4[4] via P2P/indirect/0[0]
b391fedc46b4:38:77 [6] NCCL INFO Channel 13/1 : 6[6] -> 1[1] via P2P/indirect/5[5]
b391fedc46b4:34:80 [2] NCCL INFO Channel 13/1 : 2[2] -> 5[5] via P2P/indirect/1[1]
b391fedc46b4:37:79 [5] NCCL INFO Channel 13/1 : 5[5] -> 0[0] via P2P/indirect/4[4]
b391fedc46b4:33:82 [1] NCCL INFO Channel 13/1 : 1[1] -> 4[4] via P2P/indirect/0[0]
b391fedc46b4:32:75 [0] NCCL INFO Channel 10/1 : 0[0] -> 5[5] via P2P/indirect/1[1]
b391fedc46b4:36:76 [4] NCCL INFO Channel 10/1 : 4[4] -> 1[1] via P2P/indirect/5[5]
b391fedc46b4:36:76 [4] NCCL INFO Channel 11/1 : 4[4] -> 1[1] via P2P/indirect/5[5]
b391fedc46b4:32:75 [0] NCCL INFO Channel 11/1 : 0[0] -> 5[5] via P2P/indirect/1[1]
b391fedc46b4:34:80 [2] NCCL INFO Channel 10/1 : 2[2] -> 7[7] via P2P/indirect/6[6]
b391fedc46b4:33:82 [1] NCCL INFO Channel 10/1 : 1[1] -> 6[6] via P2P/indirect/5[5]
b391fedc46b4:37:79 [5] NCCL INFO Channel 10/1 : 5[5] -> 2[2] via P2P/indirect/1[1]
b391fedc46b4:38:77 [6] NCCL INFO Channel 10/1 : 6[6] -> 3[3] via P2P/indirect/2[2]
b391fedc46b4:34:80 [2] NCCL INFO Channel 11/1 : 2[2] -> 7[7] via P2P/indirect/6[6]
b391fedc46b4:37:79 [5] NCCL INFO Channel 11/1 : 5[5] -> 2[2] via P2P/indirect/1[1]
b391fedc46b4:33:82 [1] NCCL INFO Channel 11/1 : 1[1] -> 6[6] via P2P/indirect/5[5]
b391fedc46b4:38:77 [6] NCCL INFO Channel 11/1 : 6[6] -> 3[3] via P2P/indirect/2[2]
b391fedc46b4:32:75 [0] NCCL INFO Channel 06/1 : 0[0] -> 6[6] via P2P/indirect/4[4]
b391fedc46b4:37:79 [5] NCCL INFO Channel 06/1 : 5[5] -> 3[3] via P2P/indirect/1[1]
b391fedc46b4:36:76 [4] NCCL INFO Channel 06/1 : 4[4] -> 2[2] via P2P/indirect/6[6]
b391fedc46b4:33:82 [1] NCCL INFO Channel 06/1 : 1[1] -> 7[7] via P2P/indirect/3[3]
b391fedc46b4:32:75 [0] NCCL INFO Channel 07/1 : 0[0] -> 6[6] via P2P/indirect/4[4]
b391fedc46b4:33:82 [1] NCCL INFO Channel 07/1 : 1[1] -> 7[7] via P2P/indirect/3[3]
b391fedc46b4:37:79 [5] NCCL INFO Channel 07/1 : 5[5] -> 3[3] via P2P/indirect/1[1]
b391fedc46b4:36:76 [4] NCCL INFO Channel 07/1 : 4[4] -> 2[2] via P2P/indirect/6[6]
b391fedc46b4:32:75 [0] NCCL INFO Channel 14/1 : 0[0] -> 7[7] via P2P/indirect/4[4]
b391fedc46b4:36:76 [4] NCCL INFO Channel 14/1 : 4[4] -> 3[3] via P2P/indirect/0[0]
b391fedc46b4:36:76 [4] NCCL INFO Channel 15/1 : 4[4] -> 3[3] via P2P/indirect/0[0]
b391fedc46b4:32:75 [0] NCCL INFO Channel 15/1 : 0[0] -> 7[7] via P2P/indirect/4[4]
b391fedc46b4:39:81 [7] NCCL INFO comm 0x147271f0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId 1e0 commId 0xdd45beef26d12c93 - Init COMPLETE
b391fedc46b4:35:78 [3] NCCL INFO comm 0x9574d80 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 1a0 commId 0xdd45beef26d12c93 - Init COMPLETE
b391fedc46b4:37:79 [5] NCCL INFO comm 0x9c86040 rank 5 nranks 8 cudaDev 5 nvmlDev 5 busId 1c0 commId 0xdd45beef26d12c93 - Init COMPLETE
b391fedc46b4:32:75 [0] NCCL INFO comm 0x1ddf1700 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 170 commId 0xdd45beef26d12c93 - Init COMPLETE
b391fedc46b4:33:82 [1] NCCL INFO comm 0xa0b0200 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 180 commId 0xdd45beef26d12c93 - Init COMPLETE
b391fedc46b4:36:76 [4] NCCL INFO comm 0x8c08d40 rank 4 nranks 8 cudaDev 4 nvmlDev 4 busId 1b0 commId 0xdd45beef26d12c93 - Init COMPLETE
b391fedc46b4:34:80 [2] NCCL INFO comm 0x8a3b540 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 190 commId 0xdd45beef26d12c93 - Init COMPLETE
b391fedc46b4:38:77 [6] NCCL INFO comm 0x95cdea0 rank 6 nranks 8 cudaDev 6 nvmlDev 6 busId 1d0 commId 0xdd45beef26d12c93 - Init COMPLETE
Process 3 - Training time: 0.40 seconds
Process 3 - Throughput: 251.42 samples/second
Process 0 - Training time: 0.40 seconds
Process 0 - Throughput: 247.30 samples/second
Process 7 - Training time: 0.43 seconds
Process 7 - Throughput: 232.12 samples/second
Process 5 - Training time: 0.43 seconds
Process 5 - Throughput: 234.14 samples/second
Process 1 - Training time: 0.43 seconds
Process 1 - Throughput: 231.37 samples/second
Process 4 - Training time: 0.44 seconds
Process 4 - Throughput: 229.70 samples/second
Process 6 - Training time: 0.43 seconds
Process 6 - Throughput: 233.03 samples/second
Process 2 - Training time: 0.44 seconds
Process 2 - Throughput: 229.34 samples/second
b391fedc46b4:32:91 [0] NCCL INFO [Service thread] Connection closed by localRank 0
b391fedc46b4:36:86 [4] NCCL INFO [Service thread] Connection closed by localRank 0
b391fedc46b4:33:97 [1] NCCL INFO [Service thread] Connection closed by localRank 0
b391fedc46b4:35:93 [3] NCCL INFO [Service thread] Connection closed by localRank 7
b391fedc46b4:37:95 [5] NCCL INFO [Service thread] Connection closed by localRank 7
b391fedc46b4:36:86 [4] NCCL INFO [Service thread] Connection closed by localRank 7
b391fedc46b4:39:83 [7] NCCL INFO [Service thread] Connection closed by localRank 7
b391fedc46b4:32:91 [0] NCCL INFO [Service thread] Connection closed by localRank 4
b391fedc46b4:36:86 [4] NCCL INFO [Service thread] Connection closed by localRank 4
b391fedc46b4:37:95 [5] NCCL INFO [Service thread] Connection closed by localRank 4
b391fedc46b4:38:85 [6] NCCL INFO [Service thread] Connection closed by localRank 4
b391fedc46b4:32:91 [0] NCCL INFO [Service thread] Connection closed by localRank 3
b391fedc46b4:33:97 [1] NCCL INFO [Service thread] Connection closed by localRank 3
b391fedc46b4:35:93 [3] NCCL INFO [Service thread] Connection closed by localRank 3
b391fedc46b4:39:83 [7] NCCL INFO [Service thread] Connection closed by localRank 3
b391fedc46b4:32:91 [0] NCCL INFO [Service thread] Connection closed by localRank 2
b391fedc46b4:33:97 [1] NCCL INFO [Service thread] Connection closed by localRank 2
b391fedc46b4:38:85 [6] NCCL INFO [Service thread] Connection closed by localRank 2
b391fedc46b4:34:89 [2] NCCL INFO [Service thread] Connection closed by localRank 2
b391fedc46b4:32:91 [0] NCCL INFO [Service thread] Connection closed by localRank 1
b391fedc46b4:33:97 [1] NCCL INFO [Service thread] Connection closed by localRank 1
b391fedc46b4:35:93 [3] NCCL INFO [Service thread] Connection closed by localRank 1
b391fedc46b4:34:89 [2] NCCL INFO [Service thread] Connection closed by localRank 6
b391fedc46b4:36:86 [4] NCCL INFO [Service thread] Connection closed by localRank 6
b391fedc46b4:37:95 [5] NCCL INFO [Service thread] Connection closed by localRank 1
b391fedc46b4:37:95 [5] NCCL INFO [Service thread] Connection closed by localRank 6
b391fedc46b4:38:85 [6] NCCL INFO [Service thread] Connection closed by localRank 6
b391fedc46b4:36:86 [4] NCCL INFO [Service thread] Connection closed by localRank 5
b391fedc46b4:33:97 [1] NCCL INFO [Service thread] Connection closed by localRank 5
b391fedc46b4:37:95 [5] NCCL INFO [Service thread] Connection closed by localRank 5
b391fedc46b4:39:174 [0] NCCL INFO comm 0x147271f0 rank 7 nranks 8 cudaDev 7 busId 1e0 - Abort COMPLETE
b391fedc46b4:32:172 [0] NCCL INFO comm 0x1ddf1700 rank 0 nranks 8 cudaDev 0 busId 170 - Abort COMPLETE
b391fedc46b4:35:171 [0] NCCL INFO comm 0x9574d80 rank 3 nranks 8 cudaDev 3 busId 1a0 - Abort COMPLETE
b391fedc46b4:34:178 [0] NCCL INFO comm 0x8a3b540 rank 2 nranks 8 cudaDev 2 busId 190 - Abort COMPLETE
b391fedc46b4:38:177 [0] NCCL INFO comm 0x95cdea0 rank 6 nranks 8 cudaDev 6 busId 1d0 - Abort COMPLETE
b391fedc46b4:33:175 [0] NCCL INFO comm 0xa0b0200 rank 1 nranks 8 cudaDev 1 busId 180 - Abort COMPLETE
b391fedc46b4:36:176 [0] NCCL INFO comm 0x8c08d40 rank 4 nranks 8 cudaDev 4 busId 1b0 - Abort COMPLETE
b391fedc46b4:37:173 [0] NCCL INFO comm 0x9c86040 rank 5 nranks 8 cudaDev 5 busId 1c0 - Abort COMPLETE

This PR also includes a new GitHub Action to verify that the docker image builds successfully.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

steps:
- uses: actions/checkout@v3
- run: docker build --file e2e2/test/images/bert-training/Dockerfile.bert-training e2e2/test/images/bert-training
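
For reference, a hedged sketch of what a complete workflow file along these lines might look like (the workflow path, name, and job name are assumptions, not necessarily what the PR adds):

# .github/workflows/image-build.yaml (hypothetical path/name)
name: image-build
on: [pull_request]
jobs:
  build-bert-training:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: docker build --file e2e2/test/images/bert-training/Dockerfile.bert-training e2e2/test/images/bert-training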
Contributor:

an idea for the future (since there will be more to come): maybe we standardize an images/XXX/{Dockerfile, ...} structure for all our images and then create a job matrix for the test image builds

Comment on lines 40 to 41
os.environ['MASTER_ADDR'] = os.environ['MASTER_ADDR'] # Kubernetes sets this
os.environ['MASTER_PORT'] = os.environ['MASTER_PORT'] # Kubernetes sets this
Contributor:

move this to docker ENV with defaults

Contributor Author:

What are your thoughts on just setting the defaults here in the script (I should have done this in the first place) versus in the dockerfile?

Contributor Author:

The default values should only be used during local dev, which is why I could see them belonging in the training script. Kubernetes should set them at runtime.
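
For illustration, a minimal sketch of the script-side option (the defaults shown are the local-dev values from the test run above):

import os
# hypothetical: fall back to local-dev values only when Kubernetes has not injected them
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "12355")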

Contributor:

Kubernetes should set them at runtime.

meaning it's part of the container spec, right? Those envs should override what's in the docker ENV, so I think it generally makes sense to put it in the Dockerfile/image definition
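
A sketch of the Dockerfile approach, assuming the same local-dev defaults; env entries in a Kubernetes container spec would override these at runtime:

# in Dockerfile.bert-training (hypothetical snippet): defaults for local dev,
# overridden by the pod/container spec in-cluster
ENV MASTER_ADDR=localhost
ENV MASTER_PORT=12355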

Contributor Author:

Yep, it would be in the container spec. Works for me, I'll make the change.

Contributor Author:

@ndbaker1 Done

Comment on lines +85 to +119
print(f"Process {rank} - Training time: {training_time:.2f} seconds")
print(f"Process {rank} - Throughput: {throughput:.2f} samples/second")
Contributor:

do we need to dump this output to disk so we can use it to upload to S3?

Contributor Author (@mattcjo, Jun 26, 2024):

Potentially... I was also considering writing directly to S3, but was curious to hear others' perspectives. My intuition says writing to S3 is the long-term solution (once a stable schema is solidified), but short term, just writing to disk or stdout might be the way to go.

Contributor (@ndbaker1, Jun 26, 2024):

i agree it can go to S3, CloudWatch, etc. once we know where this is going. we should definitely have this output printed as well, though.

Contributor:

it should be fine to dump to disk for the short term, and you have enough flexibility to POC different long-term destinations

Contributor Author:

@Issacwww Are there any concerns/considerations with writing from the container to the host machine?

Contributor:

oh, good call out. this runs on the pod worker, so dumping to disk is no different from stdout... stdout should be fine for now.
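
For illustration, a minimal sketch of emitting the metrics as structured stdout lines that a log pipeline could later ship to S3 or CloudWatch without code changes (names are assumptions; the script currently uses plain print statements):

import json

def emit_metrics(rank: int, training_time: float, throughput: float) -> None:
    # one JSON object per line on stdout, easy to scrape from container logs
    record = {
        "rank": rank,
        "training_time_seconds": round(training_time, 2),
        "throughput_samples_per_second": round(throughput, 2),
    }
    print(json.dumps(record), flush=True)

emit_metrics(rank=0, training_time=0.40, throughput=247.30)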

@mattcjo mattcjo marked this pull request as draft July 1, 2024 21:15

start_time = time.time()

for epoch in range(1): # Short run for testing
Contributor:

Should we let the program read the number of epochs from an environment variable or argument? That way we could allow larger instances (e.g. p5) to run more epochs without changing the code. A minimal sketch of the idea follows.
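
For illustration (NUM_EPOCHS is a hypothetical variable name, not something the PR defines):

import os
# hypothetical: epoch count configurable at deploy time, defaulting to the
# single-epoch smoke test used today
num_epochs = int(os.environ.get("NUM_EPOCHS", "1"))
for epoch in range(num_epochs):
    print(f"running epoch {epoch}")  # placeholder for the training loop body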

Contributor Author:

What would be the purpose of having more epochs for larger instance sizes? Are you thinking about it purely from the perspective of wanting the tests to last the same amount of time on each instance type?

Contributor:

Just some random thoughts. I was thinking we could run more epochs for larger instances to get more accurate performance data. Additionally, we could reuse this code for our future long-running tests (like soak tests).

Contributor Author:

Gotcha... Yeah, I certainly appreciate the idea behind reusability, but there's a good chance this current test isn't the best option for a soak test anyway.

As for more epochs on larger instance types, it depends on what your end goal is. For the tests we're running, and the metrics we're looking to gather, I don't see any benefit in doing this at this time.

Contributor:

Had a discussion with @cartermckinnon. I think we could reuse the e2e2/test/images/nvidia/Dockerfile, so we don't need to maintain multiple images and dockerfiles.

Contributor Author:

I mean that's fine from a base image perspective, since many of the dependencies will be shared among test types (i.e. unit/training/inference), but training and inference will both require unique dependencies on top of what's included in e2e2/test/images/nvidia/Dockerfile. The dependencies between training and inference might be the same at this point, but this could very well change in the future.

Contributor:

Can we add those unique dependencies to the e2e2/test/images/nvidia/Dockerfile?

Contributor Author:

@cartermckinnon I agree though that further thought needs to be put into the test directory structure before we go too much further. I'm not sure how many more tests we're looking to add, but the current approach doesn't scale particularly well.

Contributor Author:

@weicongw I mean, sure, we can, but then you're adding another ~7GB of deps to that image, which are totally unnecessary for the unit tests. Also, if we ever added another test (e.g. ResNet), it could very well have its own unique dependencies as well. This will especially be true if we ever want to validate frameworks other than the ones currently being utilized.

Contributor Author (@mattcjo) commented Jul 18, 2024:

A few major updates were made in the last few commits. The GPU was incorrectly being assigned based on the process's world rank rather than its local rank, which caused a failure when running on a multi-node cluster. I've rectified the problem, and the script now runs successfully on a multi-node cluster. Here's the output from a successful workload run on a cluster of 4 nodes, where each node is a p3.16xlarge instance (NOTE - EFA is not enabled for this instance type):
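
A minimal sketch of the fix, assuming Open MPI's environment variables (the exact variable names used in train.py may differ):

import os
import torch

# the world rank identifies a process across the whole job; the local rank
# identifies it on its own node and is what must select the GPU
world_rank = int(os.environ.get("OMPI_COMM_WORLD_RANK", "0"))
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)  # keying off world_rank breaks on multi-node clusters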

Warning: Permanently added 'bert-mpi-training-worker-0.bert-mpi-training.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'bert-mpi-training-worker-1.bert-mpi-training.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'bert-mpi-training-worker-3.bert-mpi-training.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'bert-mpi-training-worker-2.bert-mpi-training.default.svc' (ED25519) to the list of known hosts.
[1,1]<stdout>:Process started for rank 1 with local rank 1
[1,1]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,1]<stderr>:  warnings.warn(
[1,0]<stdout>:Process started for rank 0 with local rank 0
[1,0]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,0]<stderr>:  warnings.warn(
[1,4]<stdout>:Process started for rank 4 with local rank 4
[1,4]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,4]<stderr>:  warnings.warn(
[1,11]<stdout>:Process started for rank 11 with local rank 3
[1,11]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,11]<stderr>:  warnings.warn(
[1,13]<stdout>:Process started for rank 13 with local rank 5
[1,13]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,13]<stderr>:  warnings.warn(
[1,15]<stdout>:Process started for rank 15 with local rank 7
[1,15]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,15]<stderr>:  warnings.warn(
[1,8]<stdout>:Process started for rank 8 with local rank 0
[1,8]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,8]<stderr>:  warnings.warn(
[1,12]<stdout>:Process started for rank 12 with local rank 4
[1,9]<stdout>:Process started for rank 9 with local rank 1
[1,12]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,12]<stderr>:  warnings.warn(
[1,9]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,9]<stderr>:  warnings.warn(
[1,14]<stdout>:Process started for rank 14 with local rank 6
[1,14]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,14]<stderr>:  warnings.warn(
[1,3]<stdout>:Process started for rank 3 with local rank 3
[1,3]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,3]<stderr>:  warnings.warn(
[1,10]<stdout>:Process started for rank 10 with local rank 2
[1,10]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,10]<stderr>:  warnings.warn(
[1,28]<stdout>:Process started for rank 28 with local rank 4
[1,24]<stdout>:Process started for rank 24 with local rank 0
[1,25]<stdout>:Process started for rank 25 with local rank 1
[1,31]<stdout>:Process started for rank 31 with local rank 7
[1,24]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,24]<stderr>:  warnings.warn(
[1,28]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,28]<stderr>:  warnings.warn(
[1,25]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,25]<stderr>:  warnings.warn(
[1,31]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,31]<stderr>:  warnings.warn(
[1,29]<stdout>:Process started for rank 29 with local rank 5
[1,29]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,29]<stderr>:  warnings.warn(
[1,26]<stdout>:Process started for rank 26 with local rank 2
[1,26]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,26]<stderr>:  warnings.warn(
[1,2]<stdout>:Process started for rank 2 with local rank 2
[1,30]<stdout>:Process started for rank 30 with local rank 6
[1,2]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,2]<stderr>:  warnings.warn(
[1,30]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,30]<stderr>:  warnings.warn(
[1,21]<stdout>:Process started for rank 21 with local rank 5
[1,16]<stdout>:Process started for rank 16 with local rank 0
[1,17]<stdout>:Process started for rank 17 with local rank 1
[1,23]<stdout>:Process started for rank 23 with local rank 7
[1,16]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,16]<stderr>:  warnings.warn(
[1,21]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,21]<stderr>:  warnings.warn(
[1,17]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,17]<stderr>:  warnings.warn(
[1,23]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,23]<stderr>:  warnings.warn(
[1,19]<stdout>:Process started for rank 19 with local rank 3
[1,19]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,19]<stderr>:  warnings.warn(
[1,20]<stdout>:Process started for rank 20 with local rank 4
[1,20]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,20]<stderr>:  warnings.warn(
[1,27]<stdout>:Process started for rank 27 with local rank 3
[1,27]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,27]<stderr>:  warnings.warn(
[1,7]<stdout>:Process started for rank 7 with local rank 7
[1,7]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,7]<stderr>:  warnings.warn(
[1,18]<stdout>:Process started for rank 18 with local rank 2
[1,6]<stdout>:Process started for rank 6 with local rank 6
[1,18]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,18]<stderr>:  warnings.warn(
[1,6]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,6]<stderr>:  warnings.warn(
[1,5]<stdout>:Process started for rank 5 with local rank 5
[1,5]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,5]<stderr>:  warnings.warn(
[1,22]<stdout>:Process started for rank 22 with local rank 6
[1,22]<stderr>:/usr/local/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
[1,22]<stderr>:  warnings.warn(
[1,25]<stdout>:successfully downloaded model and tokenizer for rank: 25
[1,29]<stdout>:successfully downloaded model and tokenizer for rank: 29
[1,30]<stdout>:successfully downloaded model and tokenizer for rank: 30
[1,27]<stdout>:successfully downloaded model and tokenizer for rank: 27
[1,31]<stdout>:successfully downloaded model and tokenizer for rank: 31
[1,28]<stdout>:successfully downloaded model and tokenizer for rank: 28
[1,26]<stdout>:successfully downloaded model and tokenizer for rank: 26
[1,24]<stdout>:successfully downloaded model and tokenizer for rank: 24
[1,12]<stdout>:successfully downloaded model and tokenizer for rank: 12
[1,8]<stdout>:successfully downloaded model and tokenizer for rank: 8
[1,11]<stdout>:successfully downloaded model and tokenizer for rank: 11
[1,10]<stdout>:successfully downloaded model and tokenizer for rank: 10
[1,15]<stdout>:successfully downloaded model and tokenizer for rank: 15
[1,9]<stdout>:successfully downloaded model and tokenizer for rank: 9
[1,0]<stdout>:successfully downloaded model and tokenizer for rank: 0
[1,14]<stdout>:successfully downloaded model and tokenizer for rank: 14
[1,1]<stdout>:successfully downloaded model and tokenizer for rank: 1
[1,4]<stdout>:successfully downloaded model and tokenizer for rank: 4
[1,5]<stdout>:successfully downloaded model and tokenizer for rank: 5
[1,2]<stdout>:successfully downloaded model and tokenizer for rank: 2
[1,7]<stdout>:successfully downloaded model and tokenizer for rank: 7
[1,3]<stdout>:successfully downloaded model and tokenizer for rank: 3
[1,13]<stdout>:successfully downloaded model and tokenizer for rank: 13
[1,6]<stdout>:successfully downloaded model and tokenizer for rank: 6
[1,19]<stdout>:successfully downloaded model and tokenizer for rank: 19
[1,16]<stdout>:successfully downloaded model and tokenizer for rank: 16
[1,20]<stdout>:successfully downloaded model and tokenizer for rank: 20
[1,18]<stdout>:successfully downloaded model and tokenizer for rank: 18
[1,22]<stdout>:successfully downloaded model and tokenizer for rank: 22
[1,17]<stdout>:successfully downloaded model and tokenizer for rank: 17
[1,21]<stdout>:successfully downloaded model and tokenizer for rank: 21
[1,23]<stdout>:successfully downloaded model and tokenizer for rank: 23
[1,8]<stdout>:Process 8 initialized, using GPU 0
[1,16]<stdout>:Process 16 initialized, using GPU 0
[1,15]<stdout>:Process 15 initialized, using GPU 7
[1,9]<stdout>:Process 9 initialized, using GPU 1
[1,13]<stdout>:Process 13 initialized, using GPU 5
[1,14]<stdout>:Process 14 initialized, using GPU 6
[1,1]<stdout>:Process 1 initialized, using GPU 1
[1,7]<stdout>:Process 7 initialized, using GPU 7
[1,3]<stdout>:Process 3 initialized, using GPU 3
[1,19]<stdout>:Process 19 initialized, using GPU 3
[1,6]<stdout>:Process 6 initialized, using GPU 6
[1,4]<stdout>:Process 4 initialized, using GPU 4
[1,5]<stdout>:Process 5 initialized, using GPU 5
[1,2]<stdout>:Process 2 initialized, using GPU 2
[1,24]<stdout>:Process 24 initialized, using GPU 0
[1,0]<stdout>:Process 0 initialized, using GPU 0
[1,23]<stdout>:Process 23 initialized, using GPU 7
[1,18]<stdout>:Process 18 initialized, using GPU 2
[1,21]<stdout>:Process 21 initialized, using GPU 5
[1,17]<stdout>:Process 17 initialized, using GPU 1
[1,20]<stdout>:Process 20 initialized, using GPU 4
[1,22]<stdout>:Process 22 initialized, using GPU 6
[1,11]<stdout>:Process 11 initialized, using GPU 3
[1,10]<stdout>:Process 10 initialized, using GPU 2
[1,12]<stdout>:Process 12 initialized, using GPU 4
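For readers tracing the phases above: each of the 32 ranks (eight per node) prints three markers in order — process start, model/tokenizer download, and GPU/process-group setup. Below is a minimal, hypothetical sketch of the per-rank code that would emit those lines. It is illustrative only, not the literal contents of train.py: the OMPI_COMM_WORLD_* variables are OpenMPI launcher conventions assumed here, and the BERT class and checkpoint name are placeholders.

import os

import torch
import torch.distributed as dist
from transformers import BertForPreTraining, BertTokenizerFast

# Global rank (0..31 in this run), per-node local rank (0..7), and world size
# come from the OpenMPI launcher's environment.
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
print(f"Process started for rank {rank} with local rank {local_rank}")

# Each rank pulls pretrained weights and a tokenizer from the Hugging Face hub;
# this download step is what triggers the resume_download FutureWarnings above.
# Model class and checkpoint are assumptions for illustration.
model = BertForPreTraining.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
print(f"successfully downloaded model and tokenizer for rank: {rank}")

# One process per GPU: the local rank selects the CUDA device on this node,
# and the NCCL process group forms via PyTorch's default env:// rendezvous,
# which reads MASTER_ADDR/MASTER_PORT from the environment.
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
print(f"Process {rank} initialized, using GPU {local_rank}")

From this point NCCL takes over, which is where the per-rank INFO lines below begin.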
[1,0]<stdout>:bert-mpi-training-worker-0:21:21 [0] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,0]<stdout>:bert-mpi-training-worker-0:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,0]<stdout>:bert-mpi-training-worker-0:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,0]<stdout>:bert-mpi-training-worker-0:21:21 [0] NCCL INFO cudaDriverVersion 12050
[1,0]<stdout>:NCCL version 2.20.5+cuda12.4
[1,7]<stdout>:bert-mpi-training-worker-0:28:28 [7] NCCL INFO cudaDriverVersion 12050
[1,7]<stdout>:bert-mpi-training-worker-0:28:28 [7] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,5]<stdout>:bert-mpi-training-worker-0:26:26 [5] NCCL INFO cudaDriverVersion 12050
[1,1]<stdout>:bert-mpi-training-worker-0:22:22 [1] NCCL INFO cudaDriverVersion 12050
[1,4]<stdout>:bert-mpi-training-worker-0:25:25 [4] NCCL INFO cudaDriverVersion 12050
[1,2]<stdout>:bert-mpi-training-worker-0:23:23 [2] NCCL INFO cudaDriverVersion 12050
[1,3]<stdout>:bert-mpi-training-worker-0:24:24 [3] NCCL INFO cudaDriverVersion 12050
[1,6]<stdout>:bert-mpi-training-worker-0:27:27 [6] NCCL INFO cudaDriverVersion 12050
[1,1]<stdout>:bert-mpi-training-worker-0:22:22 [1] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,5]<stdout>:bert-mpi-training-worker-0:26:26 [5] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,6]<stdout>:bert-mpi-training-worker-0:27:27 [6] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,2]<stdout>:bert-mpi-training-worker-0:23:23 [2] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,3]<stdout>:bert-mpi-training-worker-0:24:24 [3] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,4]<stdout>:bert-mpi-training-worker-0:25:25 [4] NCCL INFO Bootstrap : Using eth0:192.168.29.226<0>
[1,7]<stdout>:bert-mpi-training-worker-0:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,7]<stdout>:bert-mpi-training-worker-0:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,1]<stdout>:bert-mpi-training-worker-0:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,1]<stdout>:bert-mpi-training-worker-0:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,5]<stdout>:bert-mpi-training-worker-0:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,5]<stdout>:bert-mpi-training-worker-0:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,2]<stdout>:bert-mpi-training-worker-0:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,2]<stdout>:bert-mpi-training-worker-0:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,6]<stdout>:bert-mpi-training-worker-0:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,6]<stdout>:bert-mpi-training-worker-0:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,4]<stdout>:bert-mpi-training-worker-0:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,4]<stdout>:bert-mpi-training-worker-0:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,3]<stdout>:bert-mpi-training-worker-0:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,3]<stdout>:bert-mpi-training-worker-0:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,15]<stdout>:bert-mpi-training-worker-1:28:28 [7] NCCL INFO cudaDriverVersion 12050
[1,8]<stdout>:bert-mpi-training-worker-1:21:21 [0] NCCL INFO cudaDriverVersion 12050
[1,15]<stdout>:bert-mpi-training-worker-1:28:28 [7] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,13]<stdout>:bert-mpi-training-worker-1:26:26 [5] NCCL INFO cudaDriverVersion 12050
[1,9]<stdout>:bert-mpi-training-worker-1:22:22 [1] NCCL INFO cudaDriverVersion 12050
[1,14]<stdout>:bert-mpi-training-worker-1:27:27 [6] NCCL INFO cudaDriverVersion 12050
[1,13]<stdout>:bert-mpi-training-worker-1:26:26 [5] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,8]<stdout>:bert-mpi-training-worker-1:21:21 [0] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,9]<stdout>:bert-mpi-training-worker-1:22:22 [1] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,14]<stdout>:bert-mpi-training-worker-1:27:27 [6] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,12]<stdout>:bert-mpi-training-worker-1:25:25 [4] NCCL INFO cudaDriverVersion 12050
[1,12]<stdout>:bert-mpi-training-worker-1:25:25 [4] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,15]<stdout>:bert-mpi-training-worker-1:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,15]<stdout>:bert-mpi-training-worker-1:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,8]<stdout>:bert-mpi-training-worker-1:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,8]<stdout>:bert-mpi-training-worker-1:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,13]<stdout>:bert-mpi-training-worker-1:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,13]<stdout>:bert-mpi-training-worker-1:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,9]<stdout>:bert-mpi-training-worker-1:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,14]<stdout>:bert-mpi-training-worker-1:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,14]<stdout>:bert-mpi-training-worker-1:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,9]<stdout>:bert-mpi-training-worker-1:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,12]<stdout>:bert-mpi-training-worker-1:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,12]<stdout>:bert-mpi-training-worker-1:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,11]<stdout>:bert-mpi-training-worker-1:24:24 [3] NCCL INFO cudaDriverVersion 12050
[1,11]<stdout>:bert-mpi-training-worker-1:24:24 [3] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,18]<stdout>:bert-mpi-training-worker-2:23:23 [2] NCCL INFO cudaDriverVersion 12050
[1,16]<stdout>:bert-mpi-training-worker-2:21:21 [0] NCCL INFO cudaDriverVersion 12050
[1,21]<stdout>:bert-mpi-training-worker-2:26:26 [5] NCCL INFO cudaDriverVersion 12050
[1,17]<stdout>:bert-mpi-training-worker-2:22:22 [1] NCCL INFO cudaDriverVersion 12050
[1,23]<stdout>:bert-mpi-training-worker-2:28:28 [7] NCCL INFO cudaDriverVersion 12050
[1,20]<stdout>:bert-mpi-training-worker-2:25:25 [4] NCCL INFO cudaDriverVersion 12050
[1,22]<stdout>:bert-mpi-training-worker-2:27:27 [6] NCCL INFO cudaDriverVersion 12050
[1,19]<stdout>:bert-mpi-training-worker-2:24:24 [3] NCCL INFO cudaDriverVersion 12050
[1,11]<stdout>:bert-mpi-training-worker-1:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,11]<stdout>:bert-mpi-training-worker-1:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,16]<stdout>:bert-mpi-training-worker-2:21:21 [0] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,18]<stdout>:bert-mpi-training-worker-2:23:23 [2] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,21]<stdout>:bert-mpi-training-worker-2:26:26 [5] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,17]<stdout>:bert-mpi-training-worker-2:22:22 [1] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,23]<stdout>:bert-mpi-training-worker-2:28:28 [7] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,22]<stdout>:bert-mpi-training-worker-2:27:27 [6] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,20]<stdout>:bert-mpi-training-worker-2:25:25 [4] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,19]<stdout>:bert-mpi-training-worker-2:24:24 [3] NCCL INFO Bootstrap : Using eth0:192.168.77.153<0>
[1,22]<stdout>:bert-mpi-training-worker-2:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,22]<stdout>:bert-mpi-training-worker-2:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,16]<stdout>:bert-mpi-training-worker-2:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,16]<stdout>:bert-mpi-training-worker-2:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,18]<stdout>:bert-mpi-training-worker-2:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,18]<stdout>:bert-mpi-training-worker-2:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,23]<stdout>:bert-mpi-training-worker-2:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,23]<stdout>:bert-mpi-training-worker-2:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,21]<stdout>:bert-mpi-training-worker-2:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,21]<stdout>:bert-mpi-training-worker-2:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,17]<stdout>:bert-mpi-training-worker-2:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,17]<stdout>:bert-mpi-training-worker-2:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,20]<stdout>:bert-mpi-training-worker-2:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,20]<stdout>:bert-mpi-training-worker-2:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,19]<stdout>:bert-mpi-training-worker-2:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,19]<stdout>:bert-mpi-training-worker-2:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,10]<stdout>:bert-mpi-training-worker-1:23:23 [2] NCCL INFO cudaDriverVersion 12050
[1,10]<stdout>:bert-mpi-training-worker-1:23:23 [2] NCCL INFO Bootstrap : Using eth0:192.168.60.235<0>
[1,10]<stdout>:bert-mpi-training-worker-1:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,10]<stdout>:bert-mpi-training-worker-1:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Setting provider_filter to efa
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,0]<stdout>:
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,0]<stdout>:
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/IB : No device found.
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Using non-device net plugin version 0
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Using network Socket
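A note on the WARN pairs that follow each rank's NET/OFI init: with provider_filter set to efa, the aws-ofi-nccl plugin probes for a Libfabric EFA provider, fails on nodes without an EFA device ("Failed to initialize sendrecv protocol" / "aws-ofi-nccl initialization failed"), NCCL then finds no InfiniBand device either ("NET/IB : No device found."), and falls back to the plain TCP socket transport over eth0. That fallback is expected on hardware without EFA, and the run proceeds with training traffic over sockets. To make this configuration explicit rather than discovered, the standard NCCL environment knobs look like the following — a hedged example using NCCL's own variables, not something this PR adds:

import os

# Assumed, standard NCCL environment knobs; they must be set before the first
# communicator is created, because NCCL reads them at initialization time.
os.environ.setdefault("NCCL_DEBUG", "INFO")          # source of the INFO/WARN lines in this log
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # pin the TCP socket fallback to eth0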
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Configuring AWS-specific options
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Setting provider_filter to efa
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Configuring AWS-specific options
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Setting provider_filter to efa
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Configuring AWS-specific options
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Setting provider_filter to efa
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Configuring AWS-specific options
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Setting provider_filter to efa
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Configuring AWS-specific options
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Setting provider_filter to efa
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Configuring AWS-specific options
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Setting provider_filter to efa
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Setting provider_filter to efa
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Configuring AWS-specific options
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Configuring AWS-specific options
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Setting provider_filter to efa
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Setting provider_filter to efa
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Configuring AWS-specific options
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Setting provider_filter to efa
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Configuring AWS-specific options
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Setting provider_filter to efa
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Configuring AWS-specific options
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Setting provider_filter to efa
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Configuring AWS-specific options
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Setting provider_filter to efa
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Setting provider_filter to efa
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Configuring AWS-specific options
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Setting provider_filter to efa
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Configuring AWS-specific options
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Setting provider_filter to efa
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Configuring AWS-specific options
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Setting provider_filter to efa
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Configuring AWS-specific options
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Setting provider_filter to efa
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Configuring AWS-specific options
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Setting provider_filter to efa
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Configuring AWS-specific options
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Setting provider_filter to efa
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Configuring AWS-specific options
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Setting provider_filter to efa
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Configuring AWS-specific options
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Setting provider_filter to efa
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Configuring AWS-specific options
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Setting provider_filter to efa
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,7]<stdout>:
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,7]<stdout>:
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/IB : No device found.
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Using non-device net plugin version 0
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Using network Socket
[1,4]<stdout>:
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,4]<stdout>:
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/IB : No device found.
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Using non-device net plugin version 0
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Using network Socket
[1,1]<stdout>:
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,1]<stdout>:
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/IB : No device found.
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Using non-device net plugin version 0
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Using network Socket
[1,3]<stdout>:
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,3]<stdout>:
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/IB : No device found.
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Using non-device net plugin version 0
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Using network Socket
[1,6]<stdout>:
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,6]<stdout>:
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/IB : No device found.
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Using non-device net plugin version 0
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Using network Socket
[1,15]<stdout>:
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,15]<stdout>:
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,23]<stdout>:
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,23]<stdout>:
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/IB : No device found.
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/IB : No device found.
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Using non-device net plugin version 0
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Using network Socket
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Using non-device net plugin version 0
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Using network Socket
[1,16]<stdout>:
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,16]<stdout>:
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/IB : No device found.
[1,2]<stdout>:
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,2]<stdout>:
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Using non-device net plugin version 0
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Using network Socket
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/IB : No device found.
[1,8]<stdout>:
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,8]<stdout>:
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Using non-device net plugin version 0
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Using network Socket
[1,9]<stdout>:
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,9]<stdout>:
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/IB : No device found.
[1,10]<stdout>:
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,10]<stdout>:
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/IB : No device found.
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Using non-device net plugin version 0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Using network Socket
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Using non-device net plugin version 0
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Using network Socket
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/IB : No device found.
[1,5]<stdout>:
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,5]<stdout>:
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Using non-device net plugin version 0
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Using network Socket
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/IB : No device found.
[1,11]<stdout>:
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,11]<stdout>:
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,19]<stdout>:
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,19]<stdout>:
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NET/Socket : Using [0]eth0:192.168.29.226<0>
[1,13]<stdout>:
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,13]<stdout>:
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Using non-device net plugin version 0
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Using network Socket
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/IB : No device found.
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/IB : No device found.
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/IB : No device found.
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Using non-device net plugin version 0
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Using network Socket
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Using non-device net plugin version 0
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Using network Socket
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Using non-device net plugin version 0
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Using network Socket
[1,14]<stdout>:
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,14]<stdout>:
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/IB : No device found.
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Using non-device net plugin version 0
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Using network Socket
[1,12]<stdout>:
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,12]<stdout>:
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/IB : No device found.
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NET/Socket : Using [0]eth0:192.168.60.235<0>
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Using non-device net plugin version 0
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Using network Socket
[1,18]<stdout>:
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,18]<stdout>:
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/IB : No device found.
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Using non-device net plugin version 0
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Using network Socket
[1,17]<stdout>:
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,17]<stdout>:
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/IB : No device found.
[1,20]<stdout>:
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,20]<stdout>:
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Using non-device net plugin version 0
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Using network Socket
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/IB : No device found.
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Using non-device net plugin version 0
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Using network Socket
[1,22]<stdout>:
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,22]<stdout>:
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/IB : No device found.
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Using non-device net plugin version 0
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Using network Socket
[1,21]<stdout>:
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,21]<stdout>:
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/IB : No device found.
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NET/Socket : Using [0]eth0:192.168.77.153<0>
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Using non-device net plugin version 0
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Using network Socket
[1,29]<stdout>:Process 29 initialized, using GPU 5
[1,25]<stdout>:Process 25 initialized, using GPU 1
[1,31]<stdout>:Process 31 initialized, using GPU 7
[1,30]<stdout>:Process 30 initialized, using GPU 6
[1,28]<stdout>:Process 28 initialized, using GPU 4
[1,26]<stdout>:Process 26 initialized, using GPU 2
[1,27]<stdout>:Process 27 initialized, using GPU 3
[1,29]<stdout>:bert-mpi-training-worker-3:26:26 [5] NCCL INFO cudaDriverVersion 12050
[1,29]<stdout>:bert-mpi-training-worker-3:26:26 [5] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,25]<stdout>:bert-mpi-training-worker-3:22:22 [1] NCCL INFO cudaDriverVersion 12050
[1,29]<stdout>:bert-mpi-training-worker-3:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,29]<stdout>:bert-mpi-training-worker-3:26:26 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,25]<stdout>:bert-mpi-training-worker-3:22:22 [1] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,25]<stdout>:bert-mpi-training-worker-3:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,25]<stdout>:bert-mpi-training-worker-3:22:22 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,30]<stdout>:bert-mpi-training-worker-3:27:27 [6] NCCL INFO cudaDriverVersion 12050
[1,30]<stdout>:bert-mpi-training-worker-3:27:27 [6] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,30]<stdout>:bert-mpi-training-worker-3:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,30]<stdout>:bert-mpi-training-worker-3:27:27 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,31]<stdout>:bert-mpi-training-worker-3:28:28 [7] NCCL INFO cudaDriverVersion 12050
[1,31]<stdout>:bert-mpi-training-worker-3:28:28 [7] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,31]<stdout>:bert-mpi-training-worker-3:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,31]<stdout>:bert-mpi-training-worker-3:28:28 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,24]<stdout>:bert-mpi-training-worker-3:21:21 [0] NCCL INFO cudaDriverVersion 12050
[1,24]<stdout>:bert-mpi-training-worker-3:21:21 [0] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,24]<stdout>:bert-mpi-training-worker-3:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,24]<stdout>:bert-mpi-training-worker-3:21:21 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,28]<stdout>:bert-mpi-training-worker-3:25:25 [4] NCCL INFO cudaDriverVersion 12050
[1,28]<stdout>:bert-mpi-training-worker-3:25:25 [4] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,28]<stdout>:bert-mpi-training-worker-3:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,28]<stdout>:bert-mpi-training-worker-3:25:25 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,27]<stdout>:bert-mpi-training-worker-3:24:24 [3] NCCL INFO cudaDriverVersion 12050
[1,27]<stdout>:bert-mpi-training-worker-3:24:24 [3] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,27]<stdout>:bert-mpi-training-worker-3:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,27]<stdout>:bert-mpi-training-worker-3:24:24 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,26]<stdout>:bert-mpi-training-worker-3:23:23 [2] NCCL INFO cudaDriverVersion 12050
[1,26]<stdout>:bert-mpi-training-worker-3:23:23 [2] NCCL INFO Bootstrap : Using eth0:192.168.45.173<0>
[1,26]<stdout>:bert-mpi-training-worker-3:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[1,26]<stdout>:bert-mpi-training-worker-3:23:23 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (>= v5). ncclCollNetPlugin symbols v4 and lower are not supported.
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Configuring AWS-specific options
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Setting provider_filter to efa
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Configuring AWS-specific options
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Setting provider_filter to efa
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,29]<stdout>:
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,29]<stdout>:
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/IB : No device found.
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Using non-device net plugin version 0
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Using network Socket
[1,25]<stdout>:
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,25]<stdout>:
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/IB : No device found.
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Using non-device net plugin version 0
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Using network Socket
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Configuring AWS-specific options
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Setting provider_filter to efa
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Configuring AWS-specific options
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Setting provider_filter to efa
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,30]<stdout>:
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,30]<stdout>:
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/IB : No device found.
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Using non-device net plugin version 0
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Using network Socket
[1,31]<stdout>:
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,31]<stdout>:
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/IB : No device found.
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Using non-device net plugin version 0
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Using network Socket
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Setting provider_filter to efa
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,24]<stdout>:
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,24]<stdout>:
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Configuring AWS-specific options
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Setting provider_filter to efa
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/IB : No device found.
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Using non-device net plugin version 0
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Using network Socket
[1,28]<stdout>:
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,28]<stdout>:
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Configuring AWS-specific options
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Setting provider_filter to efa
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.9.1-aws
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Using Libfabric version 1.21
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/IB : No device found.
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Using CUDA driver version 12050
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Configuring AWS-specific options
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Setting provider_filter to efa
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Internode latency set at 150.0 us
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/OFI Using transport protocol SENDRECV
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Using non-device net plugin version 0
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Using network Socket
[1,27]<stdout>:
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,27]<stdout>:
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/IB : No device found.
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Using non-device net plugin version 0
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Using network Socket
[1,26]<stdout>:
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] nccl_net_ofi_create_plugin:204 NCCL WARN NET/OFI Failed to initialize sendrecv protocol
[1,26]<stdout>:
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] nccl_net_ofi_create_plugin:257 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/IB : No device found.
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NET/Socket : Using [0]eth0:192.168.45.173<0>
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Using non-device net plugin version 0
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Using network Socket
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO comm 0x5653dbf7d100 rank 2 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init START
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO comm 0x5582a4a63380 rank 23 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init START
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO comm 0x5648da365f80 rank 22 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init START
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO comm 0x55986c6b8ac0 rank 3 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init START
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO comm 0x5640d3700dc0 rank 6 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init START
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO comm 0x5641c3f88e40 rank 4 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init START
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO comm 0x564c64f91780 rank 5 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init START
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO comm 0x563f66c76d80 rank 7 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init START
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO comm 0x55bc31870680 rank 0 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init START
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO comm 0x5613f202fdc0 rank 9 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init START
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO comm 0x55bdd0698c80 rank 28 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init START
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO comm 0x559fe7380640 rank 1 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init START
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO comm 0x55600fb9c840 rank 27 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init START
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO comm 0x558a2e4d59c0 rank 31 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init START
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO comm 0x55768c2aaf40 rank 29 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init START
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO comm 0x55b4b90f7540 rank 25 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init START
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO comm 0x556aae88b300 rank 24 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init START
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO comm 0x5609cdbbff40 rank 30 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init START
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO comm 0x55adee232c40 rank 26 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init START
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO comm 0x5591c7c2f340 rank 13 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init START
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO comm 0x560b8538e2c0 rank 10 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init START
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO comm 0x55e5f0da8fc0 rank 14 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init START
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO comm 0x55833c92f9c0 rank 15 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init START
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO comm 0x564656b04380 rank 11 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init START
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO comm 0x557b453cac40 rank 12 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init START
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO comm 0x5648e3019c80 rank 8 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init START
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO comm 0x5571768f15c0 rank 21 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init START
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO comm 0x559fbed3b980 rank 20 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init START
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO comm 0x5617488778d0 rank 19 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init START
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO comm 0x565192948c00 rank 18 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init START
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO comm 0x55e178f38c80 rank 17 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init START
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO comm 0x556941882cc0 rank 16 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init START
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO NVLS multicast support is not available on dev 4
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO NVLS multicast support is not available on dev 5
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO NVLS multicast support is not available on dev 7
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO NVLS multicast support is not available on dev 0
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO NVLS multicast support is not available on dev 5
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO NVLS multicast support is not available on dev 5
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO NVLS multicast support is not available on dev 6
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO NVLS multicast support is not available on dev 5
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO NVLS multicast support is not available on dev 1
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO NVLS multicast support is not available on dev 0
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO NVLS multicast support is not available on dev 2
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO NVLS multicast support is not available on dev 4
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO NVLS multicast support is not available on dev 6
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO NVLS multicast support is not available on dev 3
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO NVLS multicast support is not available on dev 3
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO NVLS multicast support is not available on dev 6
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO NVLS multicast support is not available on dev 6
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO NVLS multicast support is not available on dev 7
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO NVLS multicast support is not available on dev 1
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO NVLS multicast support is not available on dev 2
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO NVLS multicast support is not available on dev 2
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO NVLS multicast support is not available on dev 7
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO NVLS multicast support is not available on dev 3
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO NVLS multicast support is not available on dev 3
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO NVLS multicast support is not available on dev 1
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO NVLS multicast support is not available on dev 0
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO NVLS multicast support is not available on dev 2
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO NVLS multicast support is not available on dev 1
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO NVLS multicast support is not available on dev 0
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO NVLS multicast support is not available on dev 4
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO NVLS multicast support is not available on dev 4
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO NVLS multicast support is not available on dev 7
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO comm 0x559fbed3b980 rank 20 nRanks 32 nNodes 4 localRanks 8 localRank 4 MNNVL 0
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO comm 0x5640d3700dc0 rank 6 nRanks 32 nNodes 4 localRanks 8 localRank 6 MNNVL 0
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO comm 0x5648da365f80 rank 22 nRanks 32 nNodes 4 localRanks 8 localRank 6 MNNVL 0
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Trees [0] 23/-1/-1->22->21 [1] 23/-1/-1->22->21
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO P2P Chunksize set to 131072
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO comm 0x5571768f15c0 rank 21 nRanks 32 nNodes 4 localRanks 8 localRank 5 MNNVL 0
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Trees [0] 22/-1/-1->21->17 [1] 22/-1/-1->21->17
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO P2P Chunksize set to 131072
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Trees [0] -1/-1/-1->20->23 [1] -1/-1/-1->20->23
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO P2P Chunksize set to 131072
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO comm 0x55768c2aaf40 rank 29 nRanks 32 nNodes 4 localRanks 8 localRank 5 MNNVL 0
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO comm 0x5582a4a63380 rank 23 nRanks 32 nNodes 4 localRanks 8 localRank 7 MNNVL 0
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO comm 0x563f66c76d80 rank 7 nRanks 32 nNodes 4 localRanks 8 localRank 7 MNNVL 0
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Trees [0] 4/-1/-1->7->6 [1] 4/-1/-1->7->6
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO P2P Chunksize set to 131072
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO comm 0x55833c92f9c0 rank 15 nRanks 32 nNodes 4 localRanks 8 localRank 7 MNNVL 0
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Trees [0] 20/-1/-1->23->22 [1] 20/-1/-1->23->22
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO P2P Chunksize set to 131072
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO P2P Chunksize set to 131072
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Trees [0] 12/-1/-1->15->14 [1] 12/-1/-1->15->14
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO P2P Chunksize set to 131072
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO comm 0x5591c7c2f340 rank 13 nRanks 32 nNodes 4 localRanks 8 localRank 5 MNNVL 0
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO comm 0x5609cdbbff40 rank 30 nRanks 32 nNodes 4 localRanks 8 localRank 6 MNNVL 0
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Trees [0] 31/-1/-1->30->29 [1] 31/-1/-1->30->29
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO comm 0x55e5f0da8fc0 rank 14 nRanks 32 nNodes 4 localRanks 8 localRank 6 MNNVL 0
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO comm 0x558a2e4d59c0 rank 31 nRanks 32 nNodes 4 localRanks 8 localRank 7 MNNVL 0
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Trees [0] 28/-1/-1->31->30 [1] 28/-1/-1->31->30
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO comm 0x5653dbf7d100 rank 2 nRanks 32 nNodes 4 localRanks 8 localRank 2 MNNVL 0
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Trees [0] 1/-1/-1->2->3 [1] 1/-1/-1->2->3
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO P2P Chunksize set to 131072
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO comm 0x564c64f91780 rank 5 nRanks 32 nNodes 4 localRanks 8 localRank 5 MNNVL 0
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Trees [0] 6/-1/-1->5->1 [1] 6/-1/-1->5->1
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO P2P Chunksize set to 131072
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Trees [0] 15/-1/-1->14->13 [1] 15/-1/-1->14->13
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO P2P Chunksize set to 131072
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO comm 0x55bdd0698c80 rank 28 nRanks 32 nNodes 4 localRanks 8 localRank 4 MNNVL 0
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO comm 0x559fe7380640 rank 1 nRanks 32 nNodes 4 localRanks 8 localRank 1 MNNVL 0
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Trees [0] 5/-1/-1->1->2 [1] 5/-1/-1->1->2
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO P2P Chunksize set to 131072
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO comm 0x556aae88b300 rank 24 nRanks 32 nNodes 4 localRanks 8 localRank 0 MNNVL 0
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO comm 0x55bc31870680 rank 0 nRanks 32 nNodes 4 localRanks 8 localRank 0 MNNVL 0
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/02 :    0   3   2   1   5   6   7   4   8  11  10   9  13  14  15  12  16  19  18  17
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 01/02 :    0   3   2   1   5   6   7   4   8  11  10   9  13  14  15  12  16  19  18  17
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Trees [0] 3/16/-1->0->-1 [1] 3/-1/-1->0->8
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO P2P Chunksize set to 131072
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Trees [0] 30/-1/-1->29->25 [1] 30/-1/-1->29->25
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO P2P Chunksize set to 131072
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO comm 0x55986c6b8ac0 rank 3 nRanks 32 nNodes 4 localRanks 8 localRank 3 MNNVL 0
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Trees [0] 2/-1/-1->3->0 [1] 2/-1/-1->3->0
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO P2P Chunksize set to 131072
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Trees [0] 14/-1/-1->13->9 [1] 14/-1/-1->13->9
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO P2P Chunksize set to 131072
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO comm 0x55600fb9c840 rank 27 nRanks 32 nNodes 4 localRanks 8 localRank 3 MNNVL 0
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Trees [0] 26/-1/-1->27->24 [1] 26/-1/-1->27->24
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO P2P Chunksize set to 131072
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO comm 0x5641c3f88e40 rank 4 nRanks 32 nNodes 4 localRanks 8 localRank 4 MNNVL 0
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Trees [0] -1/-1/-1->4->7 [1] -1/-1/-1->4->7
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO P2P Chunksize set to 131072
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO comm 0x560b8538e2c0 rank 10 nRanks 32 nNodes 4 localRanks 8 localRank 2 MNNVL 0
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Trees [0] 9/-1/-1->10->11 [1] 9/-1/-1->10->11
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO P2P Chunksize set to 131072
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO P2P Chunksize set to 131072
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO comm 0x5648e3019c80 rank 8 nRanks 32 nNodes 4 localRanks 8 localRank 0 MNNVL 0
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO comm 0x55adee232c40 rank 26 nRanks 32 nNodes 4 localRanks 8 localRank 2 MNNVL 0
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Trees [0] 25/-1/-1->26->27 [1] 25/-1/-1->26->27
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO P2P Chunksize set to 131072
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO comm 0x5613f202fdc0 rank 9 nRanks 32 nNodes 4 localRanks 8 localRank 1 MNNVL 0
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO comm 0x55b4b90f7540 rank 25 nRanks 32 nNodes 4 localRanks 8 localRank 1 MNNVL 0
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Trees [0] 29/-1/-1->25->26 [1] 29/-1/-1->25->26
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO P2P Chunksize set to 131072
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO comm 0x564656b04380 rank 11 nRanks 32 nNodes 4 localRanks 8 localRank 3 MNNVL 0
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Trees [0] 10/-1/-1->11->8 [1] 10/16/-1->11->8
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO P2P Chunksize set to 131072
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO P2P Chunksize set to 131072
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO comm 0x557b453cac40 rank 12 nRanks 32 nNodes 4 localRanks 8 localRank 4 MNNVL 0
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Trees [0] -1/-1/-1->12->15 [1] -1/-1/-1->12->15
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO P2P Chunksize set to 131072
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Trees [0] -1/-1/-1->28->31 [1] -1/-1/-1->28->31
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO P2P Chunksize set to 131072
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Trees [0] 11/-1/-1->8->19 [1] 11/0/-1->8->24
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO P2P Chunksize set to 131072
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Trees [0] 27/-1/-1->24->16 [1] 27/8/-1->24->-1
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO P2P Chunksize set to 131072
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Trees [0] 13/-1/-1->9->10 [1] 13/-1/-1->9->10
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO P2P Chunksize set to 131072
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO comm 0x5617488778d0 rank 19 nRanks 32 nNodes 4 localRanks 8 localRank 3 MNNVL 0
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO comm 0x565192948c00 rank 18 nRanks 32 nNodes 4 localRanks 8 localRank 2 MNNVL 0
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Trees [0] 18/8/-1->19->16 [1] 18/-1/-1->19->16
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO P2P Chunksize set to 131072
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO comm 0x55e178f38c80 rank 17 nRanks 32 nNodes 4 localRanks 8 localRank 1 MNNVL 0
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO comm 0x556941882cc0 rank 16 nRanks 32 nNodes 4 localRanks 8 localRank 0 MNNVL 0
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Trees [0] 17/-1/-1->18->19 [1] 17/-1/-1->18->19
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO P2P Chunksize set to 131072
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Trees [0] 19/24/-1->16->0 [1] 19/-1/-1->16->11
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO P2P Chunksize set to 131072
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Trees [0] 21/-1/-1->17->18 [1] 21/-1/-1->17->18
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO P2P Chunksize set to 131072
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Channel 00/0 : 13[5] -> 14[6] via P2P/CUMEM
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Channel 00/0 : 29[5] -> 30[6] via P2P/CUMEM
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 00/0 : 8[0] -> 11[3] via P2P/CUMEM
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Channel 00/0 : 25[1] -> 29[5] via P2P/CUMEM
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/0 : 0[0] -> 3[3] via P2P/CUMEM
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Channel 00/0 : 9[1] -> 13[5] via P2P/CUMEM
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Channel 00/0 : 1[1] -> 5[5] via P2P/CUMEM
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 00/0 : 24[0] -> 27[3] via P2P/CUMEM
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Channel 00/0 : 5[5] -> 6[6] via P2P/CUMEM
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Channel 00/0 : 22[6] -> 23[7] via P2P/CUMEM
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Channel 01/0 : 13[5] -> 14[6] via P2P/CUMEM
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Channel 01/0 : 29[5] -> 30[6] via P2P/CUMEM
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 01/0 : 8[0] -> 11[3] via P2P/CUMEM
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Channel 01/0 : 9[1] -> 13[5] via P2P/CUMEM
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Channel 01/0 : 25[1] -> 29[5] via P2P/CUMEM
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 01/0 : 24[0] -> 27[3] via P2P/CUMEM
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 01/0 : 0[0] -> 3[3] via P2P/CUMEM
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Channel 00/0 : 18[2] -> 17[1] via P2P/CUMEM
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Channel 01/0 : 1[1] -> 5[5] via P2P/CUMEM
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Channel 01/0 : 22[6] -> 23[7] via P2P/CUMEM
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Channel 00/0 : 21[5] -> 22[6] via P2P/CUMEM
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Channel 01/0 : 5[5] -> 6[6] via P2P/CUMEM
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Channel 00/0 : 14[6] -> 15[7] via P2P/CUMEM
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Channel 00/0 : 10[2] -> 9[1] via P2P/CUMEM
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Channel 01/0 : 18[2] -> 17[1] via P2P/CUMEM
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Channel 00/0 : 30[6] -> 31[7] via P2P/CUMEM
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Channel 00/0 : 23[7] -> 20[4] via P2P/CUMEM
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Channel 00/0 : 26[2] -> 25[1] via P2P/CUMEM
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Channel 01/0 : 14[6] -> 15[7] via P2P/CUMEM
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Channel 01/0 : 10[2] -> 9[1] via P2P/CUMEM
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Channel 01/0 : 21[5] -> 22[6] via P2P/CUMEM
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Channel 00/0 : 6[6] -> 7[7] via P2P/CUMEM
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Channel 00/0 : 27[3] -> 26[2] via P2P/CUMEM
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Channel 01/0 : 30[6] -> 31[7] via P2P/CUMEM
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Channel 01/0 : 26[2] -> 25[1] via P2P/CUMEM
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Channel 00/0 : 2[2] -> 1[1] via P2P/CUMEM
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Channel 00/0 : 15[7] -> 12[4] via P2P/CUMEM
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Channel 00/0 : 3[3] -> 2[2] via P2P/CUMEM
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Channel 01/0 : 6[6] -> 7[7] via P2P/CUMEM
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Channel 01/0 : 27[3] -> 26[2] via P2P/CUMEM
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 00/0 : 11[3] -> 10[2] via P2P/CUMEM
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Channel 01/0 : 2[2] -> 1[1] via P2P/CUMEM
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Channel 00/0 : 17[1] -> 21[5] via P2P/CUMEM
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Channel 00/0 : 20[4] -> 24[0] [send] via NET/Socket/0
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Channel 00/0 : 31[7] -> 28[4] via P2P/CUMEM
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Channel 01/0 : 20[4] -> 24[0] [send] via NET/Socket/0
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Channel 01/0 : 3[3] -> 2[2] via P2P/CUMEM
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Channel 01/0 : 15[7] -> 12[4] via P2P/CUMEM
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Channel 01/0 : 23[7] -> 20[4] via P2P/CUMEM
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 01/0 : 11[3] -> 10[2] via P2P/CUMEM
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Channel 00/0 : 7[7] -> 4[4] via P2P/CUMEM
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Channel 01/0 : 31[7] -> 28[4] via P2P/CUMEM
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Channel 00/0 : 12[4] -> 16[0] [send] via NET/Socket/0
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Channel 00/0 : 28[4] -> 0[0] [send] via NET/Socket/0
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Channel 01/0 : 17[1] -> 21[5] via P2P/CUMEM
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Channel 01/0 : 12[4] -> 16[0] [send] via NET/Socket/0
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Channel 01/0 : 28[4] -> 0[0] [send] via NET/Socket/0
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Channel 01/0 : 7[7] -> 4[4] via P2P/CUMEM
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Channel 00/0 : 4[4] -> 8[0] [send] via NET/Socket/0
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Channel 01/0 : 4[4] -> 8[0] [send] via NET/Socket/0
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/0 : 16[0] -> 19[3] via P2P/CUMEM
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 01/0 : 16[0] -> 19[3] via P2P/CUMEM
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 00/0 : 19[3] -> 18[2] via P2P/CUMEM
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 00/0 : 20[4] -> 24[0] [receive] via NET/Socket/0
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 01/0 : 19[3] -> 18[2] via P2P/CUMEM
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 01/0 : 20[4] -> 24[0] [receive] via NET/Socket/0
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/0 : 28[4] -> 0[0] [receive] via NET/Socket/0
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 01/0 : 28[4] -> 0[0] [receive] via NET/Socket/0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 00/0 : 4[4] -> 8[0] [receive] via NET/Socket/0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 01/0 : 4[4] -> 8[0] [receive] via NET/Socket/0
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/0 : 12[4] -> 16[0] [receive] via NET/Socket/0
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 01/0 : 12[4] -> 16[0] [receive] via NET/Socket/0
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Connected all rings
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Channel 00/0 : 29[5] -> 25[1] via P2P/CUMEM
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Connected all rings
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Channel 00/0 : 25[1] -> 26[2] via P2P/CUMEM
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Channel 01/0 : 29[5] -> 25[1] via P2P/CUMEM
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Channel 01/0 : 25[1] -> 26[2] via P2P/CUMEM
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Connected all rings
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Channel 00/0 : 13[5] -> 9[1] via P2P/CUMEM
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Connected all rings
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/CUMEM
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Channel 01/0 : 13[5] -> 9[1] via P2P/CUMEM
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Channel 01/0 : 9[1] -> 10[2] via P2P/CUMEM
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Connected all rings
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Connected all rings
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Channel 00/0 : 21[5] -> 17[1] via P2P/CUMEM
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Channel 00/0 : 30[6] -> 29[5] via P2P/CUMEM
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Connected all rings
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Channel 00/0 : 17[1] -> 18[2] via P2P/CUMEM
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Channel 01/0 : 21[5] -> 17[1] via P2P/CUMEM
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Channel 01/0 : 30[6] -> 29[5] via P2P/CUMEM
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Connected all rings
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Connected all rings
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Channel 00/0 : 5[5] -> 1[1] via P2P/CUMEM
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Connected all rings
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Channel 00/0 : 1[1] -> 2[2] via P2P/CUMEM
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Channel 00/0 : 14[6] -> 13[5] via P2P/CUMEM
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Channel 01/0 : 17[1] -> 18[2] via P2P/CUMEM
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Connected all rings
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Channel 01/0 : 14[6] -> 13[5] via P2P/CUMEM
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Connected all rings
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Connected all rings
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Channel 01/0 : 5[5] -> 1[1] via P2P/CUMEM
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Channel 01/0 : 1[1] -> 2[2] via P2P/CUMEM
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Connected all rings
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Channel 00/0 : 22[6] -> 21[5] via P2P/CUMEM
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Connected all rings
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Connected all rings
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [receive] via NET/Socket/0
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Connected all rings
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Channel 00/0 : 4[4] -> 7[7] via P2P/CUMEM
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Connected all rings
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Connected all rings
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Channel 01/0 : 22[6] -> 21[5] via P2P/CUMEM
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Channel 00/0 : 10[2] -> 11[3] via P2P/CUMEM
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Channel 01/0 : 10[2] -> 11[3] via P2P/CUMEM
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Connected all rings
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 00/0 : 16[0] -> 24[0] [receive] via NET/Socket/0
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Channel 01/0 : 4[4] -> 7[7] via P2P/CUMEM
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Channel 00/0 : 26[2] -> 27[3] via P2P/CUMEM
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Connected all rings
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Connected all rings
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Channel 00/0 : 6[6] -> 5[5] via P2P/CUMEM
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Connected all rings
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Connected all rings
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Channel 00/0 : 20[4] -> 23[7] via P2P/CUMEM
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Channel 01/0 : 26[2] -> 27[3] via P2P/CUMEM
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Connected all rings
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/Socket/0
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Connected all rings
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Connected all rings
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Channel 00/0 : 28[4] -> 31[7] via P2P/CUMEM
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Channel 01/0 : 20[4] -> 23[7] via P2P/CUMEM
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Channel 00/0 : 7[7] -> 6[6] via P2P/CUMEM
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Channel 00/0 : 27[3] -> 24[0] via P2P/CUMEM
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Channel 01/0 : 6[6] -> 5[5] via P2P/CUMEM
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/0 : 16[0] -> 0[0] [receive] via NET/Socket/0
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/0 : 0[0] -> 16[0] [send] via NET/Socket/0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 00/0 : 8[0] -> 19[3] [send] via NET/Socket/0
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Channel 00/0 : 23[7] -> 22[6] via P2P/CUMEM
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Channel 01/0 : 28[4] -> 31[7] via P2P/CUMEM
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Channel 01/0 : 7[7] -> 6[6] via P2P/CUMEM
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Channel 01/0 : 27[3] -> 24[0] via P2P/CUMEM
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Channel 01/0 : 23[7] -> 22[6] via P2P/CUMEM
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Channel 00/0 : 31[7] -> 30[6] via P2P/CUMEM
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Channel 00/0 : 3[3] -> 0[0] via P2P/CUMEM
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Channel 01/0 : 31[7] -> 30[6] via P2P/CUMEM
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Connected all rings
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Connected all rings
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Channel 01/0 : 3[3] -> 0[0] via P2P/CUMEM
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Connected all rings
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 01/0 : 11[3] -> 16[0] [send] via NET/Socket/0
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Connected all rings
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Connected all rings
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Channel 00/0 : 12[4] -> 15[7] via P2P/CUMEM
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 01/0 : 11[3] -> 16[0] [receive] via NET/Socket/0
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Channel 00/0 : 18[2] -> 19[3] via P2P/CUMEM
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 01/0 : 16[0] -> 11[3] [receive] via NET/Socket/0
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Channel 01/0 : 12[4] -> 15[7] via P2P/CUMEM
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/0 : 16[0] -> 24[0] [send] via NET/Socket/0
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Channel 01/0 : 18[2] -> 19[3] via P2P/CUMEM
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Channel 00/0 : 15[7] -> 14[6] via P2P/CUMEM
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/0 : 0[0] -> 16[0] [receive] via NET/Socket/0
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 01/0 : 8[0] -> 24[0] [receive] via NET/Socket/0
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/0 : 16[0] -> 0[0] [send] via NET/Socket/0
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 01/0 : 24[0] -> 8[0] [send] via NET/Socket/0
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Channel 01/0 : 15[7] -> 14[6] via P2P/CUMEM
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/0 : 24[0] -> 16[0] [receive] via NET/Socket/0
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 01/0 : 8[0] -> 0[0] [receive] via NET/Socket/0
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Connected all trees
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Connected all trees
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Channel 00/1 : 9[1] -> 12[4] via P2P/indirect/8[0]
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 00/0 : 8[0] -> 19[3] [receive] via NET/Socket/0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 01/0 : 24[0] -> 8[0] [receive] via NET/Socket/0
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 00/0 : 19[3] -> 8[0] [send] via NET/Socket/0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 01/0 : 8[0] -> 24[0] [send] via NET/Socket/0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 00/0 : 19[3] -> 8[0] [receive] via NET/Socket/0
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 00/0 : 24[0] -> 16[0] [send] via NET/Socket/0
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 01/0 : 8[0] -> 0[0] [send] via NET/Socket/0
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Connected all trees
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 00/0 : 19[3] -> 16[0] via P2P/CUMEM
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 01/0 : 16[0] -> 11[3] [send] via NET/Socket/0
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 00/0 : 11[3] -> 8[0] via P2P/CUMEM
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 01/0 : 19[3] -> 16[0] via P2P/CUMEM
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Connected all trees
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Channel 00/1 : 1[1] -> 4[4] via P2P/indirect/0[0]
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Connected all trees
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Connected all trees
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Connected all trees
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Connected all trees
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Connected all trees
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Connected all trees
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Channel 00/1 : 25[1] -> 28[4] via P2P/indirect/24[0]
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Connected all trees
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Connected all trees
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 01/0 : 11[3] -> 8[0] via P2P/CUMEM
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Connected all trees
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Connected all trees
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Connected all trees
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Connected all trees
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Connected all trees
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Connected all trees
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Channel 00/1 : 17[1] -> 20[4] via P2P/indirect/16[0]
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Connected all trees
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Connected all trees
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Connected all trees
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Channel 00/1 : 26[2] -> 28[4] via P2P/indirect/24[0]
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Connected all trees
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Channel 00/1 : 2[2] -> 4[4] via P2P/indirect/0[0]
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Connected all trees
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Channel 00/1 : 18[2] -> 20[4] via P2P/indirect/16[0]
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Connected all trees
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Channel 00/1 : 10[2] -> 12[4] via P2P/indirect/8[0]
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Connected all trees
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Connected all trees
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Channel 00/1 : 3[3] -> 4[4] via P2P/indirect/0[0]
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/1 : 0[0] -> 5[5] via P2P/indirect/1[1]
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Channel 00/1 : 3[3] -> 5[5] via P2P/indirect/1[1]
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Connected all trees
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 00/1 : 24[0] -> 29[5] via P2P/indirect/25[1]
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Connected all trees
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Connected all trees
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 00/1 : 11[3] -> 12[4] via P2P/indirect/8[0]
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 00/1 : 8[0] -> 13[5] via P2P/indirect/9[1]
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Connected all trees
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 00/1 : 19[3] -> 20[4] via P2P/indirect/16[0]
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Connected all trees
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/1 : 16[0] -> 21[5] via P2P/indirect/17[1]
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO Channel 00/1 : 3[3] -> 6[6] via P2P/indirect/7[7]
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Connected all trees
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO threadThresholds 8/8/64 | 256/8/64 | 512 | 512
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO 2 coll channels, 0 collnet channels, 0 nvls channels, 2 p2p channels, 1 p2p channels per peer
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Channel 00/1 : 27[3] -> 28[4] via P2P/indirect/24[0]
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 00/1 : 11[3] -> 13[5] via P2P/indirect/9[1]
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 00/1 : 19[3] -> 21[5] via P2P/indirect/17[1]
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Channel 00/1 : 2[2] -> 5[5] via P2P/indirect/1[1]
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Channel 00/1 : 27[3] -> 29[5] via P2P/indirect/25[1]
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO Channel 00/1 : 27[3] -> 30[6] via P2P/indirect/31[7]
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO Channel 00/1 : 19[3] -> 22[6] via P2P/indirect/23[7]
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO Channel 00/1 : 11[3] -> 14[6] via P2P/indirect/15[7]
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Channel 00/1 : 10[2] -> 13[5] via P2P/indirect/9[1]
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Channel 00/1 : 18[2] -> 21[5] via P2P/indirect/17[1]
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO Channel 00/1 : 10[2] -> 15[7] via P2P/indirect/14[6]
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Channel 00/1 : 20[4] -> 17[1] via P2P/indirect/21[5]
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Channel 00/1 : 12[4] -> 9[1] via P2P/indirect/13[5]
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Channel 00/1 : 17[1] -> 22[6] via P2P/indirect/21[5]
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Channel 00/1 : 9[1] -> 14[6] via P2P/indirect/13[5]
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Channel 00/1 : 26[2] -> 29[5] via P2P/indirect/25[1]
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO Channel 00/1 : 18[2] -> 23[7] via P2P/indirect/22[6]
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO Channel 00/1 : 2[2] -> 7[7] via P2P/indirect/6[6]
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Channel 00/1 : 1[1] -> 6[6] via P2P/indirect/5[5]
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Channel 00/1 : 28[4] -> 25[1] via P2P/indirect/29[5]
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Channel 00/1 : 4[4] -> 1[1] via P2P/indirect/5[5]
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO Channel 00/1 : 26[2] -> 31[7] via P2P/indirect/30[6]
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Channel 00/1 : 13[5] -> 8[0] via P2P/indirect/12[4]
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 00/1 : 8[0] -> 14[6] via P2P/indirect/12[4]
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO Channel 00/1 : 9[1] -> 15[7] via P2P/indirect/11[3]
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/1 : 0[0] -> 6[6] via P2P/indirect/4[4]
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Channel 00/1 : 5[5] -> 0[0] via P2P/indirect/4[4]
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/1 : 16[0] -> 22[6] via P2P/indirect/20[4]
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Channel 00/1 : 21[5] -> 16[0] via P2P/indirect/20[4]
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Channel 00/1 : 25[1] -> 30[6] via P2P/indirect/29[5]
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO Channel 00/1 : 17[1] -> 23[7] via P2P/indirect/19[3]
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO Channel 00/1 : 1[1] -> 7[7] via P2P/indirect/3[3]
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO Channel 00/1 : 8[0] -> 15[7] via P2P/indirect/12[4]
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Channel 00/1 : 14[6] -> 8[0] via P2P/indirect/12[4]
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 00/1 : 24[0] -> 30[6] via P2P/indirect/28[4]
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Channel 00/1 : 29[5] -> 24[0] via P2P/indirect/28[4]
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO Channel 00/1 : 25[1] -> 31[7] via P2P/indirect/27[3]
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Channel 00/1 : 15[7] -> 8[0] via P2P/indirect/12[4]
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO Channel 00/1 : 0[0] -> 7[7] via P2P/indirect/4[4]
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Channel 00/1 : 6[6] -> 0[0] via P2P/indirect/4[4]
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Channel 00/1 : 15[7] -> 9[1] via P2P/indirect/13[5]
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Channel 00/1 : 22[6] -> 16[0] via P2P/indirect/20[4]
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO Channel 00/1 : 16[0] -> 23[7] via P2P/indirect/20[4]
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Channel 00/1 : 7[7] -> 0[0] via P2P/indirect/4[4]
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Channel 00/1 : 14[6] -> 9[1] via P2P/indirect/13[5]
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO Channel 00/1 : 15[7] -> 10[2] via P2P/indirect/11[3]
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Channel 00/1 : 23[7] -> 16[0] via P2P/indirect/20[4]
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Channel 00/1 : 7[7] -> 1[1] via P2P/indirect/5[5]
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO Channel 00/1 : 24[0] -> 31[7] via P2P/indirect/28[4]
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Channel 00/1 : 30[6] -> 24[0] via P2P/indirect/28[4]
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Channel 00/1 : 13[5] -> 10[2] via P2P/indirect/9[1]
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Channel 00/1 : 6[6] -> 1[1] via P2P/indirect/5[5]
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Channel 00/1 : 31[7] -> 24[0] via P2P/indirect/28[4]
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Channel 00/1 : 23[7] -> 17[1] via P2P/indirect/21[5]
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO Channel 00/1 : 7[7] -> 2[2] via P2P/indirect/3[3]
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Channel 00/1 : 22[6] -> 17[1] via P2P/indirect/21[5]
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Channel 00/1 : 5[5] -> 2[2] via P2P/indirect/1[1]
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO Channel 00/1 : 23[7] -> 18[2] via P2P/indirect/19[3]
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Channel 00/1 : 21[5] -> 18[2] via P2P/indirect/17[1]
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Channel 00/1 : 31[7] -> 25[1] via P2P/indirect/29[5]
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO Channel 00/1 : 6[6] -> 3[3] via P2P/indirect/2[2]
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO Channel 00/1 : 14[6] -> 11[3] via P2P/indirect/10[2]
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO Channel 00/1 : 22[6] -> 19[3] via P2P/indirect/18[2]
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Channel 00/1 : 30[6] -> 25[1] via P2P/indirect/29[5]
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO Channel 00/1 : 31[7] -> 26[2] via P2P/indirect/27[3]
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO Channel 00/1 : 5[5] -> 3[3] via P2P/indirect/1[1]
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Channel 00/1 : 4[4] -> 2[2] via P2P/indirect/6[6]
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Channel 00/1 : 29[5] -> 26[2] via P2P/indirect/25[1]
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO Channel 00/1 : 21[5] -> 19[3] via P2P/indirect/17[1]
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO Channel 00/1 : 13[5] -> 11[3] via P2P/indirect/9[1]
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Channel 00/1 : 20[4] -> 18[2] via P2P/indirect/22[6]
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO Channel 00/1 : 30[6] -> 27[3] via P2P/indirect/26[2]
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO Channel 00/1 : 29[5] -> 27[3] via P2P/indirect/25[1]
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO Channel 00/1 : 4[4] -> 3[3] via P2P/indirect/0[0]
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Channel 00/1 : 28[4] -> 26[2] via P2P/indirect/30[6]
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Channel 00/1 : 12[4] -> 10[2] via P2P/indirect/14[6]
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO Channel 00/1 : 20[4] -> 19[3] via P2P/indirect/16[0]
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO Channel 00/1 : 28[4] -> 27[3] via P2P/indirect/24[0]
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO Channel 00/1 : 12[4] -> 11[3] via P2P/indirect/8[0]
[1,17]<stdout>:bert-mpi-training-worker-2:22:571 [1] NCCL INFO comm 0x55e178f38c80 rank 17 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,16]<stdout>:bert-mpi-training-worker-2:21:567 [0] NCCL INFO comm 0x556941882cc0 rank 16 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,23]<stdout>:bert-mpi-training-worker-2:28:569 [7] NCCL INFO comm 0x5582a4a63380 rank 23 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,19]<stdout>:bert-mpi-training-worker-2:24:573 [3] NCCL INFO comm 0x5617488778d0 rank 19 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,21]<stdout>:bert-mpi-training-worker-2:26:570 [5] NCCL INFO comm 0x5571768f15c0 rank 21 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,20]<stdout>:bert-mpi-training-worker-2:25:574 [4] NCCL INFO comm 0x559fbed3b980 rank 20 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,18]<stdout>:bert-mpi-training-worker-2:23:572 [2] NCCL INFO comm 0x565192948c00 rank 18 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,22]<stdout>:bert-mpi-training-worker-2:27:568 [6] NCCL INFO comm 0x5648da365f80 rank 22 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,12]<stdout>:bert-mpi-training-worker-1:25:573 [4] NCCL INFO comm 0x557b453cac40 rank 12 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,8]<stdout>:bert-mpi-training-worker-1:21:568 [0] NCCL INFO comm 0x5648e3019c80 rank 8 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,10]<stdout>:bert-mpi-training-worker-1:23:574 [2] NCCL INFO comm 0x560b8538e2c0 rank 10 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,14]<stdout>:bert-mpi-training-worker-1:27:571 [6] NCCL INFO comm 0x55e5f0da8fc0 rank 14 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,13]<stdout>:bert-mpi-training-worker-1:26:569 [5] NCCL INFO comm 0x5591c7c2f340 rank 13 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,9]<stdout>:bert-mpi-training-worker-1:22:572 [1] NCCL INFO comm 0x5613f202fdc0 rank 9 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,11]<stdout>:bert-mpi-training-worker-1:24:570 [3] NCCL INFO comm 0x564656b04380 rank 11 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,15]<stdout>:bert-mpi-training-worker-1:28:567 [7] NCCL INFO comm 0x55833c92f9c0 rank 15 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,5]<stdout>:bert-mpi-training-worker-0:26:574 [5] NCCL INFO comm 0x564c64f91780 rank 5 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,1]<stdout>:bert-mpi-training-worker-0:22:572 [1] NCCL INFO comm 0x559fe7380640 rank 1 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,3]<stdout>:bert-mpi-training-worker-0:24:577 [3] NCCL INFO comm 0x55986c6b8ac0 rank 3 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,7]<stdout>:bert-mpi-training-worker-0:28:571 [7] NCCL INFO comm 0x563f66c76d80 rank 7 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,25]<stdout>:bert-mpi-training-worker-3:22:568 [1] NCCL INFO comm 0x55b4b90f7540 rank 25 nranks 32 cudaDev 1 nvmlDev 1 busId 180 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,29]<stdout>:bert-mpi-training-worker-3:26:567 [5] NCCL INFO comm 0x55768c2aaf40 rank 29 nranks 32 cudaDev 5 nvmlDev 5 busId 1c0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,30]<stdout>:bert-mpi-training-worker-3:27:569 [6] NCCL INFO comm 0x5609cdbbff40 rank 30 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,31]<stdout>:bert-mpi-training-worker-3:28:570 [7] NCCL INFO comm 0x558a2e4d59c0 rank 31 nranks 32 cudaDev 7 nvmlDev 7 busId 1e0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,27]<stdout>:bert-mpi-training-worker-3:24:573 [3] NCCL INFO comm 0x55600fb9c840 rank 27 nranks 32 cudaDev 3 nvmlDev 3 busId 1a0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,28]<stdout>:bert-mpi-training-worker-3:25:572 [4] NCCL INFO comm 0x55bdd0698c80 rank 28 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,24]<stdout>:bert-mpi-training-worker-3:21:571 [0] NCCL INFO comm 0x556aae88b300 rank 24 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,26]<stdout>:bert-mpi-training-worker-3:23:574 [2] NCCL INFO comm 0x55adee232c40 rank 26 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,2]<stdout>:bert-mpi-training-worker-0:23:576 [2] NCCL INFO comm 0x5653dbf7d100 rank 2 nranks 32 cudaDev 2 nvmlDev 2 busId 190 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,6]<stdout>:bert-mpi-training-worker-0:27:575 [6] NCCL INFO comm 0x5640d3700dc0 rank 6 nranks 32 cudaDev 6 nvmlDev 6 busId 1d0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,4]<stdout>:bert-mpi-training-worker-0:25:573 [4] NCCL INFO comm 0x5641c3f88e40 rank 4 nranks 32 cudaDev 4 nvmlDev 4 busId 1b0 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,0]<stdout>:bert-mpi-training-worker-0:21:570 [0] NCCL INFO comm 0x55bc31870680 rank 0 nranks 32 cudaDev 0 nvmlDev 0 busId 170 commId 0x837dd0976e1b4338 - Init COMPLETE
[1,31]<stdout>:Process 31 - Training time: 10.09 seconds
[1,31]<stdout>:Process 31 - Throughput: 9.91 samples/second
[1,29]<stdout>:Process 29 - Training time: 10.05 seconds
[1,29]<stdout>:Process 29 - Throughput: 9.95 samples/second
[1,28]<stdout>:Process 28 - Training time: 10.09 seconds
[1,28]<stdout>:Process 28 - Throughput: 9.91 samples/second
[1,25]<stdout>:Process 25 - Training time: 10.04 seconds
[1,25]<stdout>:Process 25 - Throughput: 9.96 samples/second
[1,27]<stdout>:Process 27 - Training time: 10.10 seconds
[1,27]<stdout>:Process 27 - Throughput: 9.90 samples/second
[1,20]<stdout>:Process 20 - Training time: 10.09 seconds
[1,20]<stdout>:Process 20 - Throughput: 9.91 samples/second
[1,3]<stdout>:Process 3 - Training time: 10.07 seconds
[1,3]<stdout>:Process 3 - Throughput: 9.93 samples/second
[1,0]<stdout>:Process 0 - Training time: 10.03 seconds
[1,0]<stdout>:Process 0 - Throughput: 9.97 samples/second
[1,23]<stdout>:Process 23 - Training time: 10.04 seconds
[1,23]<stdout>:Process 23 - Throughput: 9.96 samples/second
[1,24]<stdout>:Process 24 - Training time: 10.10 seconds
[1,24]<stdout>:Process 24 - Throughput: 9.90 samples/second
[1,2]<stdout>:Process 2 - Training time: 10.14 seconds
[1,2]<stdout>:Process 2 - Throughput: 9.86 samples/second
[1,5]<stdout>:Process 5 - Training time: 10.08 seconds
[1,5]<stdout>:Process 5 - Throughput: 9.92 samples/second
[1,21]<stdout>:Process 21 - Training time: 10.08 seconds
[1,21]<stdout>:Process 21 - Throughput: 9.92 samples/second
[1,22]<stdout>:Process 22 - Training time: 10.07 seconds
[1,22]<stdout>:Process 22 - Throughput: 9.93 samples/second
[1,30]<stdout>:Process 30 - Training time: 10.09 seconds
[1,30]<stdout>:Process 30 - Throughput: 9.91 samples/second
[1,1]<stdout>:Process 1 - Training time: 10.07 seconds
[1,1]<stdout>:Process 1 - Throughput: 9.93 samples/second
[1,17]<stdout>:Process 17 - Training time: 10.11 seconds
[1,17]<stdout>:Process 17 - Throughput: 9.89 samples/second
[1,12]<stdout>:Process 12 - Training time: 10.01 seconds
[1,12]<stdout>:Process 12 - Throughput: 9.99 samples/second
[1,6]<stdout>:Process 6 - Training time: 10.04 seconds
[1,6]<stdout>:Process 6 - Throughput: 9.96 samples/second
[1,18]<stdout>:Process 18 - Training time: 10.12 seconds
[1,18]<stdout>:Process 18 - Throughput: 9.88 samples/second
[1,7]<stdout>:Process 7 - Training time: 10.11 seconds
[1,7]<stdout>:Process 7 - Throughput: 9.89 samples/second
[1,15]<stdout>:Process 15 - Training time: 10.14 seconds
[1,15]<stdout>:Process 15 - Throughput: 9.86 samples/second
[1,19]<stdout>:Process 19 - Training time: 10.12 seconds
[1,19]<stdout>:Process 19 - Throughput: 9.89 samples/second
[1,14]<stdout>:Process 14 - Training time: 9.96 seconds
[1,14]<stdout>:Process 14 - Throughput: 10.04 samples/second
[1,13]<stdout>:Process 13 - Training time: 10.05 seconds
[1,13]<stdout>:Process 13 - Throughput: 9.95 samples/second
[1,16]<stdout>:Process 16 - Training time: 10.10 seconds
[1,16]<stdout>:Process 16 - Throughput: 9.90 samples/second
[1,26]<stdout>:Process 26 - Training time: 10.11 seconds
[1,26]<stdout>:Process 26 - Throughput: 9.89 samples/second
[1,10]<stdout>:Process 10 - Training time: 10.12 seconds
[1,10]<stdout>:Process 10 - Throughput: 9.88 samples/second
[1,11]<stdout>:Process 11 - Training time: 10.10 seconds
[1,11]<stdout>:Process 11 - Throughput: 9.90 samples/second
[1,8]<stdout>:Process 8 - Training time: 10.09 seconds
[1,8]<stdout>:Process 8 - Throughput: 9.91 samples/second
[1,4]<stdout>:Process 4 - Training time: 10.05 seconds
[1,4]<stdout>:Process 4 - Throughput: 9.95 samples/second
[1,9]<stdout>:Process 9 - Training time: 10.08 seconds
[1,9]<stdout>:Process 9 - Throughput: 9.92 samples/second
[1,21]<stdout>:bert-mpi-training-worker-2:26:576 [5] NCCL INFO [Service thread] Connection closed by localRank 7
[1,23]<stdout>:bert-mpi-training-worker-2:28:581 [7] NCCL INFO [Service thread] Connection closed by localRank 7
[1,19]<stdout>:bert-mpi-training-worker-2:24:583 [3] NCCL INFO [Service thread] Connection closed by localRank 7
[1,20]<stdout>:bert-mpi-training-worker-2:25:578 [4] NCCL INFO [Service thread] Connection closed by localRank 7
[1,20]<stdout>:bert-mpi-training-worker-2:25:578 [4] NCCL INFO [Service thread] Connection closed by localRank 5
[1,17]<stdout>:bert-mpi-training-worker-2:22:587 [1] NCCL INFO [Service thread] Connection closed by localRank 5
[1,21]<stdout>:bert-mpi-training-worker-2:26:576 [5] NCCL INFO [Service thread] Connection closed by localRank 5
[1,18]<stdout>:bert-mpi-training-worker-2:23:584 [2] NCCL INFO [Service thread] Connection closed by localRank 6
[1,21]<stdout>:bert-mpi-training-worker-2:26:576 [5] NCCL INFO [Service thread] Connection closed by localRank 6
[1,20]<stdout>:bert-mpi-training-worker-2:25:578 [4] NCCL INFO [Service thread] Connection closed by localRank 6
[1,22]<stdout>:bert-mpi-training-worker-2:27:575 [6] NCCL INFO [Service thread] Connection closed by localRank 6
[1,16]<stdout>:bert-mpi-training-worker-2:21:588 [0] NCCL INFO [Service thread] Connection closed by localRank 2
[1,17]<stdout>:bert-mpi-training-worker-2:22:587 [1] NCCL INFO [Service thread] Connection closed by localRank 2
[1,18]<stdout>:bert-mpi-training-worker-2:23:584 [2] NCCL INFO [Service thread] Connection closed by localRank 2
[1,22]<stdout>:bert-mpi-training-worker-2:27:575 [6] NCCL INFO [Service thread] Connection closed by localRank 2
[1,16]<stdout>:bert-mpi-training-worker-2:21:588 [0] NCCL INFO [Service thread] Connection closed by localRank 1
[1,21]<stdout>:bert-mpi-training-worker-2:26:576 [5] NCCL INFO [Service thread] Connection closed by localRank 1
[1,19]<stdout>:bert-mpi-training-worker-2:24:583 [3] NCCL INFO [Service thread] Connection closed by localRank 1
[1,17]<stdout>:bert-mpi-training-worker-2:22:587 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[1,16]<stdout>:bert-mpi-training-worker-2:21:588 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[1,17]<stdout>:bert-mpi-training-worker-2:22:587 [1] NCCL INFO [Service thread] Connection closed by localRank 3
[1,19]<stdout>:bert-mpi-training-worker-2:24:583 [3] NCCL INFO [Service thread] Connection closed by localRank 3
[1,23]<stdout>:bert-mpi-training-worker-2:28:581 [7] NCCL INFO [Service thread] Connection closed by localRank 3
[1,18]<stdout>:bert-mpi-training-worker-2:23:668 [0] NCCL INFO comm 0x565192948c00 rank 18 nranks 32 cudaDev 2 busId 190 - Abort COMPLETE
[1,23]<stdout>:bert-mpi-training-worker-2:28:664 [0] NCCL INFO comm 0x5582a4a63380 rank 23 nranks 32 cudaDev 7 busId 1e0 - Abort COMPLETE
[1,19]<stdout>:bert-mpi-training-worker-2:24:669 [0] NCCL INFO comm 0x5617488778d0 rank 19 nranks 32 cudaDev 3 busId 1a0 - Abort COMPLETE
[1,0]<stdout>:bert-mpi-training-worker-0:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[1,1]<stdout>:bert-mpi-training-worker-0:22:582 [1] NCCL INFO [Service thread] Connection closed by localRank 3
[1,3]<stdout>:bert-mpi-training-worker-0:24:591 [3] NCCL INFO [Service thread] Connection closed by localRank 3
[1,7]<stdout>:bert-mpi-training-worker-0:28:578 [7] NCCL INFO [Service thread] Connection closed by localRank 3
[1,0]<stdout>:bert-mpi-training-worker-0:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 1
[1,3]<stdout>:bert-mpi-training-worker-0:24:591 [3] NCCL INFO [Service thread] Connection closed by localRank 1
[1,5]<stdout>:bert-mpi-training-worker-0:26:585 [5] NCCL INFO [Service thread] Connection closed by localRank 1
[1,1]<stdout>:bert-mpi-training-worker-0:22:582 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[1,2]<stdout>:bert-mpi-training-worker-0:23:584 [2] NCCL INFO [Service thread] Connection closed by localRank 6
[1,4]<stdout>:bert-mpi-training-worker-0:25:589 [4] NCCL INFO [Service thread] Connection closed by localRank 6
[1,5]<stdout>:bert-mpi-training-worker-0:26:585 [5] NCCL INFO [Service thread] Connection closed by localRank 6
[1,6]<stdout>:bert-mpi-training-worker-0:27:579 [6] NCCL INFO [Service thread] Connection closed by localRank 6
[1,4]<stdout>:bert-mpi-training-worker-0:25:589 [4] NCCL INFO [Service thread] Connection closed by localRank 5
[1,1]<stdout>:bert-mpi-training-worker-0:22:582 [1] NCCL INFO [Service thread] Connection closed by localRank 5
[1,5]<stdout>:bert-mpi-training-worker-0:26:585 [5] NCCL INFO [Service thread] Connection closed by localRank 5
[1,1]<stdout>:bert-mpi-training-worker-0:22:582 [1] NCCL INFO [Service thread] Connection closed by localRank 2
[1,0]<stdout>:bert-mpi-training-worker-0:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 2
[1,2]<stdout>:bert-mpi-training-worker-0:23:584 [2] NCCL INFO [Service thread] Connection closed by localRank 2
[1,6]<stdout>:bert-mpi-training-worker-0:27:579 [6] NCCL INFO [Service thread] Connection closed by localRank 2
[1,3]<stdout>:bert-mpi-training-worker-0:24:591 [3] NCCL INFO [Service thread] Connection closed by localRank 7
[1,4]<stdout>:bert-mpi-training-worker-0:25:589 [4] NCCL INFO [Service thread] Connection closed by localRank 7
[1,5]<stdout>:bert-mpi-training-worker-0:26:585 [5] NCCL INFO [Service thread] Connection closed by localRank 7
[1,7]<stdout>:bert-mpi-training-worker-0:28:578 [7] NCCL INFO [Service thread] Connection closed by localRank 7
[1,12]<stdout>:bert-mpi-training-worker-1:25:582 [4] NCCL INFO [Service thread] Connection closed by localRank 5
[1,13]<stdout>:bert-mpi-training-worker-1:26:578 [5] NCCL INFO [Service thread] Connection closed by localRank 5
[1,9]<stdout>:bert-mpi-training-worker-1:22:585 [1] NCCL INFO [Service thread] Connection closed by localRank 5
[1,29]<stdout>:bert-mpi-training-worker-3:26:577 [5] NCCL INFO [Service thread] Connection closed by localRank 7
[1,28]<stdout>:bert-mpi-training-worker-3:25:583 [4] NCCL INFO [Service thread] Connection closed by localRank 7
[1,31]<stdout>:bert-mpi-training-worker-3:28:576 [7] NCCL INFO [Service thread] Connection closed by localRank 7
[1,27]<stdout>:bert-mpi-training-worker-3:24:578 [3] NCCL INFO [Service thread] Connection closed by localRank 7
[1,13]<stdout>:bert-mpi-training-worker-1:26:578 [5] NCCL INFO [Service thread] Connection closed by localRank 6
[1,12]<stdout>:bert-mpi-training-worker-1:25:582 [4] NCCL INFO [Service thread] Connection closed by localRank 6
[1,10]<stdout>:bert-mpi-training-worker-1:23:580 [2] NCCL INFO [Service thread] Connection closed by localRank 6
[1,14]<stdout>:bert-mpi-training-worker-1:27:577 [6] NCCL INFO [Service thread] Connection closed by localRank 6
[1,26]<stdout>:bert-mpi-training-worker-3:23:579 [2] NCCL INFO [Service thread] Connection closed by localRank 6
[1,29]<stdout>:bert-mpi-training-worker-3:26:577 [5] NCCL INFO [Service thread] Connection closed by localRank 6
[1,28]<stdout>:bert-mpi-training-worker-3:25:583 [4] NCCL INFO [Service thread] Connection closed by localRank 6
[1,30]<stdout>:bert-mpi-training-worker-3:27:575 [6] NCCL INFO [Service thread] Connection closed by localRank 6
[1,28]<stdout>:bert-mpi-training-worker-3:25:583 [4] NCCL INFO [Service thread] Connection closed by localRank 5
[1,25]<stdout>:bert-mpi-training-worker-3:22:584 [1] NCCL INFO [Service thread] Connection closed by localRank 5
[1,29]<stdout>:bert-mpi-training-worker-3:26:577 [5] NCCL INFO [Service thread] Connection closed by localRank 5
[1,8]<stdout>:bert-mpi-training-worker-1:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 2
[1,9]<stdout>:bert-mpi-training-worker-1:22:585 [1] NCCL INFO [Service thread] Connection closed by localRank 2
[1,14]<stdout>:bert-mpi-training-worker-1:27:577 [6] NCCL INFO [Service thread] Connection closed by localRank 2
[1,10]<stdout>:bert-mpi-training-worker-1:23:580 [2] NCCL INFO [Service thread] Connection closed by localRank 2
[1,13]<stdout>:bert-mpi-training-worker-1:26:578 [5] NCCL INFO [Service thread] Connection closed by localRank 7
[1,12]<stdout>:bert-mpi-training-worker-1:25:582 [4] NCCL INFO [Service thread] Connection closed by localRank 7
[1,11]<stdout>:bert-mpi-training-worker-1:24:579 [3] NCCL INFO [Service thread] Connection closed by localRank 7
[1,15]<stdout>:bert-mpi-training-worker-1:28:575 [7] NCCL INFO [Service thread] Connection closed by localRank 7
[1,8]<stdout>:bert-mpi-training-worker-1:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 1
[1,9]<stdout>:bert-mpi-training-worker-1:22:585 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[1,13]<stdout>:bert-mpi-training-worker-1:26:578 [5] NCCL INFO [Service thread] Connection closed by localRank 1
[1,11]<stdout>:bert-mpi-training-worker-1:24:579 [3] NCCL INFO [Service thread] Connection closed by localRank 1
[1,2]<stdout>:bert-mpi-training-worker-0:23:668 [0] NCCL INFO comm 0x5653dbf7d100 rank 2 nranks 32 cudaDev 2 busId 190 - Abort COMPLETE
[1,7]<stdout>:bert-mpi-training-worker-0:28:672 [0] NCCL INFO comm 0x563f66c76d80 rank 7 nranks 32 cudaDev 7 busId 1e0 - Abort COMPLETE
[1,25]<stdout>:bert-mpi-training-worker-3:22:584 [1] NCCL INFO [Service thread] Connection closed by localRank 2
[1,24]<stdout>:bert-mpi-training-worker-3:21:581 [0] NCCL INFO [Service thread] Connection closed by localRank 2
[1,26]<stdout>:bert-mpi-training-worker-3:23:579 [2] NCCL INFO [Service thread] Connection closed by localRank 2
[1,30]<stdout>:bert-mpi-training-worker-3:27:575 [6] NCCL INFO [Service thread] Connection closed by localRank 2
[1,3]<stdout>:bert-mpi-training-worker-0:24:666 [0] NCCL INFO comm 0x55986c6b8ac0 rank 3 nranks 32 cudaDev 3 busId 1a0 - Abort COMPLETE
[1,24]<stdout>:bert-mpi-training-worker-3:21:581 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[1,25]<stdout>:bert-mpi-training-worker-3:22:584 [1] NCCL INFO [Service thread] Connection closed by localRank 3
[1,31]<stdout>:bert-mpi-training-worker-3:28:576 [7] NCCL INFO [Service thread] Connection closed by localRank 3
[1,27]<stdout>:bert-mpi-training-worker-3:24:578 [3] NCCL INFO [Service thread] Connection closed by localRank 3
[1,8]<stdout>:bert-mpi-training-worker-1:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 3
[1,9]<stdout>:bert-mpi-training-worker-1:22:585 [1] NCCL INFO [Service thread] Connection closed by localRank 3
[1,15]<stdout>:bert-mpi-training-worker-1:28:575 [7] NCCL INFO [Service thread] Connection closed by localRank 3
[1,11]<stdout>:bert-mpi-training-worker-1:24:579 [3] NCCL INFO [Service thread] Connection closed by localRank 3
[1,10]<stdout>:bert-mpi-training-worker-1:23:667 [0] NCCL INFO comm 0x560b8538e2c0 rank 10 nranks 32 cudaDev 2 busId 190 - Abort COMPLETE
[1,24]<stdout>:bert-mpi-training-worker-3:21:581 [0] NCCL INFO [Service thread] Connection closed by localRank 1
[1,27]<stdout>:bert-mpi-training-worker-3:24:578 [3] NCCL INFO [Service thread] Connection closed by localRank 1
[1,25]<stdout>:bert-mpi-training-worker-3:22:584 [1] NCCL INFO [Service thread] Connection closed by localRank 1
[1,29]<stdout>:bert-mpi-training-worker-3:26:577 [5] NCCL INFO [Service thread] Connection closed by localRank 1
[1,26]<stdout>:bert-mpi-training-worker-3:23:670 [0] NCCL INFO comm 0x55adee232c40 rank 26 nranks 32 cudaDev 2 busId 190 - Abort COMPLETE
[1,31]<stdout>:bert-mpi-training-worker-3:28:663 [0] NCCL INFO comm 0x558a2e4d59c0 rank 31 nranks 32 cudaDev 7 busId 1e0 - Abort COMPLETE
[1,15]<stdout>:bert-mpi-training-worker-1:28:664 [0] NCCL INFO comm 0x55833c92f9c0 rank 15 nranks 32 cudaDev 7 busId 1e0 - Abort COMPLETE
[1,11]<stdout>:bert-mpi-training-worker-1:24:668 [0] NCCL INFO comm 0x564656b04380 rank 11 nranks 32 cudaDev 3 busId 1a0 - Abort COMPLETE
[1,27]<stdout>:bert-mpi-training-worker-3:24:667 [0] NCCL INFO comm 0x55600fb9c840 rank 27 nranks 32 cudaDev 3 busId 1a0 - Abort COMPLETE
[1,8]<stdout>:bert-mpi-training-worker-1:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[1,9]<stdout>:bert-mpi-training-worker-1:22:585 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[1,12]<stdout>:bert-mpi-training-worker-1:25:582 [4] NCCL INFO [Service thread] Connection closed by localRank 0
[1,9]<stdout>:bert-mpi-training-worker-1:22:670 [0] NCCL INFO comm 0x5613f202fdc0 rank 9 nranks 32 cudaDev 1 busId 180 - Abort COMPLETE
[1,24]<stdout>:bert-mpi-training-worker-3:21:581 [0] NCCL INFO [Service thread] Connection closed by localRank 4
[1,28]<stdout>:bert-mpi-training-worker-3:25:583 [4] NCCL INFO [Service thread] Connection closed by localRank 4
[1,30]<stdout>:bert-mpi-training-worker-3:27:575 [6] NCCL INFO [Service thread] Connection closed by localRank 4
[1,29]<stdout>:bert-mpi-training-worker-3:26:577 [5] NCCL INFO [Service thread] Connection closed by localRank 4
[1,30]<stdout>:bert-mpi-training-worker-3:27:669 [0] NCCL INFO comm 0x5609cdbbff40 rank 30 nranks 32 cudaDev 6 busId 1d0 - Abort COMPLETE
[1,29]<stdout>:bert-mpi-training-worker-3:26:664 [0] NCCL INFO comm 0x55768c2aaf40 rank 29 nranks 32 cudaDev 5 busId 1c0 - Abort COMPLETE
[1,16]<stdout>:bert-mpi-training-worker-2:21:588 [0] NCCL INFO [Service thread] Connection closed by localRank 4
[1,21]<stdout>:bert-mpi-training-worker-2:26:576 [5] NCCL INFO [Service thread] Connection closed by localRank 4
[1,22]<stdout>:bert-mpi-training-worker-2:27:575 [6] NCCL INFO [Service thread] Connection closed by localRank 4
[1,20]<stdout>:bert-mpi-training-worker-2:25:578 [4] NCCL INFO [Service thread] Connection closed by localRank 4
[1,22]<stdout>:bert-mpi-training-worker-2:27:666 [0] NCCL INFO comm 0x5648da365f80 rank 22 nranks 32 cudaDev 6 busId 1d0 - Abort COMPLETE
[1,21]<stdout>:bert-mpi-training-worker-2:26:665 [0] NCCL INFO comm 0x5571768f15c0 rank 21 nranks 32 cudaDev 5 busId 1c0 - Abort COMPLETE
[1,16]<stdout>:bert-mpi-training-worker-2:21:588 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[1,20]<stdout>:bert-mpi-training-worker-2:25:578 [4] NCCL INFO [Service thread] Connection closed by localRank 0
[1,17]<stdout>:bert-mpi-training-worker-2:22:587 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[1,17]<stdout>:bert-mpi-training-worker-2:22:667 [0] NCCL INFO comm 0x55e178f38c80 rank 17 nranks 32 cudaDev 1 busId 180 - Abort COMPLETE
[1,20]<stdout>:bert-mpi-training-worker-2:25:663 [0] NCCL INFO comm 0x559fbed3b980 rank 20 nranks 32 cudaDev 4 busId 1b0 - Abort COMPLETE
[1,16]<stdout>:bert-mpi-training-worker-2:21:670 [0] NCCL INFO comm 0x556941882cc0 rank 16 nranks 32 cudaDev 0 busId 170 - Abort COMPLETE
[1,24]<stdout>:bert-mpi-training-worker-3:21:581 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[1,28]<stdout>:bert-mpi-training-worker-3:25:583 [4] NCCL INFO [Service thread] Connection closed by localRank 0
[1,25]<stdout>:bert-mpi-training-worker-3:22:584 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[1,25]<stdout>:bert-mpi-training-worker-3:22:666 [0] NCCL INFO comm 0x55b4b90f7540 rank 25 nranks 32 cudaDev 1 busId 180 - Abort COMPLETE
[1,28]<stdout>:bert-mpi-training-worker-3:25:665 [0] NCCL INFO comm 0x55bdd0698c80 rank 28 nranks 32 cudaDev 4 busId 1b0 - Abort COMPLETE
[1,24]<stdout>:bert-mpi-training-worker-3:21:668 [0] NCCL INFO comm 0x556aae88b300 rank 24 nranks 32 cudaDev 0 busId 170 - Abort COMPLETE
[1,8]<stdout>:bert-mpi-training-worker-1:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 4
[1,14]<stdout>:bert-mpi-training-worker-1:27:577 [6] NCCL INFO [Service thread] Connection closed by localRank 4
[1,12]<stdout>:bert-mpi-training-worker-1:25:582 [4] NCCL INFO [Service thread] Connection closed by localRank 4
[1,13]<stdout>:bert-mpi-training-worker-1:26:578 [5] NCCL INFO [Service thread] Connection closed by localRank 4
[1,14]<stdout>:bert-mpi-training-worker-1:27:665 [0] NCCL INFO comm 0x55e5f0da8fc0 rank 14 nranks 32 cudaDev 6 busId 1d0 - Abort COMPLETE
[1,13]<stdout>:bert-mpi-training-worker-1:26:666 [0] NCCL INFO comm 0x5591c7c2f340 rank 13 nranks 32 cudaDev 5 busId 1c0 - Abort COMPLETE
[1,12]<stdout>:bert-mpi-training-worker-1:25:663 [0] NCCL INFO comm 0x557b453cac40 rank 12 nranks 32 cudaDev 4 busId 1b0 - Abort COMPLETE
[1,8]<stdout>:bert-mpi-training-worker-1:21:669 [0] NCCL INFO comm 0x5648e3019c80 rank 8 nranks 32 cudaDev 0 busId 170 - Abort COMPLETE
[1,1]<stdout>:bert-mpi-training-worker-0:22:582 [1] NCCL INFO [Service thread] Connection closed by localRank 0
[1,0]<stdout>:bert-mpi-training-worker-0:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[1,4]<stdout>:bert-mpi-training-worker-0:25:589 [4] NCCL INFO [Service thread] Connection closed by localRank 0
[1,1]<stdout>:bert-mpi-training-worker-0:22:670 [0] NCCL INFO comm 0x559fe7380640 rank 1 nranks 32 cudaDev 1 busId 180 - Abort COMPLETE
[1,0]<stdout>:bert-mpi-training-worker-0:21:583 [0] NCCL INFO [Service thread] Connection closed by localRank 4
[1,6]<stdout>:bert-mpi-training-worker-0:27:579 [6] NCCL INFO [Service thread] Connection closed by localRank 4
[1,5]<stdout>:bert-mpi-training-worker-0:26:585 [5] NCCL INFO [Service thread] Connection closed by localRank 4
[1,4]<stdout>:bert-mpi-training-worker-0:25:589 [4] NCCL INFO [Service thread] Connection closed by localRank 4
[1,6]<stdout>:bert-mpi-training-worker-0:27:671 [0] NCCL INFO comm 0x5640d3700dc0 rank 6 nranks 32 cudaDev 6 busId 1d0 - Abort COMPLETE
[1,5]<stdout>:bert-mpi-training-worker-0:26:669 [0] NCCL INFO comm 0x564c64f91780 rank 5 nranks 32 cudaDev 5 busId 1c0 - Abort COMPLETE
[1,4]<stdout>:bert-mpi-training-worker-0:25:673 [0] NCCL INFO comm 0x5641c3f88e40 rank 4 nranks 32 cudaDev 4 busId 1b0 - Abort COMPLETE
[1,0]<stdout>:bert-mpi-training-worker-0:21:667 [0] NCCL INFO comm 0x55bc31870680 rank 0 nranks 32 cudaDev 0 busId 170 - Abort COMPLETE

@mattcjo mattcjo marked this pull request as ready for review July 19, 2024 21:15
Comment on lines 123 to 124
# TODO: Consider parameterizing for nodes of any GPU count
num_gpus_per_node = 8 # Adjust this based on your setup
Contributor

We need to do this now, right, in case we want to run a job with a different instance type entirely? I guess the testing instances have had the same GPU capacity so far.

Contributor Author

Yeah, at the moment all instance types requested have 8 GPUs. I don't really see us ever running the training test on any instance type with fewer than 8 GPUs, unless EC2 suddenly changes their pattern for instance configurations.

Contributor Author

It would future-proof things, but it would also add a touch more complexity and another point of failure if it has to be configured at runtime.

Contributor Author

@ndbaker1 I'd actually probably advocate taking out the TODO comment and just leaving it hard-coded. Thoughts?

Contributor

I was thinking we could just add an

ENV GPUS_PER_NODE=8

in the Dockerfile, and it wouldn't complicate things much, right?

Contributor Author

Yeah, that works. I'm thinking further downstream (upstream..?) about parameterization of the manifest and collection of the number of GPUs on a node by our Go test.

Contributor Author
@mattcjo mattcjo Jul 23, 2024

So it would really need to be something like

ARG GPUS_PER_NODE=8

unless I'm misunderstanding something.
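
For context, a minimal sketch of how the two directives might fit together (hypothetical, not necessarily the exact change merged here): ARG is only visible while the image is being built, so it has to be re-exported as an ENV for train.py to see the value at run time.

# Build-time default; can be overridden with
#   docker build --build-arg GPUS_PER_NODE=4 ...
ARG GPUS_PER_NODE=8

# A plain ARG is not visible once the container runs, so promote the
# resolved value to a runtime environment variable for train.py.
ENV GPUS_PER_NODE=${GPUS_PER_NODE}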

Contributor Author

@ndbaker1 Made the change, verified with local testing
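
A minimal sketch of how train.py might consume the variable, assuming the GPUS_PER_NODE name discussed above; the fallback of 8 mirrors the previously hard-coded value:

import os

# Read the per-node GPU count from the environment, falling back to 8
# (the p3.16xlarge layout) when the variable is unset.
num_gpus_per_node = int(os.environ.get("GPUS_PER_NODE", "8"))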

@ndbaker1 ndbaker1 merged commit b133519 into aws:main Aug 1, 2024
6 checks passed