Add docker image for BERT e2e training task #454
Conversation
.github/workflows/ci.yaml (outdated)
steps:
  - uses: actions/checkout@v3
  - run: docker build --file e2e2/test/images/bert-training/Dockerfile.bert-training e2e2/test/images/bert-training
An idea for the future (since there will be more to come): maybe we standardize an images/XXX/{Dockerfile, ...} structure for all our images and then create a job matrix for the test image builds.
os.environ['MASTER_ADDR'] = os.environ['MASTER_ADDR']  # Kubernetes sets this
os.environ['MASTER_PORT'] = os.environ['MASTER_PORT']  # Kubernetes sets this
move this to docker ENV
with defaults
What are your thoughts on just setting the defaults here in the script (I should have done this in the first place) versus in the dockerfile?
The default values should only be used during local dev, which is why I could see them belonging in the training script. Kubernetes should set them at runtime.
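A minimal sketch of what script-side defaults could look like (the fallback values here are assumptions for local development; Kubernetes would override them through the pod spec):

```python
import os

# Assumed local-dev fallbacks; in-cluster, Kubernetes injects the real values.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "12355")
```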
"Kubernetes should set them at runtime."

Meaning it's part of the container spec, right? Those envs should override what's in the docker ENV, so I think it generally makes sense to put it in the Dockerfile/image definition.
Yep, it would be in the container spec. Works for me, I'll make the change.
@ndbaker1 Done
print(f"Process {rank} - Training time: {training_time:.2f} seconds") | ||
print(f"Process {rank} - Throughput: {throughput:.2f} samples/second") |
Do we need to dump this output to disk so we can use it to upload to S3?
Potentially... I was also considering writing directly to S3, but was curious to hear others' perspectives. My intuition says writing to S3 is the long-term solution (once a stable schema is solidified), but in the short term just doing something like writing to disk or stdout might be the way to go.
I agree it can go to S3, CloudWatch, etc. once we know where this is going. We should definitely have this output printed as well, though.
It should be fine to dump to disk for the short term, and you have enough flexibility to POC different long-term destinations.
@Issacwww Are there any concerns/considerations with writing from the container to the host machine?
Oh, good callout. This runs on the tod worker, so dumping to disk is no different from stdout... stdout should be fine for now.
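Since the metrics are staying on stdout for now, one way to keep them easy to ship to S3 or CloudWatch later is to print a single structured record. This is just a sketch, and the field names are made up for illustration:

```python
import json

# Placeholder values; in train.py these come from the timing code shown above.
rank, training_time, throughput = 0, 123.45, 678.90

# Hypothetical metrics record; the field names are illustrative, not a settled schema.
metrics = {
    "rank": rank,
    "training_time_seconds": round(training_time, 2),
    "throughput_samples_per_second": round(throughput, 2),
}

# A single JSON line on stdout is easy to read now and to redirect to S3/CloudWatch later.
print(json.dumps(metrics))
```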
start_time = time.time()

for epoch in range(1): # Short run for testing
Should we let the program read the epoch count from an environment variable or argument? That way we could allow larger instances (e.g. p5) to run more epochs without changing the code.
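If we went that route, the change could be as small as the following sketch (EPOCHS is a hypothetical variable name, and the default of 1 preserves today's short run):

```python
import os

# Hypothetical: EPOCHS would come from the pod spec or a CLI argument;
# the default keeps the current single-epoch test run.
num_epochs = int(os.environ.get("EPOCHS", "1"))

for epoch in range(num_epochs):
    ...  # training loop body unchanged
```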
What would be the purpose of having more epochs for larger instance sizes? Are you thinking about it purely from the perspective of wanting the tests to last the same amount of time for each instance type?
Just some random thoughts. I was thinking we could run more epochs for larger instances to get more accurate performance data. Additionally, we could reuse this code for our future long-running tests (like soak tests).
Gotcha... Yeah, I certainly appreciate the idea behind reusability, but there's a good chance this current test isn't the best option for a soak test anyway.
As far as more epochs for larger instance types go, it depends on what your end goal is. For the tests we're running, and the metrics we're looking to gather, I don't see any benefit in doing this at this time.
Had a discussion with @cartermckinnon. I think we could reuse the e2e2/test/images/nvidia/Dockerfile, so we don't need to maintain multiple images and Dockerfiles.
I mean that's fine from a base image perspective, since many of the dependencies will be shared among test types (i.e. unit/training/inference), but training and inference will both require unique dependencies on top of what's included in e2e2/test/images/nvidia/Dockerfile. The dependencies between training and inference might be the same at this point, but this could very well change in the future.
Can we add those unique dependencies to the e2e2/test/images/nvidia/Dockerfile?
@cartermckinnon I agree though that further thought needs to be put into the test directory structure before we go too much further. I'm not sure how many more tests we're looking to add, but the current approach doesn't scale particularly well.
@weicongw I mean, sure, we can, but then you're adding another ~7GB of deps to that image, which are totally unnecessary for the unit tests. Also, if we ever add another test (e.g. ResNet), it could very well have its own unique dependencies as well. This will especially be true if we ever want to validate frameworks other than the ones currently being utilized.
A few major updates were made with the last few commits. The GPU was incorrectly being assigned based on the process's world rank rather than its local rank, which led to a failure when trying to run on a multi-node cluster. I've rectified the problem, and the script now runs successfully on a multi-node cluster. Here's the output from a successful workload run on a cluster of 4 nodes (collapsed output omitted).
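For context, the core of the fix is roughly the following sketch (the GPUS_PER_NODE variable and the defaults are assumptions for illustration, not the literal train.py code):

```python
import os
import torch

# Sketch of the fix described above: pin each process to a GPU by its rank
# *within the node* (local rank), not its global world rank across the cluster.
num_gpus_per_node = int(os.environ.get("GPUS_PER_NODE", "8"))  # assumed env var and default
world_rank = int(os.environ.get("RANK", "0"))                  # global rank across all processes
local_rank = world_rank % num_gpus_per_node                    # rank within this node
torch.cuda.set_device(local_rank)
device = torch.device(f"cuda:{local_rank}")
```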
# TODO: Consider parameterizing for nodes of any GPU count
num_gpus_per_node = 8 # Adjust this based on your setup
We need to do this now, right, in case we want to call a job with a different instance type entirely? I guess the testing instances have had the same capacity so far.
Yeah, at the moment all instance types requested have 8 GPUs. I don't really ever see us running the training test on any instance type with less than 8 GPUs, unless EC2 suddenly changes their pattern for instance configurations.
It would future-proof things, but it would also add a touch more complexity and another point of failure if we have to configure it at runtime.
@ndbaker1 I'd actually probably advocate for taking out the TODO comment and just leaving it hard-coded. Thoughts?
Was thinking we could just add an ENV GPUS_PER_NODE=8 in the Dockerfile, and it wouldn't complicate much, right?
Yeah, that works. I'm thinking further downstream (upstream..?) for parameterization of the manifest, and collection of the number of GPUs on a node by our Go test.
So it would really need to be something like ARG GPUS_PER_NODE=8, unless I'm misunderstanding something.
@ndbaker1 Made the change, verified with local testing
Issue #, if available:
Description of changes:
A distributed training script (e2e2/test/images/bert-training/train.py) has been added, along with its dependencies (e2e2/test/images/bert-training/requirements.txt), in a new Dockerfile (e2e2/test/images/bert-training/Dockerfile.bert-training). Building the Dockerfile will produce an image that runs a distributed BERT training job.
The testing of the docker image took place on a p3.16xlarge instance utilizing the AMI ami-05e885690ca33b527. The goal of the image is to start an isolated training process per GPU, with communication between the processes to consolidate the weights from each one. The test is run for a single epoch. The results show that the docker image starts up and executes the distributed BERT training job as expected.
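For reference, the process-per-GPU pattern with weight consolidation described above generally looks like the following sketch (the function and variable names are illustrative, not the exact contents of train.py):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_distributed_training(model: torch.nn.Module) -> DDP:
    # One process per GPU; the NCCL backend all-reduces gradients so every
    # process ends up with the same consolidated weights after each step.
    dist.init_process_group(backend="nccl")  # reads MASTER_ADDR/MASTER_PORT from the environment
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    return DDP(model.to(f"cuda:{local_rank}"), device_ids=[local_rank])
```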
Also included in this PR is a new GitHub Action to verify that the docker image builds successfully.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.