Latest SkyPilot image does not support Azure accelerated networking and NCCL #4448

Open
visatish opened this issue Dec 8, 2024 · 3 comments

visatish commented Dec 8, 2024

Enabling Azure accelerated networking with the latest SkyPilot image breaks the NCCL test.

Enabling accelerated networking:

diff --git a/sky/provision/azure/instance.py b/sky/provision/azure/instance.py
index 60159232..6c4df022 100644
--- a/sky/provision/azure/instance.py
+++ b/sky/provision/azure/instance.py
@@ -239,7 +239,8 @@ def _create_network_interface(
             location=provider_config['location'],
             ip_configurations=[ip_config],
             network_security_group=network.NetworkSecurityGroup(
-                id=provider_config['nsg'])))
+                id=provider_config['nsg']),
+            enable_accelerated_networking=True))
     logger.info(f'Created network interface {ni_poller.result().name}.')
     return ni_poller.result()
 

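As a sanity check that the provisioner change actually takes effect, the NIC state can be read back through the Azure Python SDK. A minimal sketch, assuming azure-identity and azure-mgmt-network are installed; the subscription, resource group, and NIC names below are hypothetical placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# Read back the NIC created by _create_network_interface and confirm the flag stuck.
client = NetworkManagementClient(DefaultAzureCredential(),
                                 subscription_id='<subscription-id>')
nic = client.network_interfaces.get('<resource-group>', '<nic-name>')
print(nic.enable_accelerated_networking)  # expect True with the diff above applied
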
Updating nccl_test.yaml for Azure/debugging:

diff --git a/examples/nccl_test.yaml b/examples/nccl_test.yaml
index 046e72cc..5a44e59b 100644
--- a/examples/nccl_test.yaml
+++ b/examples/nccl_test.yaml
@@ -19,7 +19,9 @@ name: torch-nccl-allreduce
 num_nodes: 2
 
 resources:
-  accelerators: A100:8
+  cloud: azure
+  region: westus2
+  accelerators: A100-80GB:4
   use_spot: True
 
 setup: |
@@ -30,7 +32,7 @@ run: |
   cd ml-engineering/network/benchmarks
   NNODES=`echo "$SKYPILOT_NODE_IPS" | wc -l`
   MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
-  python -u -m torch.distributed.run \
+  NCCL_DEBUG=INFO python -u -m torch.distributed.run \
     --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
     --nnodes $NNODES \
     --rdzv_endpoint $MASTER_ADDR:8888 \
@@ -39,4 +41,4 @@ run: |
     --role `hostname -s`: \
     --tee 3 \
     all_reduce_bench.py
-    
\ No newline at end of file
+    

Output:

sky launch -c nccl --use-spot examples/nccl_test.yaml
Task from YAML spec: nccl_test.yaml
Considered resources (2 nodes):
-------------------------------------------------------------------------------------------------------------
 CLOUD   INSTANCE                         vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
-------------------------------------------------------------------------------------------------------------
 Azure   Standard_NC96ads_A100_v4[Spot]   96      880       A100-80GB:4    westus2       4.93          ✔     
-------------------------------------------------------------------------------------------------------------
Launching a new cluster 'nccl'. Proceed? [Y/n]: y
Launching an unmanaged spot task, which does not automatically recover from preemptions.
To get automatic recovery, use managed job instead: sky jobs launch or sky.jobs.launch().
⚙︎ Launching on Azure westus2.
└── Instances are up.
✓ Cluster launched: nccl.  View logs at: ~/sky_logs/sky-2024-12-07-22-00-04-119559/provision.log
⚙︎ Running setup on 2 VMs.
Collecting torch
  Downloading torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl.metadata (28 kB)
...
Installing collected packages: mpmath, typing-extensions, sympy, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch
Successfully installed MarkupSafe-3.0.2 filelock-3.16.1 fsspec-2024.10.0 jinja2-3.1.4 mpmath-1.3.0 networkx-3.4.2 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 sympy-1.13.1 torch-2.5.1 triton-3.1.0 typing-extensions-4.12.2
Cloning into 'ml-engineering'...
✓ Setup completed.  View logs at: ~/sky_logs/sky-2024-12-07-22-00-04-119559/setup-*.log
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 2 nodes.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:[rank4]:[W1208 04:54:21.114384985 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 4]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10478 [2] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10476 [0] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:[rank6]:[W1208 04:54:21.257679110 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 6]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11169 [0] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:NCCL version 2.21.5+cuda12.4
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:[rank0]:[W1208 04:54:21.135970731 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:[rank5]:[W1208 04:54:21.405075893 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 5]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO cudaDriverVersion 12020
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO Bootstrap : Using eth0:10.60.0.4<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10477 [1] NCCL INFO NET/Plugin: Using internal network plugin.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:[W1208 04:54:21.475980063 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 7]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Using network Socket
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Failed to open libibverbs.so[.1]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.4<0> [1]enP43105s1:fe80::20d:3aff:fef7:74b3%enP43105s1<0>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Using non-device net plugin version 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:[W1208 04:54:21.526623416 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 3]  using GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11171 [2] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:[rank2]:[W1208 04:54:21.439277837 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 2]  using GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO cudaDriverVersion 12020
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO Bootstrap : Using eth0:10.60.0.5<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11170 [1] NCCL INFO NET/Plugin: Using internal network plugin.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:[rank1]:[W1208 04:54:21.757081788 ProcessGroupNCCL.cpp:4115] [PG ID 0 PG GUID 0 Rank 1]  using GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device,or call init_process_group() with a device_id.
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Failed to open libibverbs.so[.1]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO NET/Socket : Using [0]eth0:10.60.0.5<0> [1]enP7906s1:fe80::20d:3aff:fef9:4aef%enP7906s1<0>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Using non-device net plugin version 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Using network Socket
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO ncclCommInitRank comm 0x8921db0 rank 1 nranks 8 cudaDev 1 nvmlDev 1 busId 200000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO ncclCommInitRank comm 0x8820300 rank 0 nranks 8 cudaDev 0 nvmlDev 0 busId 100000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO ncclCommInitRank comm 0x80fa6e0 rank 3 nranks 8 cudaDev 3 nvmlDev 3 busId 400000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO ncclCommInitRank comm 0x83b5aa0 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 300000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO ncclCommInitRank comm 0x8fe3600 rank 7 nranks 8 cudaDev 3 nvmlDev 3 busId 400000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO ncclCommInitRank comm 0x7492540 rank 6 nranks 8 cudaDev 2 nvmlDev 2 busId 300000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO ncclCommInitRank comm 0x84fdbc0 rank 4 nranks 8 cudaDev 0 nvmlDev 0 busId 100000 commId 0xa06c9deabe1f2a68 - Init START
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO ncclCommInitRank comm 0x7b81220 rank 5 nranks 8 cudaDev 1 nvmlDev 1 busId 200000 commId 0xa06c9deabe1f2a68 - Init START
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ff000000
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO NVLS multicast support is not available on dev 1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO comm 0x8921db0 rank 1 nRanks 8 nNodes 2 localRanks 4 localRank 1 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO P2P Chunksize set to 131072
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 00 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 01 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO NVLS multicast support is not available on dev 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO comm 0x8820300 rank 0 nRanks 8 nNodes 2 localRanks 4 localRank 0 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 00/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 01/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 02/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 03/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 04/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 05/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 06/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 07/08 :    0   1   2   3   4   5   6   7
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/4/-1->0->-1 [2] 1/4/-1->0->-1 [3] 1/4/-1->0->-1 [4] 1/-1/-1->0->4 [5] 1/-1/-1->0->4 [6] 1/-1/-1->0->4 [7] 1/-1/-1->0->4
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO P2P Chunksize set to 131072
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Setting affinity for GPU 3 to ffffff00,00000000,00000000
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO NVLS multicast support is not available on dev 3
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO comm 0x80fa6e0 rank 3 nRanks 8 nNodes 2 localRanks 4 localRank 3 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2 [2] -1/-1/-1->3->2 [3] -1/-1/-1->3->2 [4] -1/-1/-1->3->2 [5] -1/-1/-1->3->2 [6] -1/-1/-1->3->2 [7] -1/-1/-1->3->2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO P2P Chunksize set to 131072
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00000000
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO NVLS multicast support is not available on dev 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO comm 0x83b5aa0 rank 2 nRanks 8 nNodes 2 localRanks 4 localRank 2 MNNVL 0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Setting affinity for GPU 3 to ffffff00,00000000,00000000
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO NVLS multicast support is not available on dev 3
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO comm 0x8fe3600 rank 7 nRanks 8 nNodes 2 localRanks 4 localRank 3 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00000000
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO NVLS multicast support is not available on dev 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO comm 0x7492540 rank 6 nRanks 8 nNodes 2 localRanks 4 localRank 2 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO NVLS multicast support is not available on dev 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO comm 0x84fdbc0 rank 4 nRanks 8 nNodes 2 localRanks 4 localRank 0 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Trees [0] 5/-1/-1->4->0 [1] 5/-1/-1->4->0 [2] 5/-1/-1->4->0 [3] 5/-1/-1->4->0 [4] 5/0/-1->4->-1 [5] 5/0/-1->4->-1 [6] 5/0/-1->4->-1 [7] 5/0/-1->4->-1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ff000000
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO NVLS multicast support is not available on dev 1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO comm 0x7b81220 rank 5 nRanks 8 nNodes 2 localRanks 4 localRank 1 MNNVL 0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO P2P Chunksize set to 131072
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 00 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 01 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 01/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 02/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 03/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 04/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 05/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 06/0 : 7[3] -> 0[0] [send] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO Channel 07/0 : 7[3] -> 0[0] [send] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 00/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 01/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 02/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 03/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 04/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 05/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 06/0 : 3[3] -> 4[0] [receive] via NET/Socket/0
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 07/0 : 3[3] -> 4[0] [receive] via NET/Socket/1
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 00/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 01/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 02/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 03/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 04/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 05/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 06/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:0]:nccl-e9ef-7d4f-1:10476:10516 [0] NCCL INFO Channel 07/0 : 4[0] -> 5[1] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 00/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 01/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 02/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 03/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 04/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 05/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 06/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:2]:nccl-e9ef-7d4f-1:10478:10517 [2] NCCL INFO Channel 07/0 : 6[2] -> 7[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 02 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 03 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 04 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 05 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 06 : 5[1] -> 6[2] via SHM/direct/direct
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:1]:nccl-e9ef-7d4f-1:10477:10518 [1] NCCL INFO Channel 07 : 5[1] -> 6[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 02 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 03 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 04 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 05 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 06 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:1]:nccl-e9ef-7d4f-0:11170:11242 [1] NCCL INFO Channel 07 : 1[1] -> 2[2] via SHM/direct/direct
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 00/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 01/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 02/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 03/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 04/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 05/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 06/0 : 7[3] -> 0[0] [receive] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 07/0 : 7[3] -> 0[0] [receive] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:0]:nccl-e9ef-7d4f-0:11169:11239 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 00/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 01/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 02/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 03/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 04/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 05/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 06/0 : 3[3] -> 4[0] [send] via NET/Socket/0
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO Channel 07/0 : 3[3] -> 4[0] [send] via NET/Socket/1
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 00/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 01/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 02/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 03/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 04/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 05/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 06/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:2]:nccl-e9ef-7d4f-0:11171:11240 [2] NCCL INFO Channel 07/0 : 2[2] -> 3[3] via P2P/CUMEM/read
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: Traceback (most recent call last):
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 148, in <module>
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:     init_processes(local_rank=local_rank, fn=run)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 143, in init_processes
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:     fn(local_rank)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 117, in run
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:     timed_allreduce(mat, start_event, end_event)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 87, in timed_allreduce
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:     dist.barrier()
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:     return func(*args, **kwargs)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]:     work = group.barrier(opts=opts)
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: Last error:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<32881> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<58165> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<40997> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO transport/net.cc:306 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO transport.cc:165 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<57649> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<32881> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO init.cc:1263 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO init.cc:1548 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10519 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO group.cc:418 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10479 [3] NCCL INFO init.cc:1929 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) W1208 04:56:37.298000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10476 closing signal SIGTERM
(worker1, rank=1, pid=4888, ip=10.60.0.4) W1208 04:56:37.299000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10477 closing signal SIGTERM
(worker1, rank=1, pid=4888, ip=10.60.0.4) W1208 04:56:37.299000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 10478 closing signal SIGTERM
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: Traceback (most recent call last):
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 148, in <module>
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:     init_processes(local_rank=local_rank, fn=run)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 143, in init_processes
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:     fn(local_rank)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 117, in run
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:     timed_allreduce(mat, start_event, end_event)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:   File "/home/azureuser/sky_workdir/ml-engineering/network/benchmarks/all_reduce_bench.py", line 87, in timed_allreduce
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:     dist.barrier()
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:     return func(*args, **kwargs)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4159, in barrier
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]:     work = group.barrier(opts=opts)
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: Last error:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:[rank3]: socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<55899> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<33437> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<51577> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO transport/net.cc:306 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO transport.cc:165 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<60379> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO init.cc:1263 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO init.cc:1548 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef7:74b3%enP7906s1<55899> failed : Software caused connection abort
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:567 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO misc/socket.cc:589 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11249 [3] NCCL INFO transport/net.cc:687 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11241 [3] NCCL INFO group.cc:64 -> 2 [Async thread]
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO group.cc:418 -> 2
(head, rank=0, pid=5634) [nccl-e9ef-7d4f-0:3]:nccl-e9ef-7d4f-0:11172:11172 [3] NCCL INFO init.cc:1929 -> 2
(head, rank=0, pid=5634) W1208 04:56:37.836000 11083 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 11169 closing signal SIGTERM
(head, rank=0, pid=5634) W1208 04:56:37.836000 11083 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 11170 closing signal SIGTERM
(head, rank=0, pid=5634) W1208 04:56:37.837000 11083 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 11171 closing signal SIGTERM
(worker1, rank=1, pid=4888, ip=10.60.0.4) E1208 04:56:37.865000 10457 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 3 (pid: 10479) of binary: /home/azureuser/miniconda3/bin/python
(worker1, rank=1, pid=4888, ip=10.60.0.4) Traceback (most recent call last):
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
(worker1, rank=1, pid=4888, ip=10.60.0.4)     return _run_code(code, main_globals, None,
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/runpy.py", line 86, in _run_code
(worker1, rank=1, pid=4888, ip=10.60.0.4)     exec(code, run_globals)
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 923, in <module>
(worker1, rank=1, pid=4888, ip=10.60.0.4)     main()
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
(worker1, rank=1, pid=4888, ip=10.60.0.4)     return f(*args, **kwargs)
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
(worker1, rank=1, pid=4888, ip=10.60.0.4)     run(args)
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
(worker1, rank=1, pid=4888, ip=10.60.0.4)     elastic_launch(
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
(worker1, rank=1, pid=4888, ip=10.60.0.4)     return launch_agent(self._config, self._entrypoint, list(args))
(worker1, rank=1, pid=4888, ip=10.60.0.4)   File "/home/azureuser/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
(worker1, rank=1, pid=4888, ip=10.60.0.4)     raise ChildFailedError(
(worker1, rank=1, pid=4888, ip=10.60.0.4) torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
(worker1, rank=1, pid=4888, ip=10.60.0.4) ============================================================
(worker1, rank=1, pid=4888, ip=10.60.0.4) all_reduce_bench.py FAILED
(worker1, rank=1, pid=4888, ip=10.60.0.4) ------------------------------------------------------------
(worker1, rank=1, pid=4888, ip=10.60.0.4) Failures:
(worker1, rank=1, pid=4888, ip=10.60.0.4)   <NO_OTHER_FAILURES>
(worker1, rank=1, pid=4888, ip=10.60.0.4) ------------------------------------------------------------
(worker1, rank=1, pid=4888, ip=10.60.0.4) Root Cause (first observed failure):
(worker1, rank=1, pid=4888, ip=10.60.0.4) [0]:
(worker1, rank=1, pid=4888, ip=10.60.0.4)   time      : 2024-12-08_04:56:37
(worker1, rank=1, pid=4888, ip=10.60.0.4)   host      : nccl-e9ef-7d4f-1.internal.cloudapp.net
(worker1, rank=1, pid=4888, ip=10.60.0.4)   rank      : 7 (local_rank: 3)
(worker1, rank=1, pid=4888, ip=10.60.0.4)   exitcode  : 1 (pid: 10479)
(worker1, rank=1, pid=4888, ip=10.60.0.4)   error_file: <N/A>
(worker1, rank=1, pid=4888, ip=10.60.0.4)   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(worker1, rank=1, pid=4888, ip=10.60.0.4) ============================================================
ERROR: Job 1 failed with return code list: [137, 1] 
✓ Job finished (status: FAILED).

Note this part, which seems to be the root cause:

(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. 
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: Last error:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:[rank7]: socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<32881> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] misc/socket.cc:484 NCCL WARN socketStartConnect: Connect to fe80::20d:3aff:fef9:4aef%enP43105s1<58165> failed : Software caused connection abort
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:567 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO misc/socket.cc:589 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net_socket.cc:339 -> 2
(worker1, rank=1, pid=4888, ip=10.60.0.4) [nccl-e9ef-7d4f-1:3]:nccl-e9ef-7d4f-1:10479:10520 [3] NCCL INFO transport/net.cc:687 -> 2
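
The failing connects appear to be to an IPv6 link-local address on the SR-IOV virtual-function interface (enP43105s1) that accelerated networking exposes. One workaround that may be worth trying (untested here) is to steer NCCL's socket transport away from that interface with the standard NCCL_SOCKET_IFNAME environment variable:

# Sketch of a possible workaround (untested on this setup):
# restrict NCCL's socket transport to the primary NIC...
export NCCL_SOCKET_IFNAME=eth0
# ...or exclude the SR-IOV virtual-function interfaces by name prefix:
# export NCCL_SOCKET_IFNAME=^enP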

Interestingly, if I revert to an older image, it works again:

diff --git a/sky/clouds/azure.py b/sky/clouds/azure.py
index edd5840d..9c159271 100644
--- a/sky/clouds/azure.py
+++ b/sky/clouds/azure.py
@@ -40,7 +40,7 @@ _DEFAULT_AZURE_UBUNTU_2004_IMAGE_GB = 150
 _DEFAULT_SKYPILOT_IMAGE_GB = 30
 
 _DEFAULT_CPU_IMAGE_ID = 'skypilot:custom-cpu-ubuntu-v2'
-_DEFAULT_GPU_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v2'
+_DEFAULT_GPU_IMAGE_ID = 'skypilot:gpu-ubuntu-2204' # 'skypilot:custom-gpu-ubuntu-v2'
 _DEFAULT_V1_IMAGE_ID = 'skypilot:custom-gpu-ubuntu-v1'
 _DEFAULT_GPU_K80_IMAGE_ID = 'skypilot:k80-ubuntu-2004'
 _FALLBACK_IMAGE_ID = 'skypilot:gpu-ubuntu-2204'
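
Rather than patching sky/clouds/azure.py, the same revert can probably be done per-task by pinning the image in the task YAML's resources section - a sketch, assuming the skypilot:gpu-ubuntu-2204 tag is accepted by the image_id field:

resources:
  cloud: azure
  region: westus2
  accelerators: A100-80GB:4
  image_id: skypilot:gpu-ubuntu-2204  # older GPU image (tag assumed valid here)
  use_spot: True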

Note that with the older image you might have to add the following (presumably so that torch can find the nvjitlink shared library, and so that NCCL stays off the InfiniBand path):

export LD_LIBRARY_PATH=/home/azureuser/miniconda3/lib/python3.10/site-packages/nvidia/nvjitlink/lib/:$LD_LIBRARY_PATH
export NCCL_IB_DISABLE=1

before running the test.

Updated diff (for the older image):

diff --git a/examples/nccl_test.yaml b/examples/nccl_test.yaml
index 046e72cc..8b989496 100644
--- a/examples/nccl_test.yaml
+++ b/examples/nccl_test.yaml
@@ -19,7 +19,9 @@ name: torch-nccl-allreduce
 num_nodes: 2
 
 resources:
-  accelerators: A100:8
+  cloud: azure
+  region: westus2
+  accelerators: A100-80GB:4
   use_spot: True
 
 setup: |
@@ -30,7 +32,8 @@ run: |
   cd ml-engineering/network/benchmarks
   NNODES=`echo "$SKYPILOT_NODE_IPS" | wc -l`
   MASTER_ADDR=`echo "$SKYPILOT_NODE_IPS" | head -n1`
-  python -u -m torch.distributed.run \
+  export LD_LIBRARY_PATH=/home/azureuser/miniconda3/lib/python3.10/site-packages/nvidia/nvjitlink/lib/:$LD_LIBRARY_PATH
+  NCCL_DEBUG=INFO NCCL_IB_DISABLE=1 python -u -m torch.distributed.run \
     --nproc_per_node $SKYPILOT_NUM_GPUS_PER_NODE \
     --nnodes $NNODES \
     --rdzv_endpoint $MASTER_ADDR:8888 \
@@ -39,4 +42,4 @@ run: |
     --role `hostname -s`: \
     --tee 3 \
     all_reduce_bench.py
-    
\ No newline at end of file
+    

This leads me to believe the problem is related to the image itself.

Accelerated networking is needed to obtain a reliable high-bandwidth interconnect for jobs such as distributed training.
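
To confirm whether accelerated networking actually took effect on a launched VM, one option is to query the NIC property with the Azure CLI and to look for the SR-IOV virtual function on the VM itself (a sketch; resource-group and NIC names are placeholders):

# Control-plane check (resource-group/NIC names are placeholders):
az network nic show \
  --resource-group <resource-group> \
  --name <nic-name> \
  --query enableAcceleratedNetworking

# On the VM, the SR-IOV virtual function shows up as a Mellanox device:
lspci | grep -i mellanox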

Version & Commit info:

  • sky -v: skypilot, version 0.7.0
  • sky -c: skypilot, commit 3f62588-dirty
@concretevitamin
Member

cc @yika-luo

@Michaelvll Michaelvll added the P0 label Dec 12, 2024
@yika-luo
Collaborator

There seem to be some compatibility issues between Azure's accelerated networking and the NVIDIA NCCL build configured on the SkyPilot custom image. I sought help from Azure support, and here's the response:

The issue likely stems from NCCL's network communication not being fully compatible with Azure's Accelerated Networking feature, which uses SR-IOV for enhanced performance. NCCL may face conflicts or degraded performance in certain configurations when Accelerated Networking is enabled. Disabling Accelerated Networking on the affected VMs can resolve the problem, but it may lead to a slight reduction in network performance. Alternatively, using optimized VM types or checking for NCCL updates that address compatibility issues could help.

https://learn.microsoft.com/en-us/answers/questions/2134325/accelerated-networking-issue-with-nccl?page=1#answer-1892347

So the recommendation is to either use another image or switch to one of the optimized VM types.
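
For completeness, the "disable accelerated networking" workaround from the support response can be applied to an existing NIC with the Azure CLI (a sketch; names are placeholders, and the VM typically has to be deallocated before the NIC can be updated):

az vm deallocate --resource-group <resource-group> --name <vm-name>
az network nic update \
  --resource-group <resource-group> \
  --name <nic-name> \
  --accelerated-networking false
az vm start --resource-group <resource-group> --name <vm-name>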

@visatish
Author

Hi @yika-luo, thanks for looking into this! I'm a bit confused, as this was working before with an older SkyPilot image. I have provided instructions above (under "Interestingly, if I revert to an older image, it works again:") to replicate the older stack - can you try that?

Some other notes:

  1. This is independent of RDMA: accelerated networking improves even a standard point-to-point network bandwidth benchmark (see the sketch after this list), and I've been using standard Ethernet (no RoCE).
  2. There is no degraded performance with NCCL + accelerated networking: in my benchmarks, performance only improves and is more stable.
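
For reference, the point-to-point bandwidth check in (1) can be as simple as an iperf3 run between the two nodes (a sketch; assumes iperf3 is installed, and the head-node IP is a placeholder):

# On the head node:
iperf3 -s

# On the worker (replace with the head node's private IP):
iperf3 -c <head-private-ip> -P 8 -t 30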

@yika-luo yika-luo removed their assignment Jan 2, 2025