The comm/gemm overlap example failed with "ran out of input". #1363
@wujingyue I'm unable to reproduce this on my end with the nvcr.io/nvidia/pytorch:24.09-py3 container on an 8xH100 node.

$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-3ede3ca0-38ae-e999-bee2-95b79e12296f)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-7112490b-626a-f7a8-fefa-9dc618703986)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-77981356-aee2-81ca-4f68-7f6143266b16)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-917c0aa9-8c8c-35ca-46bd-8ad7e43e1d3a)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-5aa869c4-000d-5d38-3e30-8584c4232790)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-8254ceef-7321-ffaa-32fa-4db096991bd8)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-e0b7b8cc-631f-ce22-6087-36c999c50c55)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-0065f753-a758-4141-0f68-e58475eebcd7)
$ torchrun --nnodes=1 --nproc-per-node=2 te_layer_with_overlap.py --debug
W1212 20:57:50.840000 2754 torch/distributed/run.py:793]
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] *****************************************
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] *****************************************
[rank0] Initialized default NCCL process group with 2 GPUs
[rank0] Created tensor-parallel group: [0, 1]
!!! [UB] Number of NVLink domains: 1
!!! [UB] Global ranks on domain 0: [0, 1]
!!! [UB] Create Userbuffers Communicator
UB_TIMEOUT is set to 110 sec, 217800000000 cycles, freq: 1980000khz
MC initialized succesfully, window size = 549755813888
!!! [UBP2P] Register UBuf 1
!!! [UBP2P] Register UBuf 2
!!! [UBP2P] Register UBuf 3
!!! [UBP2P] Register UBuf 4
!!! [UB] Register UBuf 5
!!! [UB] Register UBuf 6
!!! [UB] Register UBuf 7
!!! [UB] Register UBuf 8
!!! [UB] Register UBuf 9
!!! [UB] Register UBuf 10
[rank0] Starting training iterations...
[rank0] Iter 1
[rank0] |-- Generate random input batch
[rank0] |-- Forward pass
[rank0] |-- Compute loss
[rank0] |-- Backward pass
[rank0] |-- Optimizer step
[rank0] Iter 2
[rank0] |-- Generate random input batch
[rank0] |-- Forward pass
[rank0] |-- Compute loss
[rank0] |-- Backward pass
[rank0] |-- Optimizer step
[rank0] Iter 3
[rank0] |-- Generate random input batch
[rank0] |-- Forward pass
[rank0] |-- Compute loss
[rank0] |-- Backward pass
[rank0] |-- Optimizer step
[rank0] Iter 4
[rank0] |-- Generate random input batch
[rank0] |-- Forward pass
[rank0] |-- Compute loss
[rank0] |-- Backward pass
[rank0] |-- Optimizer step
[rank0] Iter 5
[rank0] |-- Generate random input batch
[rank0] |-- Forward pass
[rank0] |-- Compute loss
[rank0] |-- Backward pass
[rank0] |-- Optimizer step
[rank0] Finished training!
[rank0] Destroying all process groups...
Exiting...
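
For reference, the per-iteration steps printed above (generate random input batch, forward pass, compute loss, backward pass, optimizer step) are just a standard training loop. A minimal sketch in plain PyTorch, where model, optimizer, loss_fn, and the tensor shapes are hypothetical stand-ins for what the example actually builds:

import torch

# Hypothetical stand-ins for the example's TE layer, optimizer, and loss.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

for step in range(1, 6):
    # |-- Generate random input batch
    x = torch.randn(32, 1024, device="cuda")
    target = torch.randn(32, 1024, device="cuda")
    # |-- Forward pass
    out = model(x)
    # |-- Compute loss
    loss = loss_fn(out, target)
    # |-- Backward pass
    optimizer.zero_grad()
    loss.backward()
    # |-- Optimizer step
    optimizer.step()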
I'm closing this until I find a repro for you. I'm using a different Docker image and I'll have to investigate the differences between the two.
The example in question: https://github.com/NVIDIA/TransformerEngine/blob/main/examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py
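
For context, the linked example builds a Transformer Engine layer with tensor-parallel communication/GEMM overlap, which is what the [UB]/[UBP2P] registration messages in the log above come from. A minimal, hedged sketch of that kind of setup, assuming the te.initialize_ub and ub_tp_comm_overlap API of the TE release shipped in the 24.09 container (argument names and defaults vary across TE versions, and the sizes below are made up rather than taken from the example):

import os
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te

# Made-up sizes; the real example takes these from command-line arguments.
seq_len, batch_size, hidden_size, num_heads = 512, 2, 4096, 32

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
tp_size = dist.get_world_size()

# Userbuffers setup for comm+GEMM overlap (signature hedged; may differ by TE version).
te.initialize_ub([seq_len * batch_size, hidden_size], tp_size, dtype=torch.bfloat16)

layer = te.TransformerLayer(
    hidden_size,
    4 * hidden_size,
    num_heads,
    set_parallel_mode=True,
    tp_group=dist.group.WORLD,
    seq_length=seq_len,
    micro_batch_size=batch_size,
    ub_tp_comm_overlap=True,  # enables the Userbuffers overlap seen in the log
).cuda()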