The comm/gemm overlap example failed with "ran out of input". #1363

Closed

wujingyue opened this issue Dec 9, 2024 · 2 comments

@wujingyue (Contributor) commented:
https://github.com/NVIDIA/TransformerEngine/blob/main/examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py

$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

$ torchrun --nnodes=1 --nproc-per-node=2 te_layer_with_overlap.py --debug

[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 384, in <module>
[rank0]:     sys.exit(_train(_parse_args()))
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 323, in _train
[rank0]:     te.module.base.initialize_ub(
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/base.py", line 215, in initialize_ub
[rank0]:     torch.distributed.all_gather_object(domain_per_rank_list, mydomain, world_group)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3031, in all_gather_object
[rank0]:     object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2940, in _tensor_to_object
[rank0]:     return _unpickler(io.BytesIO(buf)).load()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: EOFError: Ran out of input
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 384, in <module>
[rank1]:     sys.exit(_train(_parse_args()))
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 323, in _train
[rank1]:     te.module.base.initialize_ub(
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/base.py", line 215, in initialize_ub
[rank1]:     torch.distributed.all_gather_object(domain_per_rank_list, mydomain, world_group)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3031, in all_gather_object
[rank1]:     object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2940, in _tensor_to_object
[rank1]:     return _unpickler(io.BytesIO(buf)).load()
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: EOFError: Ran out of input
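
For context, the failing call in initialize_ub uses torch.distributed.all_gather_object to gather a small Python object (the NVLink domain ID) from every rank, and the EOFError means one of the received pickle buffers deserialized as empty. A minimal standalone sketch of that same pattern, useful for checking whether all_gather_object works at all in a given container, might look like the following (the script name and the choice of gathering the hostname are illustrative, not taken from Transformer Engine):

# all_gather_object_check.py -- hypothetical standalone sanity check
# run with: torchrun --nnodes=1 --nproc-per-node=2 all_gather_object_check.py
import os
import socket

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # NCCL-backed object collectives need the current CUDA device set per rank.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Mirror the pattern in initialize_ub: every rank contributes a small
    # picklable object and gathers the objects from all ranks.
    my_obj = {"rank": rank, "host": socket.gethostname()}
    gathered = [None for _ in range(world_size)]
    # An EOFError ("Ran out of input") here would indicate a truncated or
    # empty serialized buffer arriving from one of the ranks.
    dist.all_gather_object(gathered, my_obj)
    print(f"[rank{rank}] gathered: {gathered}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If a check along these lines also fails in the image that reproduces the issue but passes in a known-good container, the problem is more likely in that image's PyTorch/NCCL stack than in Transformer Engine itself.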
@denera (Collaborator) commented Dec 12, 2024:

@wujingyue I'm unable to reproduce this on my end with the nvcr.io/nvidia/pytorch:24.09-py3 container on an 8xH100 node.

$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-3ede3ca0-38ae-e999-bee2-95b79e12296f)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-7112490b-626a-f7a8-fefa-9dc618703986)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-77981356-aee2-81ca-4f68-7f6143266b16)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-917c0aa9-8c8c-35ca-46bd-8ad7e43e1d3a)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-5aa869c4-000d-5d38-3e30-8584c4232790)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-8254ceef-7321-ffaa-32fa-4db096991bd8)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-e0b7b8cc-631f-ce22-6087-36c999c50c55)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-0065f753-a758-4141-0f68-e58475eebcd7)

$ torchrun --nnodes=1 --nproc-per-node=2 te_layer_with_overlap.py --debug
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] 
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] *****************************************
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] *****************************************
[rank0] Initialized default NCCL process group with 2 GPUs
[rank0] Created tensor-parallel group: [0, 1]
!!! [UB] Number of NVLink domains: 1
!!! [UB] Global ranks on domain 0: [0, 1]
!!! [UB] Create Userbuffers Communicator
UB_TIMEOUT is set to 110 sec, 217800000000 cycles, freq: 1980000khz
MC initialized succesfully, window size = 549755813888
!!! [UBP2P] Register UBuf 1
!!! [UBP2P] Register UBuf 2
!!! [UBP2P] Register UBuf 3
!!! [UBP2P] Register UBuf 4
!!! [UB] Register UBuf 5
!!! [UB] Register UBuf 6
!!! [UB] Register UBuf 7
!!! [UB] Register UBuf 8
!!! [UB] Register UBuf 9
!!! [UB] Register UBuf 10
[rank0] Starting training iterations...
[rank0]     Iter 1
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 2
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 3
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 4
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 5
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0] Finished training!
[rank0] Destroying all process groups...
Exiting...

@wujingyue (Contributor, Author) commented:
I'm closing this until I find a repro for you. I'm using a different Docker image and I'll have to investigate the differences between the two.

wujingyue closed this as not planned on Dec 13, 2024.