The comm/gemm overlap example failed with "ran out of input". #1363

Closed

wujingyue opened this issue Dec 9, 2024 · 2 comments

@wujingyue (Contributor) commented:
https://github.com/NVIDIA/TransformerEngine/blob/main/examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py

$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

$ torchrun --nnodes=1 --nproc-per-node=2 te_layer_with_overlap.py --debug

[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 384, in <module>
[rank0]:     sys.exit(_train(_parse_args()))
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 323, in _train
[rank0]:     te.module.base.initialize_ub(
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/base.py", line 215, in initialize_ub
[rank0]:     torch.distributed.all_gather_object(domain_per_rank_list, mydomain, world_group)
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3031, in all_gather_object
[rank0]:     object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2940, in _tensor_to_object
[rank0]:     return _unpickler(io.BytesIO(buf)).load()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: EOFError: Ran out of input
[rank1]: Traceback (most recent call last):
[rank1]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 384, in <module>
[rank1]:     sys.exit(_train(_parse_args()))
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/pytorch/nvfuser/te_layer_with_overlap.py", line 323, in _train
[rank1]:     te.module.base.initialize_ub(
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/transformer_engine/pytorch/module/base.py", line 215, in initialize_ub
[rank1]:     torch.distributed.all_gather_object(domain_per_rank_list, mydomain, world_group)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 3031, in all_gather_object
[rank1]:     object_list[i] = _tensor_to_object(tensor, tensor_size, group)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 2940, in _tensor_to_object
[rank1]:     return _unpickler(io.BytesIO(buf)).load()
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: EOFError: Ran out of input
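
For context, the failing call in initialize_ub uses torch.distributed.all_gather_object to gather a small Python object (the NVLink domain ID) from every rank, and the EOFError means one of the received pickle buffers deserialized as empty. A minimal standalone sketch of that same pattern, useful for checking whether all_gather_object works at all in a given container, might look like the following (the script name and the choice of gathering the hostname are illustrative, not taken from Transformer Engine):

# all_gather_object_check.py -- hypothetical standalone sanity check
# run with: torchrun --nnodes=1 --nproc-per-node=2 all_gather_object_check.py
import os
import socket

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    # NCCL-backed object collectives need the current CUDA device set per rank.
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Mirror the pattern in initialize_ub: every rank contributes a small
    # picklable object and gathers the objects from all ranks.
    my_obj = {"rank": rank, "host": socket.gethostname()}
    gathered = [None for _ in range(world_size)]
    # An EOFError ("Ran out of input") here would indicate a truncated or
    # empty serialized buffer arriving from one of the ranks.
    dist.all_gather_object(gathered, my_obj)
    print(f"[rank{rank}] gathered: {gathered}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If a check along these lines also fails in the image that reproduces the issue but passes in a known-good container, the problem is more likely in that image's PyTorch/NCCL stack than in Transformer Engine itself.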
@denera (Collaborator) commented Dec 12, 2024:

@wujingyue I'm unable to reproduce this on my end with the nvcr.io/nvidia/pytorch:24.09-py3 container on an 8xH100 node.

$ nvidia-smi -L
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-3ede3ca0-38ae-e999-bee2-95b79e12296f)
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-7112490b-626a-f7a8-fefa-9dc618703986)
GPU 2: NVIDIA H100 80GB HBM3 (UUID: GPU-77981356-aee2-81ca-4f68-7f6143266b16)
GPU 3: NVIDIA H100 80GB HBM3 (UUID: GPU-917c0aa9-8c8c-35ca-46bd-8ad7e43e1d3a)
GPU 4: NVIDIA H100 80GB HBM3 (UUID: GPU-5aa869c4-000d-5d38-3e30-8584c4232790)
GPU 5: NVIDIA H100 80GB HBM3 (UUID: GPU-8254ceef-7321-ffaa-32fa-4db096991bd8)
GPU 6: NVIDIA H100 80GB HBM3 (UUID: GPU-e0b7b8cc-631f-ce22-6087-36c999c50c55)
GPU 7: NVIDIA H100 80GB HBM3 (UUID: GPU-0065f753-a758-4141-0f68-e58475eebcd7)

$ torchrun --nnodes=1 --nproc-per-node=2 te_layer_with_overlap.py --debug
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] 
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] *****************************************
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1212 20:57:50.840000 2754 torch/distributed/run.py:793] *****************************************
[rank0] Initialized default NCCL process group with 2 GPUs
[rank0] Created tensor-parallel group: [0, 1]
!!! [UB] Number of NVLink domains: 1
!!! [UB] Global ranks on domain 0: [0, 1]
!!! [UB] Create Userbuffers Communicator
UB_TIMEOUT is set to 110 sec, 217800000000 cycles, freq: 1980000khz
MC initialized succesfully, window size = 549755813888
!!! [UBP2P] Register UBuf 1
!!! [UBP2P] Register UBuf 2
!!! [UBP2P] Register UBuf 3
!!! [UBP2P] Register UBuf 4
!!! [UB] Register UBuf 5
!!! [UB] Register UBuf 6
!!! [UB] Register UBuf 7
!!! [UB] Register UBuf 8
!!! [UB] Register UBuf 9
!!! [UB] Register UBuf 10
[rank0] Starting training iterations...
[rank0]     Iter 1
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 2
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 3
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 4
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0]     Iter 5
[rank0]     |-- Generate random input batch
[rank0]     |-- Forward pass
[rank0]     |-- Compute loss
[rank0]     |-- Backward pass
[rank0]     |-- Optimizer step
[rank0] Finished training!
[rank0] Destroying all process groups...
Exiting...

@wujingyue (Contributor, Author) commented:
I'm closing this until I find a repro for you. I'm using a different Docker image and I'll have to investigate the differences between the two.

wujingyue closed this as not planned on Dec 13, 2024.