
TypeError: UbufP2PCommOverlap(): incompatible function arguments. #1365

Closed · sallyjunjun opened this issue Dec 11, 2024 · 5 comments

@sallyjunjun

Hello,
When I configured --sequence-parallel and --tp-comm-overlap and started training, it showed the following error:
TypeError: UbufP2PCommOverlap(): incompatible function arguments. The following argument types are supported:
1. () -> None

Invoked with: tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:7', dtype=torch.bfloat16), 7, 2, 16, 2, 0, 0, 3, 0, 0, tensor([])
How can I fix this? Thanks.

@denera
Collaborator

denera commented Dec 11, 2024

Hi @sallyjunjun -- TE does not maintain its own models/applications, so it does not parse command-line options like --sequence-parallel or --tp-comm-overlap. Are you using TE's TP comm overlap through another package such as NeMo, or through a custom application? If so, I would recommend opening this issue with the developers of that package/application instead, to make sure their code is invoking the TE API correctly.

On a related note, UbufP2PCommOverlap is a deprecated API, so the application you're trying to run must be using an older version of TE. Please check with that application's developers about updating to the latest release.
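
For reference, a quick way to confirm which Transformer Engine build an environment actually resolves (a minimal sketch; the __version__ attribute is assumed to be present in your release):

# report the installed package metadata, then the version the interpreter imports
pip show transformer-engine
python -c "import transformer_engine; print(transformer_engine.__version__)"

If the two disagree, pip and python are likely resolving different environments.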

@sallyjunjun
Author

Thank you for your suggestion. I will try to upgrade TransformerEngine first.

@sallyjunjun
Author

When I upgraded TransformerEngine, I ran into ModuleNotFoundError: No module named 'torch'. But I have indeed installed torch. Do you know how to fix it?

The detailed information is as follows:
Processing /mnt/hwfile/geruijun/code/Megatron-mwiacx/TransformerEngine
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [17 lines of output]
Traceback (most recent call last):
  File "/mnt/petrelfs/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
    main()
  File "/mnt/petrelfs/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/mnt/petrelfs/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
    return self._get_build_requires(config_settings, requirements=[])
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
    self.run_setup()
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 522, in run_setup
    super().run_setup(setup_script=setup_script)
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 320, in run_setup
    exec(code, locals())
  File "<string>", line 37, in <module>
ModuleNotFoundError: No module named 'torch'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
srun: error: HOST-10-140-60-3: task 0: Exited with exit code 1

pip show torch
Name: torch
Version: 2.1.0+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /mnt/hwfile/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, triton, typing-extensions
Required-by: accelerate, accelerated-scan, causal-conv1d, deepspeed, flash-attn, peft, pytorch-lightning, sentence-transformers, timm, torchaudio, torchmetrics, torchvision, transformer-engine
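
For context, the traceback above shows the build running from /tmp/pip-build-env-ncz8mpb/overlay, which is pip's isolated build environment and does not contain the torch installed in the surrounding conda environment. One common workaround (a sketch, not a confirmed fix for this setup) is to disable build isolation so the build can import the already-installed torch; any build requirements listed in TE's pyproject.toml must then already be present in the environment:

# run from the source checkout; --no-build-isolation is a standard pip flag
# that builds against the active environment's packages instead of a fresh one
cd /mnt/hwfile/geruijun/code/Megatron-mwiacx/TransformerEngine
python -m pip install --no-build-isolation .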

@denera
Collaborator

denera commented Dec 14, 2024

This sounds like a problem with your conda environment. You may be trying to install Transformer Engine in a different environment than the one where you installed PyTorch.

Also, if you're interacting with Transformer Engine through another package like NeMo or Megatron-LM, I would recommend updating that to a newer version (following their instructions, not ours) and letting it pull the correct Transformer Engine version it depends on. Otherwise, you will likely run into API mismatches between them.
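
A minimal sanity check for this kind of mismatch (a sketch, assuming a POSIX shell; paths will differ on your system):

# confirm that python and pip resolve to the same environment
which python
which pip
# `python -m pip` guarantees pip matches the interpreter above
python -m pip show torch
python -c "import torch; print(torch.__file__)"

If `which pip` points outside the environment in which the torch import succeeds, the build and the runtime are using different site-packages directories.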

@sallyjunjun
Author

Ok, I see. Thank you for your help.
