
TypeError: UbufP2PCommOverlap(): incompatible function arguments. #1365

Closed · sallyjunjun opened this issue Dec 11, 2024 · 5 comments

@sallyjunjun

Hello,
When I configured --sequence-parallel and --tp-comm-overlap and started training, it showed the following error:
TypeError: UbufP2PCommOverlap(): incompatible function arguments. The following argument types are supported:
1. () -> None

Invoked with: tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:7', dtype=torch.bfloat16), 7, 2, 16, 2, 0, 0, 3, 0, 0, tensor([])
How can I fix this? Thanks.

@denera
Collaborator

denera commented Dec 11, 2024

Hi @sallyjunjun -- TE does not maintain its own models/applications, so it does not parse command-line options like --sequence-parallel or --tp-comm-overlap. Are you using TE's TP comm overlap through another package such as NeMo, or through a custom application? If so, I would recommend opening this issue with the developers of that package/application instead, to make sure their code is invoking the TE API correctly.

On a related note, UbufP2PCommOverlap is a deprecated API, so the application you're trying to run must be using an older version of TE. Please check with that application's developers about updating to the latest release.
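
For reference, a quick way to confirm which Transformer Engine build an environment actually resolves (a minimal sketch; the __version__ attribute is assumed to be present in your release):

# report the installed package metadata, then the version the interpreter imports
pip show transformer-engine
python -c "import transformer_engine; print(transformer_engine.__version__)"

If the two disagree, pip and python are likely resolving different environments.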

@sallyjunjun
Author

Thank you for your suggestion. I will try to upgrade TransformerEngine first.

@sallyjunjun
Author

When I upgraded TransformerEngine, I ran into ModuleNotFoundError: No module named 'torch'. But I have indeed installed torch. Do you know how to fix it?

The detailed information is as follows:
Processing /mnt/hwfile/geruijun/code/Megatron-mwiacx/TransformerEngine
Installing build dependencies: started
Installing build dependencies: finished with status 'done'
Getting requirements to build wheel: started
Getting requirements to build wheel: finished with status 'error'
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [17 lines of output]
Traceback (most recent call last):
  File "/mnt/petrelfs/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
    main()
  File "/mnt/petrelfs/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/mnt/petrelfs/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 334, in get_requires_for_build_wheel
    return self._get_build_requires(config_settings, requirements=[])
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 304, in _get_build_requires
    self.run_setup()
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 522, in run_setup
    super().run_setup(setup_script=setup_script)
  File "/tmp/pip-build-env-ncz8mpb/overlay/lib/python3.10/site-packages/setuptools/build_meta.py", line 320, in run_setup
    exec(code, locals())
  File "<string>", line 37, in <module>
ModuleNotFoundError: No module named 'torch'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
srun: error: HOST-10-140-60-3: task 0: Exited with exit code 1

pip show torch
Name: torch
Version: 2.1.0+cu121
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3
Location: /mnt/hwfile/geruijun/miniconda3-new/envs/llm-cuda12.2-nemo/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, triton, typing-extensions
Required-by: accelerate, accelerated-scan, causal-conv1d, deepspeed, flash-attn, peft, pytorch-lightning, sentence-transformers, timm, torchaudio, torchmetrics, torchvision, transformer-engine
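
For context, the traceback above shows the build running from /tmp/pip-build-env-ncz8mpb/overlay, which is pip's isolated build environment and does not contain the torch installed in the surrounding conda environment. One common workaround (a sketch, not a confirmed fix for this setup) is to disable build isolation so the build can import the already-installed torch; any build requirements listed in TE's pyproject.toml must then already be present in the environment:

# run from the source checkout; --no-build-isolation is a standard pip flag
# that builds against the active environment's packages instead of a fresh one
cd /mnt/hwfile/geruijun/code/Megatron-mwiacx/TransformerEngine
python -m pip install --no-build-isolation .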

@denera
Collaborator

denera commented Dec 14, 2024

This sounds like a problem with your conda environment. You may be trying to install Transformer Engine in a different environment than the one where you installed PyTorch.

Also, if you're interacting with Transformer Engine through another package like NeMo or Megatron-LM, I would recommend updating that to a newer version (following their instructions, not ours) and letting it pull the correct Transformer Engine version it depends on. Otherwise, you will likely run into API mismatches between them.
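
A minimal sanity check for this kind of mismatch (a sketch, assuming a POSIX shell; paths will differ on your system):

# confirm that python and pip resolve to the same environment
which python
which pip
# `python -m pip` guarantees pip matches the interpreter above
python -m pip show torch
python -c "import torch; print(torch.__file__)"

If `which pip` points outside the environment in which the torch import succeeds, the build and the runtime are using different site-packages directories.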

@sallyjunjun
Author

Ok, I see. Thank you for your help.
