TypeError: UbufP2PCommOverlap(): incompatible function arguments. #1365
Comments
Hi @sallyjunjun -- TE does not maintain its own models/applications, so it does not parse any command-line options like `--tp-comm-overlap`; those are handled by the training framework (e.g. Megatron-LM). On a related note, I would suggest upgrading to a newer version of Transformer Engine.
Thank you for your suggestion. I will try to upgrade TransformerEngine first.
When I upgraded TransformerEngine, I ran into `ModuleNotFoundError: No module named 'torch'`, even though torch is indeed installed (I checked with `pip show torch`). Do you know how to fix it? The relevant part of the output is:
× Getting requirements to build wheel did not run successfully.
note: This error originates from a subprocess, and is likely not a problem with pip.
This sounds like a problem with your conda environment. You may be trying to install Transformer Engine in a different environment than the one where you installed PyTorch. Also, if you're interacting with Transformer Engine through another package like NeMo or Megatron-LM, I would recommend updating that package to a newer version (following their instructions, not ours) and letting it pull in the Transformer Engine version it depends on. Otherwise, you will likely run into API mismatches between them.
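A minimal sketch of such an environment check (illustrative only; run it with the same `python` you later use for `python -m pip install ...`, so the install targets the interpreter that can already import torch):

```python
# Illustrative sanity check: confirm which interpreter is active and that
# torch is importable from it before installing Transformer Engine.
import sys
print("interpreter:", sys.executable)

import torch
print("torch:", torch.__version__, "from", torch.__file__)
```

If the import fails here, the bare `pip` on your PATH is probably pointing at a different environment; using `python -m pip show torch` and `python -m pip install ...` keeps everything tied to the interpreter printed above.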
Ok, I see. Thank you for your help. |
Hello,
When I configured `--sequence-parallel` and `--tp-comm-overlap` and started training, it showed the following error:
TypeError: UbufP2PCommOverlap(): incompatible function arguments. The following argument types are supported:
1. () -> None
Invoked with: tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:7', dtype=torch.bfloat16), 7, 2, 16, 2, 0, 0, 3, 0, 0, tensor([])
How can I fix this? Thanks.
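This kind of `incompatible function arguments` error from a pybind11 binding usually means the Python-side caller (here, Megatron-LM's communication-overlap setup) was written against a different Transformer Engine API than the one that is installed, which is what the upgrade advice above addresses. A minimal sketch for checking which versions are actually in play (assuming both packages expose a `__version__`, which can vary by install):

```python
# Illustrative version check; megatron.core may not be importable or may not
# carry a __version__ attribute depending on how Megatron-LM was installed.
import transformer_engine
print("Transformer Engine:", transformer_engine.__version__)

try:
    import megatron.core
    print("Megatron-LM core:", getattr(megatron.core, "__version__", "unknown"))
except ImportError:
    print("Megatron-LM is not importable in this environment")
```

If these do not line up with the Transformer Engine version your Megatron-LM/NeMo release expects, updating the framework and letting it pull its pinned Transformer Engine is the safer path, as suggested in the comments above.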