AttributeError: module 'torch.distributed' has no attribute '_all_gather_base' #1532

Open
yuheyuan opened this issue Nov 5, 2022 · 15 comments


@yuheyuan

yuheyuan commented Nov 5, 2022

    from apex.transformer.utils import split_tensor_into_1d_equal_chunks
  File "/home/ailab/anaconda3/envs/yy_FAFS/lib/python3.8/site-packages/apex/transformer/utils.py", line 11, in <module>
    torch.distributed.all_gather_into_tensor = torch.distributed._all_gather_base
AttributeError: module 'torch.distributed' has no attribute '_all_gather_base'

My versions are:

    python                    3.8.13
    torch                     1.7.1+cu110              pypi_0    pypi
    torchaudio                0.7.2                    pypi_0    pypi
    torchvision               0.8.2+cu110              pypi_0    pypi
    tqdm                      4.64.1

@crcrpar
Collaborator

crcrpar commented Nov 5, 2022

That PyTorch version is a bit too old for the current master branch of this repo.
Some of the older branches, e.g. 22.04-dev, could work for your environment.
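
For readers wondering why the import fails: the traceback shows apex aliasing `torch.distributed.all_gather_into_tensor` to `torch.distributed._all_gather_base`, and the error says the private name does not exist in the installed build. The sketch below mimics that lookup with stand-in objects, so it runs without torch. The helper `can_alias_all_gather` and the `old_dist` / `new_dist` objects are hypothetical names made up for illustration, not part of apex or PyTorch.

```python
from types import SimpleNamespace


def can_alias_all_gather(dist_module) -> bool:
    """Return True if the aliasing line from the traceback would not fail."""
    # If the public name already exists, no alias is needed.
    if hasattr(dist_module, "all_gather_into_tensor"):
        return True
    # Otherwise the alias reads the private name, which older builds lack.
    return hasattr(dist_module, "_all_gather_base")


# Hypothetical stand-ins for torch.distributed in an old and a newer build.
old_dist = SimpleNamespace(all_gather=lambda *args: None)
new_dist = SimpleNamespace(_all_gather_base=lambda *args: None)

print(can_alias_all_gather(old_dist))  # False
print(can_alias_all_gather(new_dist))  # True
```

On a real install you could pass `torch.distributed` itself to the same helper before importing `apex.transformer`; a `False` result would suggest trying an older apex branch such as 22.04-dev.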

@yuheyuan
Author

yuheyuan commented Nov 5, 2022

> That PyTorch version is a bit too old for the current master branch of this repo. Some of the older branches, e.g. 22.04-dev, could work for your environment.

Oh, thank you.
I would like to know which torch versions apex supports. My GPU is a 3090 Ti.

@huangsiyong

Hi, I have the same problem.
My versions are:

    torch 1.8.0 pypi_0 pypi
    torchaudio 0.8.0 pypi_0 pypi
    torchvision 0.9.0 pypi_0 pypi
    tqdm 4.64.1 pypi_0 pypi

The CUDA version is 10.2.

@yuheyuan
Author

yuheyuan commented Nov 7, 2022

> Hi, I have the same problem. My versions are torch 1.8.0 pypi_0 pypi, torchaudio 0.8.0 pypi_0 pypi, torchvision 0.9.0 pypi_0 pypi, tqdm 4.64.1 pypi_0 pypi.
>
> The CUDA version is 10.2.

I switched to one of the older branches, 22.04-dev, and it works in my environment. It looks OK, but I haven't tested it with other code. You can give it a try.

@S16201512

> That PyTorch version is a bit too old for the current master branch of this repo. Some of the older branches, e.g. 22.04-dev, could work for your environment.

Yep, I got it.
Thank you, this was useful for me.

@xiaomingxige

> That PyTorch version is a bit too old for the current master branch of this repo. Some of the older branches, e.g. 22.04-dev, could work for your environment.

Thanks. I have solved it.

@YijuGuo

YijuGuo commented Nov 20, 2022

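> > Hi, I have the same problem. My versions are torch 1.8.0 pypi_0 pypi, torchaudio 0.8.0 pypi_0 pypi, torchvision 0.9.0 pypi_0 pypi, tqdm 4.64.1 pypi_0 pypi.
> > The CUDA version is 10.2.
>
> I switched to one of the older branches, 22.04-dev, and it works in my environment. It looks OK, but I haven't tested it with other code. You can give it a try.

Hello, may I ask which PyTorch version I should switch to so that this error no longer appears? Thank you.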

@YijuGuo

YijuGuo commented Nov 20, 2022

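> Hi, I have the same problem. My versions are torch 1.8.0 pypi_0 pypi, torchaudio 0.8.0 pypi_0 pypi, torchvision 0.9.0 pypi_0 pypi, tqdm 4.64.1 pypi_0 pypi.
>
> The CUDA version is 10.2.

Hello, my PyTorch and CUDA versions are the same as yours. May I ask whether you have solved this problem yet?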

@S16201512

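> > Hi, I have the same problem. My versions are torch 1.8.0 pypi_0 pypi, torchaudio 0.8.0 pypi_0 pypi, torchvision 0.9.0 pypi_0 pypi, tqdm 4.64.1 pypi_0 pypi.
> > The CUDA version is 10.2.
>
> Hello, my PyTorch and CUDA versions are the same as yours. May I ask whether you have solved this problem yet?

Select the 22.04-dev branch when you clone the repo.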

@S16201512

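> > Hi, I have the same problem. My versions are torch 1.8.0 pypi_0 pypi, torchaudio 0.8.0 pypi_0 pypi, torchvision 0.9.0 pypi_0 pypi, tqdm 4.64.1 pypi_0 pypi.
> > The CUDA version is 10.2.
>
> Hello, my PyTorch and CUDA versions are the same as yours. May I ask whether you have solved this problem yet?

By default you get the master branch of apex, so please check out the 22.04-dev branch when you git clone...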
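
A minimal sketch of the two ways to end up on that branch, written as Python helpers that only build the git command lines and do not run them. The helper names are hypothetical, and the repository URL is an assumption based on the `NVIDIA/apex` name used in this thread.

```python
import shlex
from typing import List

# Assumed repository URL; adjust if you install apex from a fork or mirror.
APEX_URL = "https://github.com/NVIDIA/apex"


def clone_command(branch: str, repo_url: str = APEX_URL) -> List[str]:
    """Build a git command that clones the repo directly at the given branch."""
    return ["git", "clone", "--branch", branch, repo_url]


def checkout_command(branch: str) -> List[str]:
    """Build a git command that switches an existing clone to the given branch."""
    return ["git", "checkout", branch]


print(shlex.join(clone_command("22.04-dev")))
print(shlex.join(checkout_command("22.04-dev")))
```

The second command is meant to be run from inside an existing apex checkout. Either way, apex has to be reinstalled afterwards for the change to take effect.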

@YijuGuo

YijuGuo commented Nov 20, 2022

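> > > Hi, I have the same problem. My versions are torch 1.8.0 pypi_0 pypi, torchaudio 0.8.0 pypi_0 pypi, torchvision 0.9.0 pypi_0 pypi, tqdm 4.64.1 pypi_0 pypi.
> > > The CUDA version is 10.2.
> >
> > Hello, my PyTorch and CUDA versions are the same as yours. May I ask whether you have solved this problem yet?
>
> By default you get the master branch of apex, so please check out the 22.04-dev branch when you git clone...

Okay, I got it.
Thank you.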

@trandangtrungduc

trandangtrungduc commented Dec 2, 2022

> That PyTorch version is a bit too old for the current master branch of this repo. Some of the older branches, e.g. 22.04-dev, could work for your environment.

I tried some older versions but got another error: "Expected object of scalar type Long but got scalar type Int for argument #2 'target' in call to _thnn_nll_loss_forward". Has anyone seen this error? Thank you.

Python 3.8.10
PyTorch 1.8.0 (torchvision 0.9.0)

@tusharkhurana841

tusharkhurana841 commented Dec 6, 2022

> > > > Hi, I have the same problem. My versions are torch 1.8.0 pypi_0 pypi, torchaudio 0.8.0 pypi_0 pypi, torchvision 0.9.0 pypi_0 pypi, tqdm 4.64.1 pypi_0 pypi.
> > > > The CUDA version is 10.2.
> > >
> > > Hello, my PyTorch and CUDA versions are the same as yours. May I ask whether you have solved this problem yet?
> >
> > By default you get the master branch of apex, so please check out the 22.04-dev branch when you git clone...
>
> Okay, I got it. Thank you.

Did using 22.04-dev resolve your error?

@Shao1Fan

Shao1Fan commented Dec 27, 2022

  1. Search the apex source globally for "The following 4 lines are for backward comparability with". You will find lines in several apex files that you should comment out because your torch version is too old.
  2. vi ~/.bashrc
  3. Add `export TORCH_CUDA_ARCH_LIST="8.0"` at the end of the file. Change "8.0" to match your GPU; look up its compute capability at https://developer.nvidia.com/cuda-gpus#compute . For example, my GPU is a Titan Xp, so I changed "8.0" to "6.1".
  4. source ~/.bashrc
  5. pip uninstall apex
  6. cd apex
  7. pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
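
Steps 3 and 7 above can also be sketched in Python for readers who prefer not to edit `~/.bashrc`. This is only an illustration under assumptions: `build_env` and `install_command` are hypothetical helper names, the variable is set for a single child process rather than for the whole shell, and nothing is executed here.

```python
import os
import re
from typing import Dict, List


def build_env(compute_capability: str) -> Dict[str, str]:
    """Copy the current environment and set TORCH_CUDA_ARCH_LIST in the copy."""
    # Expect a "major.minor" value such as "8.0" or "6.1".
    if not re.fullmatch(r"\d+\.\d+", compute_capability):
        raise ValueError(f"expected something like '6.1', got {compute_capability!r}")
    env = dict(os.environ)
    env["TORCH_CUDA_ARCH_LIST"] = compute_capability
    return env


def install_command() -> List[str]:
    """Build the pip command from step 7 as an argument list."""
    return [
        "pip", "install", "-v", "--no-cache-dir",
        "--global-option=--cpp_ext",
        "--global-option=--cuda_ext",
        "./",
    ]


env = build_env("6.1")  # "6.1" is the Titan Xp value quoted in step 3
print(env["TORCH_CUDA_ARCH_LIST"])
print(" ".join(install_command()))
```

To actually run the build, something like `subprocess.run(install_command(), cwd="apex", env=env, check=True)` should work, assuming the apex checkout is in a directory named `apex` as in step 6.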

@daeunni

daeunni commented Mar 26, 2023

MILVLG/bottom-up-attention.pytorch#98 (comment)

It works for me. :)

akhilkedia added a commit to akhilkedia/ConvNeXt-V2 that referenced this issue Jun 20, 2023
The latest version of apex currently does not install, as mentioned here facebookresearch#52.

This issue with apex has also been reported here NVIDIA/apex#1679

huggingface/transformers#24351 suggests pinning apex to a specific commit, `cd apex && git checkout 82ee367f3da74b4cd62a1fb47aa9806f0f47b58b`, after which apex installs successfully.

However, that version of apex is incompatible with the version of torch used here, and I get the error reported in NVIDIA/apex#1532.

That issue suggests using the `22.04-dev` branch of apex (`cd apex && git checkout 22.04-dev`). With this, apex compiles successfully and `python ./main_finetune.py` also runs training with amp successfully.

If the authors can tell us the exact HEAD commit of apex version that they used, we can use that version instead!