cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #26

Open
PeterAJansen opened this issue May 20, 2020 · 14 comments

Comments

@PeterAJansen

Hi,

I'm seeing the same error as another person posted --

(alfred_env) (base) peter@neutronium:~/github/alfred$ python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 8 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1
Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/peter/github/alfred_env/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I have verified that I've followed the installation instructions, and that the correct versions of torch (1.1.0), torchvision (0.3.0 as pinned in requirements.txt; the prose instructions say 1.3.0, though the latest release is only 0.6.0), AI2THOR (2.1.0), and tensorboardX (1.8) have been installed.

I'm using a Titan RTX and CUDA 10.1 on Kubuntu 18.04.
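For anyone comparing setups, the system-wide CUDA toolkit and the CUDA runtime a PyTorch wheel was built against can differ, since prebuilt wheels bundle their own runtime. A minimal sketch for printing what PyTorch itself reports is below; the function name `collect_env_info` is made up for illustration and is not part of the ALFRED repo.

```python
def collect_env_info():
    """Return the versions PyTorch reports, with None where unavailable."""
    info = {
        "torch": None,
        "built_with_cuda": None,
        "cudnn": None,
        "gpu_name": None,
        "compute_capability": None,
    }
    try:
        import torch
    except ImportError:
        return info  # torch is not installed in this environment

    info["torch"] = torch.__version__
    # CUDA version the binary was compiled against (None for CPU-only builds).
    info["built_with_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        info["cudnn"] = torch.backends.cudnn.version()
        info["gpu_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in collect_env_info().items():
        print("{}: {}".format(key, value))
```

Posting this output alongside the traceback makes it easier to tell whether the installed binary and the GPU are a plausible match.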

Model seems to start training without the --gpu option, but it appears slow (so I didn't wait to see how long it would take).

thanks!

@MohitShridhar
Collaborator

@PeterAJansen can you try a smaller batch size? Something less than 8?

@PeterAJansen
Author

@MohitShridhar I forgot to mention this too -- smaller batch sizes produced the same error. The Titan RTX has 24 GB of memory, which should be plenty for moderate batch sizes.

@MohitShridhar
Collaborator

MohitShridhar commented May 21, 2020

Ah I see. Have you seen this? This error is being thrown by the PyTorch RNN module, so I am not sure what's happening here.

It seems like you need to build PyTorch with the right CUDA version?
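Since the traceback ends in `flatten_parameters()` while the model is being moved to the GPU, one way to check whether the problem lies outside the ALFRED code is to move a tiny standalone LSTM to the GPU. This is only a sketch: `try_lstm_on_gpu` is a hypothetical helper, and the layer sizes are arbitrary rather than taken from the seq2seq model.

```python
def try_lstm_on_gpu():
    """Move a small LSTM to the GPU, run one forward pass, and report the outcome."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    try:
        lstm = torch.nn.LSTM(input_size=8, hidden_size=16, bidirectional=True)
        # .to() on an RNN module triggers flatten_parameters(), the call
        # that raises in the traceback above.
        lstm = lstm.to(torch.device("cuda"))
        x = torch.randn(5, 2, 8, device="cuda")  # (seq_len, batch, features)
        out, _ = lstm(x)
        torch.cuda.synchronize()  # surface asynchronous CUDA errors here
        return "ok: output shape {}".format(tuple(out.shape))
    except RuntimeError as err:
        return "failed: {}".format(err)


if __name__ == "__main__":
    print(try_lstm_on_gpu())
```

If this small script fails with the same cuDNN error, the cause is more likely the PyTorch/CUDA/driver combination than the training script or the batch size.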

@SouLeo

SouLeo commented Jul 20, 2020

@PeterAJansen did you make any progress on this? I just purchased an RTX 2080S, performed a fresh install of Ubuntu 18.04, and downloaded the recommended PyTorch version (1.5.1); my CUDA version is 10.2. Despite all this effort, I still get the same error as you.

@PeterAJansen
Author

PeterAJansen commented Jul 20, 2020 via email

@MohitShridhar
Collaborator

Sorry, I wish I could help, but I don't have an RTX 2080S to debug this.

@SouLeo

SouLeo commented Jul 21, 2020

No worries! I think I figured out that it might be an OOM issue. I ran it a couple of times on my 8 GB GPU and saw that the training program used nearly all 8 GB.

Then, after rerunning the training and changing absolutely nothing in the training program, it was able to run (and it has been running for at least 11 hours).

I’m betting I just got lucky, and I’ll be searching for cloud compute resources for the future.
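To test the out-of-memory theory rather than guess, PyTorch's own allocator counters can be compared with the card's total memory. The sketch below uses a hypothetical helper, `gpu_memory_report`; note that these counters cover only tensors allocated by the current process, so they will read lower than what `nvidia-smi` shows.

```python
def gpu_memory_report(device_index=0):
    """Return total, current, and peak tensor memory in GiB, or None without a GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    gib = 1024 ** 3
    total = torch.cuda.get_device_properties(device_index).total_memory
    return {
        "total_gib": total / gib,
        # Memory currently held by live tensors in this process.
        "allocated_gib": torch.cuda.memory_allocated(device_index) / gib,
        # High-water mark of tensor memory since the process started.
        "peak_allocated_gib": torch.cuda.max_memory_allocated(device_index) / gib,
    }


if __name__ == "__main__":
    print(gpu_memory_report())
```

Calling this from inside the training loop every few iterations would show how close the run gets to the limit. A genuine memory shortage usually surfaces as a "CUDA out of memory" error, so a peak well below the total suggests looking elsewhere.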

@PeterAJansen
Author

@SouLeo I'm working with a Titan RTX with 24 GB of memory, and was getting the error even with a batch size of 1, so I don't think it was an out-of-memory issue in my case -- in case that helps you figure out what the issue ultimately was.

@kolbytn

kolbytn commented Oct 7, 2020

Potential Fix

I was running into the same issue: Ubuntu 18.04, CUDA 10.2, Titan RTX 24 GB. I followed the quick install instructions. The error happened almost immediately. Smaller batch sizes didn't help. Running without --gpu worked.
Command:
CUDA_VISIBLE_DEVICES=1 python models/train/train_seq2seq.py --data data/json_feat_2.1.0 --model seq2seq_im_mask --dout exp/model:{model},name:pm_and_subgoals_01 --splits data/splits/oct21.json --gpu --batch 2 --pm_aux_loss_wt 0.1 --subgoal_aux_loss_wt 0.1 --preprocess
Output:

Namespace(action_loss_wt=1.0, actor_dropout=0.0, attn_dropout=0.0, batch=8, data='data/json_feat_2.1.0', dataset_fraction=0, dec_teacher_forcing=False, decay_epoch=10, demb=100, dframe=2500, dhid=512, dout='exp/model:seq2seq_im_mask,name:pm_and_subgoals_01', epoch=20, fast_epoch=False, gpu=True, hstate_dropout=0.3, input_dropout=0.0, lang_dropout=0.0, lr=0.0001, mask_loss_wt=1.0, model='seq2seq_im_mask', pframe=300, pm_aux_loss_wt=0.1, pp_folder='pp', preprocess=False, resume=None, save_every_epoch=False, seed=123, splits='data/splits/oct21.json', subgoal_aux_loss_wt=0.1, temp_no_history=False, vis_dropout=0.3, zero_goal=False, zero_instr=False)
{'tests_seen': 1533,
 'tests_unseen': 1529,
 'train': 21023,
 'valid_seen': 820,
 'valid_unseen': 821}
Traceback (most recent call last):
  File "models/train/train_seq2seq.py", line 103, in <module>
    model = model.to(torch.device('cuda'))
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 127, in _apply
    self.flatten_parameters()
  File "/home/knotting/embodied/venv_alfred/lib/python3.6/site-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
    self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

I uninstalled the versions of torch and torchvision specified in requirements.txt and installed the latest versions instead. Everything seems to be working fine now. Is this a legitimate fix, or will I run into issues using the latest PyTorch with other parts of the repo?

@MohitShridhar
Collaborator

Well... without --gpu you are training on CPU, which would be very slow.

@kolbytn

kolbytn commented Oct 8, 2020

Sorry if I wasn't clear. I mentioned that it works on the CPU only to point out that this is a CUDA/GPU issue.

I fixed my issue by upgrading torch to the latest version instead of the version specified by requirements.txt. I want to know if there is another reason requirements.txt uses torch 1.1.0 and if anything will break if I use torch version 1.6.0.

@MohitShridhar
Collaborator

MohitShridhar commented Oct 8, 2020

Yeah, I figure there might be some API updates in torch 1.6.0 that might break the code. Especially with GPU training.

@dnandha

dnandha commented Aug 11, 2021

Getting the same error with the Docker image on RTX 2080. Could be that this card is not supported by torch==1.1.0?
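A rough way to reason about this question: the RTX 20-series and the Titan RTX are Turing cards with compute capability 7.5, which the CUDA toolkit first targeted in release 10.0. The sketch below encodes that rule for a few architectures; the table and function names are illustrative only, and the CUDA version to pass in is whatever `torch.version.cuda` reports for the installed binary.

```python
# Earliest CUDA toolkit release that can target each compute capability.
MIN_CUDA_FOR_CAPABILITY = {
    (6, 0): (8, 0),   # Pascal
    (6, 1): (8, 0),   # Pascal (GTX 10-series)
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing (RTX 20-series, Titan RTX)
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30-series)
}


def parse_version(text):
    """Turn a version string such as '10.1.243' into the tuple (10, 1)."""
    return tuple(int(part) for part in text.split(".")[:2])


def toolkit_targets_gpu(cuda_version, capability):
    """Return True/False if the capability is in the table, otherwise None."""
    minimum = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if minimum is None:
        return None  # architecture not listed; no conclusion
    return parse_version(cuda_version) >= minimum


if __name__ == "__main__":
    # A binary built against CUDA 9.0 paired with a Turing card.
    print(toolkit_targets_gpu("9.0", (7, 5)))
    # A binary built against CUDA 10.2 paired with the same card.
    print(toolkit_targets_gpu("10.2", (7, 5)))
```

A `False` result points to a likely mismatch between the binary and the card rather than proving one, but it would be consistent with newer PyTorch builds working where the pinned one does not.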

@MohitShridhar
Collaborator

@dnandha the seq2seq baselines are a bit outdated now. Check out the SoTA models that use newer torch versions: https://github.com/askforalfred/alfred#sota-models
