cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #58

Open
jt551 opened this issue Feb 14, 2024 · 5 comments

@jt551

jt551 commented Feb 14, 2024

Hello,
I'm trying to run the sample notebook on a new laptop with Ubuntu 20.04, an RTX2000 GPU, and nvidia-driver-535.
When I execute the following section in samples.ipynb:

Networks prediction for the segmentation

I get the following error in the notebook immediately at the model() call:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-d742f71e52a2> in <module>
     11         # We rotate first the image
     12         rot_image = rot(image, 'tensor', forward)
---> 13         pred = model(rot_image)
     14         # We rotate prediction back
     15         pred = rot(pred, 'tensor', back)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/app/floortrans/models/hg_furukawa_original.py in forward(self, x)
    134 
    135     def forward(self, x):
--> 136         out = self.conv1_(x)
    137         out = self.bn1(out)
    138         out = self.relu1(out)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The terminal running Docker shows:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function

Could I get help resolving this issue?
Thank you!

@tippo00

tippo00 commented Mar 27, 2024

Hi,
I got the same error as you when trying to run samples.ipynb and eval.py. Have you found a solution?
Best regards

@jt551
Author

jt551 commented Apr 2, 2024

No solution yet.
It works on Paperspace with an older P4000 GPU, since the Pascal architecture is supported by cuDNN 7.6.5 (CUDA 9).
https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-824/support-matrix/#cudnn-versions-764-765

The Ada compatibility guide (https://docs.nvidia.com/cuda/ada-compatibility-guide/) suggests running with
CUDA_FORCE_PTX_JIT=1
but this produced the same error.
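
For anyone hitting the same "invalid device function" message, a quick check like the one below may help confirm whether the installed PyTorch build targets the GPU at all. This is only a sketch: parse_arch, build_covers_device and report are made-up helper names, and torch.cuda.get_arch_list() does not exist in very old PyTorch releases, so the script falls back to "unknown" there. It also only looks at the targets PyTorch itself was compiled for, not at what the bundled cuDNN supports.

```python
def parse_arch(tag):
    """Split a tag such as 'sm_70' or 'compute_70' into (kind, (major, minor))."""
    kind, _, rest = tag.partition("_")
    digits = "".join(ch for ch in rest if ch.isdigit())
    return kind, (int(digits[:-1]), int(digits[-1]))


def build_covers_device(capability, arch_list):
    """Return True if any compiled target can run on a device with this capability."""
    for tag in arch_list:
        kind, target = parse_arch(tag)
        # Native binaries (sm_XY) run on the same major version with minor >= Y.
        if kind == "sm" and target[0] == capability[0] and target[1] <= capability[1]:
            return True
        # Embedded PTX (compute_XY) can be JIT-compiled for any newer device.
        if kind == "compute" and target <= tuple(capability):
            return True
    return False


def report():
    """Print the device capability and the targets this PyTorch build was compiled for."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    if not torch.cuda.is_available():
        print("No CUDA device visible")
        return
    capability = torch.cuda.get_device_capability(0)
    # get_arch_list() is missing from old releases, so fall back to an empty list.
    arch_list = getattr(torch.cuda, "get_arch_list", lambda: [])()
    print("device capability:", capability)
    print("compiled targets:", arch_list or "unknown (old PyTorch)")
    if arch_list:
        print("covered:", build_covers_device(capability, arch_list))


if __name__ == "__main__":
    report()
```

With an illustrative target list of ["sm_60", "sm_70"], a Pascal card with capability (6, 1) is covered while an Ada card with capability (8, 9) is not. Forcing PTX JIT can only help if the build actually embeds a compute_XY target.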

@tippo00

tippo00 commented Apr 10, 2024

Hi,
I got the sample notebook to work by running it on a newer version of CUDA (11.8) on my RTX 4070. I did this by first changing the Dockerfile to:

FROM anibali/pytorch:2.0.1-cuda11.8-ubuntu22.04

# RUN sudo apt-get update
# RUN sudo apt-get upgrade -y
# RUN sudo apt-get install -y \
#         build-essential 

RUN sudo apt-get update \
 && sudo apt-get install -y libgl1-mesa-glx libgtk2.0-0 libsm6 libxext6 \
 && sudo rm -rf /var/lib/apt/lists/*

COPY requirements.txt /app/.

RUN pip install -r requirements.txt

Then I changed requirements.txt by removing the pinned versions on all packages, adding opencv-python, and removing mkl-fft and mkl-random.

This led to the error ValueError: A colormap named "rooms_furu" is already registered. in /floortrans/plotting.py, which I fixed by changing line 610 in plotting.py to cmap3 = colors.ListedColormap(cpool, 'rooms_furu2').

I can now run the entirety of samples.ipynb without any errors, but I get a different error when running eval.py:

$ python eval.py --weights model_best_val_loss_var.pkl
Traceback (most recent call last):                                              
  File "/app/eval.py", line 109, in <module>
    evaluate(args, log_dir, writer, logger)
  File "/app/eval.py", line 67, in evaluate
    things = get_evaluation_tensors(val, model, split, logger, rotate=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/floortrans/metrics.py", line 176, in get_evaluation_tensors
    predicted_classes = polygons_to_tensor(
                        ^^^^^^^^^^^^^^^^^^^
  File "/app/floortrans/metrics.py", line 127, in polygons_to_tensor
    ten[pol_type['class'] + d][jj, ii] = 1
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
IndexError: index 521 is out of bounds for axis 0 with size 521

@foojayyy

foojayyy commented Jun 3, 2024

Have you resolved this issue? Thank you!

@tippo00

tippo00 commented Jun 10, 2024

Yes, I got it to work by changing 4 lines in floortrans/post_prosessing.py.
From:

        polygon[:, 0] = np.clip(polygon[:, 0], 0, max_width)
        polygon[:, 1] = np.clip(polygon[:, 1], 0, max_height)

To:

        polygon[:, 0] = np.clip(polygon[:, 0], 0, max_width-1)
        polygon[:, 1] = np.clip(polygon[:, 1], 0, max_height-1)

I made this change in two places: the first around line 925 and the second around line 981. Hope this helps!
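
To illustrate why the -1 matters, here is a small standalone sketch (not the code from post_prosessing.py; the raster size, the polygon and the clip_polygon helper are made up for the example). Clipping a coordinate to max_width still allows the value max_width itself, which is one past the last valid index, and that matches the "index 521 is out of bounds for axis 0 with size 521" message.

```python
import numpy as np

height, width = 4, 6  # hypothetical raster size
mask = np.zeros((height, width), dtype=np.uint8)

# Polygon vertices as (x, y) pairs; two of them fall outside the raster.
polygon = np.array([[-2, 1], [6, 4], [3, 2]])


def clip_polygon(points, max_x, max_y):
    """Return a copy with x limited to [0, max_x] and y limited to [0, max_y]."""
    clipped = points.copy()
    clipped[:, 0] = np.clip(clipped[:, 0], 0, max_x)
    clipped[:, 1] = np.clip(clipped[:, 1], 0, max_y)
    return clipped


too_wide = clip_polygon(polygon, width, height)          # original bounds
in_range = clip_polygon(polygon, width - 1, height - 1)  # corrected bounds

try:
    # A coordinate equal to the array size is not a valid index.
    mask[too_wide[:, 1], too_wide[:, 0]] = 1
except IndexError as err:
    print("original bounds:", err)

# With the corrected bounds every vertex lands inside the raster.
mask[in_range[:, 1], in_range[:, 0]] = 1
print(int(mask.sum()), "pixels set")
```

Valid indices run from 0 to size - 1, so the upper clip bound has to be size - 1 rather than size.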
