cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #58

jt551 · 2024-02-14T14:52:55Z

Hello,
I'm trying to run the sample notebook on a new laptop with Ubuntu 20.04, an RTX2000 GPU, and nvidia-driver-535.
When executing the following section in samples.ipynb:

Networks prediction for the segmentation

I immediately get the following error in the notebook at the model() call:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-d742f71e52a2> in <module>
     11         # We rotate first the image
     12         rot_image = rot(image, 'tensor', forward)
---> 13         pred = model(rot_image)
     14         # We rotate prediction back
     15         pred = rot(pred, 'tensor', back)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/app/floortrans/models/hg_furukawa_original.py in forward(self, x)
    134 
    135     def forward(self, x):
--> 136         out = self.conv1_(x)
    137         out = self.bn1(out)
    138         out = self.relu1(out)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The terminal running Docker shows:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function

Could I get help resolving this issue?
Thank you!

tippo00 · 2024-03-27T09:14:46Z

Hi,
I got the same error as you when trying to run samples.ipynb and eval.py. Have you found a solution?
Best regards

jt551 · 2024-04-02T09:05:37Z

No solution yet.
It works on Paperspace with an older P4000 GPU, since the Pascal architecture is supported by cuDNN 7.6.5 (CUDA 9):
https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-824/support-matrix/#cudnn-versions-764-765

https://docs.nvidia.com/cuda/ada-compatibility-guide/ suggested running with
CUDA_FORCE_PTX_JIT=1
but this produced the same error.

tippo00 · 2024-04-10T08:52:30Z

Hi,
I got the sample notebook to work by running it on a newer version of CUDA (11.8) on my RTX 4070. I did this by first changing the Dockerfile to:

FROM anibali/pytorch:2.0.1-cuda11.8-ubuntu22.04

# RUN sudo apt-get update
# RUN sudo apt-get upgrade -y
# RUN sudo apt-get install -y \
#         build-essential 

RUN sudo apt-get update \
 && sudo apt-get install -y libgl1-mesa-glx libgtk2.0-0 libsm6 libxext6 \
 && sudo rm -rf /var/lib/apt/lists/*

COPY requirements.txt /app/.

RUN pip install -r requirements.txt

I then changed requirements.txt by removing the pinned versions from all packages, adding opencv-python, and removing mkl-fft and mkl-random.

This led to the error ValueError: A colormap named "rooms_furu" is already registered. in /floortrans/plotting.py, which I fixed by changing line 610 in plotting.py to cmap3 = colors.ListedColormap(cpool, 'rooms_furu2').
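
An alternative to renaming, sketched here under the assumption that matplotlib 3.5 or newer is installed (it provides the matplotlib.colormaps registry): register a colormap only when its name is not already taken. The helper register_once and the colormap name used below are hypothetical and do not come from plotting.py.

```python
import matplotlib
from matplotlib import colors


def register_once(cmap):
    """Register cmap under its own name unless that name is already registered.

    Returns the colormap stored in the registry under that name.
    """
    if cmap.name not in matplotlib.colormaps:
        matplotlib.colormaps.register(cmap)
    return matplotlib.colormaps[cmap.name]


# Hypothetical two-color map; the name is made up for this example.
demo = colors.ListedColormap(["#000000", "#ffffff"], name="demo_rooms_example")

first = register_once(demo)
second = register_once(demo)  # second call is a no-op instead of raising ValueError
```

Note that this keeps whichever colormap was registered first, so it only suits the case where both registrations are meant to be the same map. If they differ, a distinct name (as in the fix above) is the safer choice.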

I can now run the entirety of samples.ipynb without any errors, but I get a different error when running eval.py:

$ python eval.py --weights model_best_val_loss_var.pkl
Traceback (most recent call last):                                              
  File "/app/eval.py", line 109, in <module>
    evaluate(args, log_dir, writer, logger)
  File "/app/eval.py", line 67, in evaluate
    things = get_evaluation_tensors(val, model, split, logger, rotate=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/floortrans/metrics.py", line 176, in get_evaluation_tensors
    predicted_classes = polygons_to_tensor(
                        ^^^^^^^^^^^^^^^^^^^
  File "/app/floortrans/metrics.py", line 127, in polygons_to_tensor
    ten[pol_type['class'] + d][jj, ii] = 1
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
IndexError: index 521 is out of bounds for axis 0 with size 521

foojayyy · 2024-06-03T14:12:36Z

Have you resolved this issue? Thank you!

tippo00 · 2024-06-10T08:25:10Z

Yes, I got it to work by changing 4 lines in floortrans/post_prosessing.py.
From:

        polygon[:, 0] = np.clip(polygon[:, 0], 0, max_width)
        polygon[:, 1] = np.clip(polygon[:, 1], 0, max_height)

To:

        polygon[:, 0] = np.clip(polygon[:, 0], 0, max_width-1)
        polygon[:, 1] = np.clip(polygon[:, 1], 0, max_height-1)

I made this change in two places: the first around line 925 and the second around line 981. Hope this helps!
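
To illustrate why the -1 matters, here is a small self-contained sketch. An array of size 521 along an axis has valid indices 0 to 520, so clipping a coordinate to 521 still produces the out-of-bounds index from the traceback. The helper clip_polygon and the sample polygon are made up for this example and are not the code from post_prosessing.py.

```python
import numpy as np


def clip_polygon(polygon, max_width, max_height):
    """Clip (x, y) vertices so they are valid indices into a max_height x max_width grid."""
    clipped = polygon.copy()
    clipped[:, 0] = np.clip(clipped[:, 0], 0, max_width - 1)   # x: 0 .. max_width - 1
    clipped[:, 1] = np.clip(clipped[:, 1], 0, max_height - 1)  # y: 0 .. max_height - 1
    return clipped


height, width = 521, 521
grid = np.zeros((height, width), dtype=np.uint8)

# Made-up vertices, some of them outside the image bounds.
polygon = np.array([[600, -3], [10, 521], [520, 0]])
clipped = clip_polygon(polygon, width, height)

# Every vertex can now be used as a (row, column) index without an IndexError.
grid[clipped[:, 1], clipped[:, 0]] = 1
```

Clipping to max_width and max_height instead would leave the second vertex at y = 521, which is one past the last valid row.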
