
NVIDIA_DRIVER_CAPABILITIES=graphics is broken on Jetson devices (1.17.1 or later) #795

Open
yeongrokgim opened this issue Nov 13, 2024 · 4 comments


yeongrokgim commented Nov 13, 2024

Summary

On Jetson (aarch64, Tegra SoC) devices, version 1.17.1 fails to create containers whenever the NVIDIA_DRIVER_CAPABILITIES environment variable contains any of display, graphics, or all.

This can be mitigated by overriding the container environment, for example: docker run -e NVIDIA_DRIVER_CAPABILITIES=compute nvcr.io/....
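A minimal sketch of that workaround, using the l4t-base image from the reproduction steps below (the image tag is only an example; compute,utility is one of the capability sets reported as working in the result table further down):

docker run -it --rm \
    -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
    nvcr.io/nvidia/l4t-base:r36.2.0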

Steps to reproduce

  1. Get a Jetson device. I tested with Xavier AGX and Orin AGX DevKits as a reference.

  2. Install Docker runtime and nvidia-container-runtime=1.17.1-1

  3. Ensure the NVIDIA container runtime is configured as the default Docker runtime (a quick check of the resulting configuration is sketched after these steps). To configure it, run:
    sudo nvidia-ctk runtime configure --set-as-default

  4. Try running a container; the l4t-base image can be used, for example:

    docker run -it --rm \
        -e NVIDIA_DRIVER_CAPABILITIES=all \
        nvcr.io/nvidia/l4t-base:r36.2.0

    or even with a non-Jetson base image:

    docker run -it --rm \
        -e NVIDIA_DRIVER_CAPABILITIES=display \
        -e NVIDIA_VISIBLE_DEVICES=all \
        ubuntu:22.04
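As a quick check for step 3 (a sketch, assuming Docker's default configuration path: nvidia-ctk runtime configure edits /etc/docker/daemon.json, and Docker must be restarted to pick up the change):

    cat /etc/docker/daemon.json    # expect an "nvidia" entry under "runtimes" and, with --set-as-default, "default-runtime": "nvidia"
    sudo systemctl restart docker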

Result

Example of error message

$ docker run -it --rm -e NVIDIA_DRIVER_CAPABILITIES=display -e NVIDIA_VISIBLE_DEVICES=all ubuntu:22.04

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="2024-11-13T17:38:55+09:00" level=info msg="Symlinking /var/lib/docker/overlay2/8af1b1d84ee57db598be489bb9ad58fb2d139b77604aead77526787d18a02900/merged/etc/vulkan/icd.d/nvidia_icd.json to /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json"
time="2024-11-13T17:38:55+09:00" level=error msg="failed to create link [/usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json /etc/vulkan/icd.d/nvidia_icd.json]: failed to create symlink: failed to remove existing file: remove /var/lib/docker/overlay2/8af1b1d84ee57db598be489bb9ad58fb2d139b77604aead77526787d18a02900/merged/etc/vulkan/icd.d/nvidia_icd.json: device or resource busy": unknown.

| Hardware | Jetpack | nvidia-container-toolkit | NVIDIA_DRIVER_CAPABILITIES | Result |
| --- | --- | --- | --- | --- |
| Orin AGX | 6.1 | 1.14.2 | all | Good |
| Orin AGX | 6.1 | 1.17.1 | all | Error |
| Orin AGX | 6.1 | 1.17.1 | compute,utility | Good |
| Orin AGX | 6.1 | 1.17.1 | display | Error |
| Orin AGX | 6.1 | 1.17.1 | graphics | Error |
| Xavier AGX | 5.1.2 | 1.16.1 | all | Good |
| Xavier AGX | 5.1.2 | 1.16.1 | graphics | Good |
| Xavier AGX | 5.1.2 | 1.17.1 | all | Error |
| Xavier AGX | 5.1.2 | 1.17.1 | compute | Good |
| Xavier AGX | 5.1.2 | 1.17.1 | display | Error |
| Xavier AGX | 5.1.2 | 1.17.1 | graphics | Error |

robcowie commented Nov 13, 2024

I can confirm this behaviour on the following additional env:

| Hardware | Jetpack | nvidia-container-toolkit | NVIDIA_DRIVER_CAPABILITIES | Result |
| --- | --- | --- | --- | --- |
| Orin AGX | 5.1 (l4t 35.2.1) | 1.17.1 | all | Error |
| Orin AGX | 5.1 (l4t 35.2.1) | 1.16.2 | all | Good |

Both on Ubuntu 20.04, Docker version 27.3.1.

The failing symlink happens to be the first sym declaration in /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv. Removing it causes the container run to fail at the next symlink, suggesting it is not that specific file at fault but something more fundamental.
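For reference, the symlink entries the hook works through can be listed straight from that CSV (a read-only check, assuming the default JetPack location of the file):

grep '^sym' /etc/nvidia-container-runtime/host-files-for-container.d/l4t.csv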

I suspect that somewhere in v1.16.2...v1.17.1 there is a change to symlink handling that has broken this functionality.
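A starting point for narrowing that down could be (a sketch, assuming a checkout of the nvidia-container-toolkit repository with release tags named as above):

git log --oneline v1.16.2..v1.17.1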


YasharSL commented Nov 18, 2024

Facing the same issue with:

| Hardware | Jetpack | L4T | nvidia-container-toolkit |
| --- | --- | --- | --- |
| Orin AGX | 5.1.1 | 35.3.1 | 1.17.2 |

It's also worth mentioning that I have both CUDA 11.8 and 11.4 on my Jetson. The nvcr.io/nvidia/pytorch:22.12-py3 image (CUDA 11.8) runs fine with the NVIDIA Container Toolkit runtime, but my other images, which used to work with previous toolkit versions and use CUDA 11.4, now show this same error.

Temporary Fix

Downgraded the container toolkit to 1.16.2 with the following steps:

sudo apt purge nvidia-container-toolkit

sudo apt-get install -y --allow-downgrades nvidia-container-toolkit-base=1.16.2-1

sudo apt-get install -y --allow-downgrades nvidia-container-toolkit=1.16.2-1
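
Optionally, to stop apt from pulling the broken version back in on the next upgrade (not part of the original workaround, just standard apt pinning):

sudo apt-mark hold nvidia-container-toolkit nvidia-container-toolkit-base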

@yeongrokgim yeongrokgim changed the title 1.17.1 - NVIDIA_DRIVER_CAPABILITIES=graphics is broken on Jetson devices NVIDIA_DRIVER_CAPABILITIES=graphics is broken on Jetson devices (1.17.1 or later) Nov 19, 2024
@mcasasola

I am also experiencing this issue on my Jetson device. Here are the details of my setup:

Hardware: Jetson Orin 16GB
JetPack Version: 5.1.1 (L4T 35.3.1)
NVIDIA Container Toolkit Version: 1.17.2-1

When I attempt to run a container using the NVIDIA runtime, I receive the following error message:

sudo docker run --rm --runtime=nvidia shibenyong/devicequery ./deviceQuery
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #1: error running hook: exit status 1, stdout: , stderr: time="2024-11-21T16:37:37-03:00" level=info msg="Symlinking /mnt/storage/docker/overlay2/9289f0d60214918d874fb047d047dd9b8fa01f89d8332a26c25ba071a9af599d/merged/etc/vulkan/icd.d/nvidia_icd.json to /usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json"
time="2024-11-21T16:37:37-03:00" level=error msg="failed to create link [/usr/lib/aarch64-linux-gnu/tegra/nvidia_icd.json /etc/vulkan/icd.d/nvidia_icd.json]: failed to create symlink: failed to remove existing file: remove /mnt/storage/docker/overlay2/9289f0d60214918d874fb047d047dd9b8fa01f89d8332a26c25ba071a9af599d/merged/etc/vulkan/icd.d/nvidia_icd.json: device or resource busy": unknown.

After downgrading to nvidia-container-toolkit version 1.15.0-1, the container runs successfully:


sudo docker run --rm --runtime=nvidia shibenyong/devicequery ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Orin"
  CUDA Driver Version / Runtime Version          11.4 / 10.2
  CUDA Capability Major/Minor version number:    8.7
  Total amount of global memory:                 15389 MBytes (16136331264 bytes)
  (rest of the output)
Result = PASS

To resolve the issue, I only needed to purge nvidia-container-toolkit-base and nvidia-container-toolkit, and install version 1.15.0-1 of both. Here are the steps I followed:

sudo apt-get remove --purge nvidia-container-toolkit nvidia-container-toolkit-base
sudo apt-get install nvidia-container-toolkit=1.15.0-1 nvidia-container-toolkit-base=1.15.0-1
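
To confirm which version actually ended up installed (a sketch using standard Debian/Ubuntu tooling, not part of the original steps):

dpkg -l | grep nvidia-container
nvidia-ctk --version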

After downgrading, the containers are running correctly using the NVIDIA runtime.

Summary of my findings:

Hardware: Jetson Orin 16GB
JetPack Version: 5.1.1
NVIDIA Container Toolkit:
  1.17.2-1: Error when running containers with the NVIDIA runtime
  1.15.0-1: Works correctly when running containers with the NVIDIA runtime

It appears that the issue persists in version 1.17.2-1 on JetPack 5.1.1. Downgrading to an earlier version of the NVIDIA Container Toolkit resolves the problem. Note that it's sufficient to downgrade only nvidia-container-toolkit and nvidia-container-toolkit-base to version 1.15.0-1; there's no need to purge or downgrade other NVIDIA packages.

I hope this information helps in identifying and fixing the bug.

@elezar elezar self-assigned this Nov 25, 2024
@elezar elezar added the bug Issue/PR to expose/discuss/fix a bug label Nov 25, 2024

Chao-Yao commented Dec 8, 2024

Facing the same issue.
Hardware: Jetson Orin NX
JetPack 5.1.1
nvidia-container-toolkit 1.17.2-1
