nvidia-container-runtime unable to signal init: permission denied #796

Open
sense-amid-madness opened this issue Nov 13, 2024 · 1 comment

Hi, on one of my GPU servers, GPU containers that use the NVIDIA container runtime start up and run fine, but fail to terminate because of a permission error. What could be the cause?

The error appears when trying to shut down a container:

sudo ctr -n k8s.io task kill fddedcb271ff4df58b5e539fb246ca86700db730ecde0ae7c38be0d1c77d39e1
ctr: unknown error after kill: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: unable to signal init: permission denied
: unknown

The NVIDIA Container Toolkit version is 1.17.1 and the containerd version is 1.7.12.

Thanks much.

sense-amid-madness commented Nov 13, 2024

I found the solution to the issue. I'll leave it here for anybody who stumbles over this thread with the same problem.

The issue is actually not with nvidia-container-runtime, but with a broken AppArmor profile that prevents runc from sending kill signals to containers, as documented here:

moby/moby#47749
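
To check whether the same cause applies on another machine, one option is to look for AppArmor denials that mention signals in the kernel log. The following is only a minimal sketch: `filter_signal_denials` is a hypothetical helper name, and it assumes the denials are logged with the usual `apparmor="DENIED"` and `operation="signal"` audit fields, which may differ between kernel and AppArmor versions.

```shell
#!/bin/sh
# Sketch: keep only AppArmor denial records that concern signals.
# Reads kernel/audit log lines on stdin and prints the matching ones.
# filter_signal_denials is a hypothetical name; the two matched fields
# are assumed to follow the usual AppArmor audit format.
filter_signal_denials() {
    grep 'apparmor="DENIED"' | grep 'operation="signal"'
}

# Possible usage (reading the kernel log usually needs root):
#   sudo dmesg | filter_signal_denials
#   sudo journalctl -k | filter_signal_denials
```

If this prints lines that name runc around the time a `task kill` fails, that would be consistent with the AppArmor explanation; no output does not rule it out, since the log may have rotated or be rate-limited.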

A quick (and very dirty) workaround is to move the runc executable from /usr/sbin/runc to /usr/bin/runc, as it then runs without the broken AppArmor profile. After I did this, all containers stuck in Terminating were killed immediately, and everything worked fine again.
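
The move described above could be wrapped in a small guard so that it does not overwrite anything by accident. This is a sketch under assumptions, not a recommended fix: `move_runc` is a hypothetical helper, the default paths are the two named in this thread, and it assumes the container runtime still finds runc at the new location, as it apparently did here.

```shell
#!/bin/sh
# Sketch of the workaround as a guarded move. move_runc is a hypothetical
# helper; the default paths are the ones named in this thread.
move_runc() {
    src=${1:-/usr/sbin/runc}
    dst=${2:-/usr/bin/runc}

    # Refuse to continue if there is nothing executable to move.
    if [ ! -x "$src" ]; then
        echo "move_runc: no executable at $src" >&2
        return 1
    fi
    # Refuse to overwrite an existing file at the destination.
    if [ -e "$dst" ]; then
        echo "move_runc: $dst already exists, not overwriting" >&2
        return 1
    fi

    mv -- "$src" "$dst"
}

# Possible usage, as root:
#   move_runc
```

A later package upgrade will likely put runc back at its original path, so this stays a stopgap until the AppArmor profile itself is fixed.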
