Hi, on one of my GPU servers, GPU containers using the NVIDIA container runtime fail to terminate because of a permission error. What could be the cause? They start up and run fine.
The error appears when trying to shut down a container:
sudo ctr -n k8s.io task kill fddedcb271ff4df58b5e539fb246ca86700db730ecde0ae7c38be0d1c77d39e1
ctr: unknown error after kill: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: unable to signal init: permission denied
: unknown
Toolkit version is 1.17.1, containerd version 1.7.12.
Thanks much.
I found the cause of the issue. For anybody stumbling over this thread with the same problem, I'll leave it here.
The issue is actually not with nvidia-container-runtime, but with a broken AppArmor profile that prevents runc from sending the kill signal to containers, as documented here:
A quick (and very dirty) workaround is to move the runc executable from /usr/sbin/runc to /usr/bin/runc, as it then runs without the broken AppArmor profile. After I did this, all containers stuck on Terminating were killed immediately, and everything worked fine again.
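For anyone who wants to confirm that AppArmor is really the culprit before moving runc around, here is a rough sketch of a log filter. It assumes the denials show up in the kernel log as audit records containing apparmor="DENIED" and operation="signal"; the helper name apparmor_signal_denials is made up for illustration, and the exact record fields may differ on your distribution.

```shell
#!/bin/sh
# Hypothetical helper: keep only AppArmor denial records for signal
# operations from kernel log text supplied on stdin.
# Assumption: denials are logged as audit records that contain both
# apparmor="DENIED" and operation="signal".
apparmor_signal_denials() {
    grep 'apparmor="DENIED"' | grep 'operation="signal"'
}

# Typical use on the affected host (needs root, so left commented out):
#   sudo dmesg | apparmor_signal_denials
#   sudo journalctl -k | apparmor_signal_denials
```

If the filter prints records that mention runc around the time of a failed kill, that points at the AppArmor profile rather than the NVIDIA runtime. No output does not rule it out, since the kernel log may have rotated.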