chroot: failed to run command 'nvidia-smi': No such file or directory #1063

Open
vanloswang opened this issue Oct 24, 2024 · 1 comment

OS environment information

# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

# uname -a
Linux a100 5.15.0-107-generic #117~20.04.1-Ubuntu SMP Tue Apr 30 10:35:57 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

GPU environment information

# dpkg -l | grep nvidia
ii  libnvidia-cfg1-560:amd64                      560.35.03-0ubuntu1                   amd64        NVIDIA binary OpenGL/GLX configuration library
ii  libnvidia-common-560                          560.35.03-0ubuntu1                   all          Shared files used by the NVIDIA libraries
rc  libnvidia-compute-515:amd64                   515.65.01-0ubuntu1                   amd64        NVIDIA libcompute package
rc  libnvidia-compute-525:amd64                   525.147.05-0ubuntu2.20.04.1          amd64        NVIDIA libcompute package (transitional package)
rc  libnvidia-compute-535:amd64                   535.171.04-0ubuntu0.20.04.1          amd64        NVIDIA libcompute package
ii  libnvidia-compute-560:amd64                   560.35.03-0ubuntu1                   amd64        NVIDIA libcompute package
ii  libnvidia-container-tools                     1.16.1-1                             amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64                    1.16.1-1                             amd64        NVIDIA container runtime library
ii  libnvidia-decode-560:amd64                    560.35.03-0ubuntu1                   amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-encode-560:amd64                    560.35.03-0ubuntu1                   amd64        NVENC Video Encoding runtime library
ii  libnvidia-extra-560:amd64                     560.35.03-0ubuntu1                   amd64        Extra libraries for the NVIDIA driver
ii  libnvidia-fbc1-560:amd64                      560.35.03-0ubuntu1                   amd64        NVIDIA OpenGL-based Framebuffer Capture runtime library
ii  libnvidia-gl-560:amd64                        560.35.03-0ubuntu1                   amd64        NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
rc  nvidia-compute-utils-535                      535.171.04-0ubuntu0.20.04.1          amd64        NVIDIA compute utilities
ii  nvidia-compute-utils-560                      560.35.03-0ubuntu1                   amd64        NVIDIA compute utilities
ii  nvidia-container-runtime                      3.14.0-1                             all          NVIDIA Container Toolkit meta-package
ii  nvidia-container-toolkit                      1.16.1-1                             amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base                 1.16.1-1                             amd64        NVIDIA Container Toolkit Base
rc  nvidia-dkms-535                               535.171.04-0ubuntu0.20.04.1          amd64        NVIDIA DKMS package
ii  nvidia-dkms-560                               560.35.03-0ubuntu1                   amd64        NVIDIA DKMS package
ii  nvidia-docker2                                2.14.0-1                             all          NVIDIA Container Toolkit meta-package
ii  nvidia-driver-560                             560.35.03-0ubuntu1                   amd64        NVIDIA driver metapackage
ii  nvidia-driver-local-repo-ubuntu2004-560.35.03 1.0-1                                amd64        nvidia-driver-local repository configuration files
ii  nvidia-firmware-535-535.171.04                535.171.04-0ubuntu0.20.04.1          amd64        Firmware files used by the kernel module
ii  nvidia-firmware-560-560.35.03                 560.35.03-0ubuntu1                   amd64        Firmware files used by the kernel module
rc  nvidia-kernel-common-535                      535.171.04-0ubuntu0.20.04.1          amd64        Shared files used with the kernel module
ii  nvidia-kernel-common-560                      560.35.03-0ubuntu1                   amd64        Shared files used with the kernel module
ii  nvidia-kernel-source-560                      560.35.03-0ubuntu1                   amd64        NVIDIA kernel source package
ii  nvidia-prime                                  0.8.16~0.20.04.2                     all          Tools to enable NVIDIA's Prime
ii  nvidia-settings                               515.65.01-0ubuntu1                   amd64        Tool for configuring the NVIDIA graphics driver
ii  nvidia-utils-560                              560.35.03-0ubuntu1                   amd64        NVIDIA driver support binaries
ii  screen-resolution-extra                       0.18build1                           all          Extension for the nvidia-settings control panel
ii  xserver-xorg-video-nvidia-560                 560.35.03-0ubuntu1                   amd64        NVIDIA binary Xorg driver

# nvidia-smi
Thu Oct 24 09:19:00 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off |   00000000:41:00.0 Off |                    0 |
| N/A   32C    P0             36W /  250W |    3028MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off |   00000000:C1:00.0 Off |                    0 |
| N/A   34C    P0             36W /  250W |      17MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      2221      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A      5347      C   python                                        998MiB |
|    0   N/A  N/A      6150      C   /opt/conda/bin/python                         998MiB |
|    0   N/A  N/A      6151      C   /opt/conda/bin/python                         998MiB |
|    1   N/A  N/A      2221      G   /usr/lib/xorg/Xorg                              4MiB |
+-----------------------------------------------------------------------------------------+

Installation steps for the GPU Operator

# kubectl create ns gpu-operator
namespace/gpu-operator created

# helm install --kubeconfig=/var/lib/secctr/k3s/server/cred/admin.kubeconfig gpu-operator -n gpu-operator . --values values.yaml
NAME: gpu-operator
LAST DEPLOYED: Thu Oct 24 09:19:31 2024
NAMESPACE: gpu-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None

# helm list -n gpu-operator
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator    gpu-operator    1               2024-10-24 09:19:31.372209469 +0800 CST deployed        gpu-operator-v24.6.2    v24.6.2
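
Note: the contents of the values.yaml passed to helm install are not shown here. As a sanity check (a hedged suggestion; the ClusterPolicy resource created by the chart is normally named cluster-policy), the resolved driver setting can be read back from the deployed ClusterPolicy. If it is true, the operator is expected to run the nvidia-driver-daemonset and populate /run/nvidia/driver itself:

# kubectl get clusterpolicies.nvidia.com cluster-policy -o jsonpath='{.spec.driver.enabled}'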

# kubectl get all -n gpu-operator
NAME                                                              READY   STATUS     RESTARTS   AGE
pod/gpu-feature-discovery-z4v8q                                   0/1     Init:0/1   0          89s
pod/gpu-operator-7d66589d9b-rkqrm                                 1/1     Running    0          93s
pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66       1/1     Running    0          93s
pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj   1/1     Running    0          93s
pod/gpu-operator-node-feature-discovery-worker-6bwfw              1/1     Running    0          93s
pod/nvidia-container-toolkit-daemonset-lkqz9                      0/1     Init:0/1   0          90s
pod/nvidia-dcgm-exporter-lxjqv                                    0/1     Init:0/1   0          89s
pod/nvidia-device-plugin-daemonset-5tncc                          0/1     Init:0/1   0          89s
pod/nvidia-operator-validator-jctzl                               0/1     Init:0/4   0          90s

NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/gpu-operator           ClusterIP   10.43.21.239    <none>        8080/TCP   91s
service/nvidia-dcgm-exporter   ClusterIP   10.43.126.172   <none>        9400/TCP   90s

NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                          AGE
daemonset.apps/gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                       89s
daemonset.apps/gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                                                 93s
daemonset.apps/nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true                           90s
daemonset.apps/nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true                               89s
daemonset.apps/nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true                               90s
daemonset.apps/nvidia-device-plugin-mps-control-daemon      0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true,nvidia.com/mps.capable=true   89s
daemonset.apps/nvidia-driver-daemonset                      0         0         0       0            0           nvidia.com/gpu.deploy.driver=true                                      90s
daemonset.apps/nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                 89s
daemonset.apps/nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true                          90s

NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator                                 1/1     1            1           93s
deployment.apps/gpu-operator-node-feature-discovery-gc       1/1     1            1           93s
deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           93s

NAME                                                                    DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-operator-7d66589d9b                                 1         1         1       93s
replicaset.apps/gpu-operator-node-feature-discovery-gc-7478549676       1         1         1       93s
replicaset.apps/gpu-operator-node-feature-discovery-master-67769784f5   1         1         1       93s
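
Note that daemonset.apps/nvidia-driver-daemonset above shows DESIRED 0 and its node selector is nvidia.com/gpu.deploy.driver=true, so no driver pod is currently scheduled on the node. One hedged way to see how the node is labelled (node name taken from the events further below):

# kubectl get node de9e0472.secctr.com --show-labels | tr ',' '\n' | grep nvidia.com/gpu.deploy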

# crictl image ls | grep -e nvidia -e vgpu -e nfd
nvcr.io/nvidia/cloud-native/dcgm                                3.3.7-1-ubuntu22.04            292733d61a20b       1.99GB
nvcr.io/nvidia/cloud-native/gpu-operator-validator              latest                         8371f914ffba3       324MB
nvcr.io/nvidia/cloud-native/gpu-operator-validator              v24.6.2                        8371f914ffba3       324MB
nvcr.io/nvidia/cloud-native/k8s-cc-manager                      v0.1.1                         c5006389d56b3       647MB
nvcr.io/nvidia/cloud-native/k8s-driver-manager                  v0.6.10                        dd9cff3ea5509       590MB
nvcr.io/nvidia/cloud-native/k8s-kata-manager                    v0.2.1                         082572359e199       449MB
nvcr.io/nvidia/cloud-native/vgpu-device-manager                 v0.2.7                         9f7a380f3f3e0       419MB
nvcr.io/nvidia/cuda                                             12.6.1-base-ubi8               103c9a2598a96       389MB
nvcr.io/nvidia/driver                                           550.90.07                      62f8c7903995a       1.17GB
nvcr.io/nvidia/gpu-operator                                     v24.6.2                        d57aeeb1c5a37       623MB
nvcr.io/nvidia/k8s-device-plugin                                v0.16.2-ubi8                   44edb05883259       505MB
nvcr.io/nvidia/k8s/container-toolkit                            v1.16.2-ubuntu20.04            bdcc66b183991       350MB
nvcr.io/nvidia/k8s/dcgm-exporter                                3.3.7-3.5.0-ubuntu22.04        ee8c6dfbf28aa       350MB
nvcr.io/nvidia/kubevirt-gpu-device-plugin                       v1.2.9                         3b4407d30d0d6       415MB
registry.k8s.io/nfd/node-feature-discovery                      v0.16.3                        bc292d823f05c       226MB

Error information from the driver-validation pod

2024-10-24T09:25:48.176690024+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:48.178993123+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:48.179246578+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:53.179544295+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:53.181845219+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:53.182124683+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:58.182391151+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:58.184547885+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:58.184862084+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:03.185091032+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:03.187436039+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:03.187669887+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:08.187985477+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:08.190198327+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:08.190459947+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:13.190717358+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:13.192869694+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:13.193131254+08:00 command failed, retrying after 5 seconds
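
The validator is literally running chroot /run/nvidia/driver nvidia-smi, so the same failure can be reproduced from the host shell as long as /run/nvidia/driver holds no driver root filesystem (the usr/bin path below assumes the driver container's usual layout):

# ls /run/nvidia/driver/usr/bin/nvidia-smi
# chroot /run/nvidia/driver nvidia-smi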

K8S event info

# kubectl get event -n gpu-operator
LAST SEEN   TYPE     REASON              OBJECT                                                             MESSAGE
12m         Normal   LeaderElection      lease/53822513.nvidia.com                                          gpu-operator-7d66589d9b-rkqrm_4bb2ab72-3969-4020-87e4-1704ded2e72d became leader
12m         Normal   Scheduled           pod/gpu-feature-discovery-z4v8q                                    Successfully assigned gpu-operator/gpu-feature-discovery-z4v8q to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-feature-discovery-z4v8q                                    Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/gpu-feature-discovery-z4v8q                                    Created container toolkit-validation
12m         Normal   Started             pod/gpu-feature-discovery-z4v8q                                    Started container toolkit-validation
12m         Normal   SuccessfulCreate    daemonset/gpu-feature-discovery                                    Created pod: gpu-feature-discovery-z4v8q
12m         Normal   Scheduled           pod/gpu-operator-7d66589d9b-rkqrm                                  Successfully assigned gpu-operator/gpu-operator-7d66589d9b-rkqrm to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-7d66589d9b-rkqrm                                  Container image "nvcr.io/nvidia/gpu-operator:v24.6.2" already present on machine
12m         Normal   Created             pod/gpu-operator-7d66589d9b-rkqrm                                  Created container gpu-operator
12m         Normal   Started             pod/gpu-operator-7d66589d9b-rkqrm                                  Started container gpu-operator
12m         Normal   SuccessfulCreate    replicaset/gpu-operator-7d66589d9b                                 Created pod: gpu-operator-7d66589d9b-rkqrm
12m         Normal   Scheduled           pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m         Normal   Created             pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Created container gc
12m         Normal   Started             pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66        Started container gc
12m         Normal   SuccessfulCreate    replicaset/gpu-operator-node-feature-discovery-gc-7478549676       Created pod: gpu-operator-node-feature-discovery-gc-7478549676-zzr66
12m         Normal   ScalingReplicaSet   deployment/gpu-operator-node-feature-discovery-gc                  Scaled up replica set gpu-operator-node-feature-discovery-gc-7478549676 to 1
12m         Normal   Scheduled           pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m         Normal   Created             pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Created container master
12m         Normal   Started             pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj    Started container master
12m         Normal   SuccessfulCreate    replicaset/gpu-operator-node-feature-discovery-master-67769784f5   Created pod: gpu-operator-node-feature-discovery-master-67769784f5-7pqvj
12m         Normal   ScalingReplicaSet   deployment/gpu-operator-node-feature-discovery-master              Scaled up replica set gpu-operator-node-feature-discovery-master-67769784f5 to 1
12m         Normal   Scheduled           pod/gpu-operator-node-feature-discovery-worker-6bwfw               Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-6bwfw to de9e0472.secctr.com
12m         Normal   Pulled              pod/gpu-operator-node-feature-discovery-worker-6bwfw               Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m         Normal   Created             pod/gpu-operator-node-feature-discovery-worker-6bwfw               Created container worker
12m         Normal   Started             pod/gpu-operator-node-feature-discovery-worker-6bwfw               Started container worker
12m         Normal   SuccessfulCreate    daemonset/gpu-operator-node-feature-discovery-worker               Created pod: gpu-operator-node-feature-discovery-worker-6bwfw
12m         Normal   ScalingReplicaSet   deployment/gpu-operator                                            Scaled up replica set gpu-operator-7d66589d9b to 1
12m         Normal   Scheduled           pod/nvidia-container-toolkit-daemonset-lkqz9                       Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-lkqz9 to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-container-toolkit-daemonset-lkqz9                       Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-container-toolkit-daemonset-lkqz9                       Created container driver-validation
12m         Normal   Started             pod/nvidia-container-toolkit-daemonset-lkqz9                       Started container driver-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-container-toolkit-daemonset                       Created pod: nvidia-container-toolkit-daemonset-lkqz9
12m         Normal   Scheduled           pod/nvidia-dcgm-exporter-lxjqv                                     Successfully assigned gpu-operator/nvidia-dcgm-exporter-lxjqv to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-dcgm-exporter-lxjqv                                     Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-dcgm-exporter-lxjqv                                     Created container toolkit-validation
12m         Normal   Started             pod/nvidia-dcgm-exporter-lxjqv                                     Started container toolkit-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-dcgm-exporter                                     Created pod: nvidia-dcgm-exporter-lxjqv
12m         Normal   Scheduled           pod/nvidia-device-plugin-daemonset-5tncc                           Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-5tncc to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-device-plugin-daemonset-5tncc                           Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-device-plugin-daemonset-5tncc                           Created container toolkit-validation
12m         Normal   Started             pod/nvidia-device-plugin-daemonset-5tncc                           Started container toolkit-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-device-plugin-daemonset                           Created pod: nvidia-device-plugin-daemonset-5tncc
12m         Normal   Scheduled           pod/nvidia-driver-daemonset-vpvv6                                  Successfully assigned gpu-operator/nvidia-driver-daemonset-vpvv6 to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-driver-daemonset-vpvv6                                  Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10" already present on machine
12m         Normal   Created             pod/nvidia-driver-daemonset-vpvv6                                  Created container k8s-driver-manager
12m         Normal   Started             pod/nvidia-driver-daemonset-vpvv6                                  Started container k8s-driver-manager
12m         Normal   Killing             pod/nvidia-driver-daemonset-vpvv6                                  Stopping container k8s-driver-manager
12m         Normal   SuccessfulCreate    daemonset/nvidia-driver-daemonset                                  Created pod: nvidia-driver-daemonset-vpvv6
12m         Normal   SuccessfulDelete    daemonset/nvidia-driver-daemonset                                  Deleted pod: nvidia-driver-daemonset-vpvv6
12m         Normal   Scheduled           pod/nvidia-operator-validator-jctzl                                Successfully assigned gpu-operator/nvidia-operator-validator-jctzl to de9e0472.secctr.com
12m         Normal   Pulled              pod/nvidia-operator-validator-jctzl                                Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m         Normal   Created             pod/nvidia-operator-validator-jctzl                                Created container driver-validation
12m         Normal   Started             pod/nvidia-operator-validator-jctzl                                Started container driver-validation
12m         Normal   SuccessfulCreate    daemonset/nvidia-operator-validator                                Created pod: nvidia-operator-validator-jctzl

Information about /run/nvidia/

# ls /run/nvidia/
driver  mps  toolkit  validations
# ls /run/nvidia/driver/
lib
# ls /run/nvidia/driver/lib/
firmware
# ls /run/nvidia/driver/lib/firmware/
# ls /run/nvidia/mps/
# ls /run/nvidia/toolkit/
# ls /run/nvidia/validations/
#

Question: how do I set up /run/nvidia/driver/ so that it contains nvidia-smi and the other driver binaries?


bigstinky86 commented Dec 13, 2024

@vanloswang check that the daemonset/nvidia-driver-daemonset pods are up and running. These pods handle the driver compilation and installation. If they are not up and running, check their logs and fix any errors.
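
For example, something like the following should show whether the daemonset has pods and what they log (resource names taken from the output above; --all-containers avoids having to guess the container name):

# kubectl get ds nvidia-driver-daemonset -n gpu-operator
# kubectl get pods -n gpu-operator | grep nvidia-driver-daemonset
# kubectl logs daemonset/nvidia-driver-daemonset -n gpu-operator --all-containers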

A successful driver compilation and installation log should end with the following lines:

Parsing kernel module parameters...
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Check SELinux status
SELinux is enabled
Change device files security context for selinux compatibility
Done, now waiting for signal

Once the nvidia-driver-daemonset pods are up and running, all the remaining NVIDIA-related pods should find the executables they need and come up.
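
Roughly, once the driver pod is Running you should see the other pods leave the Init state, and /run/nvidia/validations/ (empty in the report above) should start to contain the validator's status files:

# kubectl get pods -n gpu-operator
# ls /run/nvidia/validations/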
