error information of driver-validation pod
2024-10-24T09:25:48.176690024+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:48.178993123+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:48.179246578+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:53.179544295+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:53.181845219+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:53.182124683+08:00 command failed, retrying after 5 seconds
2024-10-24T09:25:58.182391151+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:25:58.184547885+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:25:58.184862084+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:03.185091032+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:03.187436039+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:03.187669887+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:08.187985477+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:08.190198327+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:08.190459947+08:00 command failed, retrying after 5 seconds
2024-10-24T09:26:13.190717358+08:00 running command chroot with args [/run/nvidia/driver nvidia-smi]
2024-10-24T09:26:13.192869694+08:00 chroot: failed to run command 'nvidia-smi': No such file or directory
2024-10-24T09:26:13.193131254+08:00 command failed, retrying after 5 seconds
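The driver-validation init container is retrying its readiness check here: it chroots into the driver container's root filesystem at /run/nvidia/driver and runs nvidia-smi from inside it. The same check can be reproduced manually on the node (a debugging sketch added for clarity, assuming the default mount path shown in the log):
# chroot /run/nvidia/driver nvidia-smi
It fails with the same 'No such file or directory' error because, as the listing of /run/nvidia/ below shows, the driver rootfs has not been populated.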
K8S event info
# kubectl get event -n gpu-operator
LAST SEEN TYPE REASON OBJECT MESSAGE
12m Normal LeaderElection lease/53822513.nvidia.com gpu-operator-7d66589d9b-rkqrm_4bb2ab72-3969-4020-87e4-1704ded2e72d became leader
12m Normal Scheduled pod/gpu-feature-discovery-z4v8q Successfully assigned gpu-operator/gpu-feature-discovery-z4v8q to de9e0472.secctr.com
12m Normal Pulled pod/gpu-feature-discovery-z4v8q Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/gpu-feature-discovery-z4v8q Created container toolkit-validation
12m Normal Started pod/gpu-feature-discovery-z4v8q Started container toolkit-validation
12m Normal SuccessfulCreate daemonset/gpu-feature-discovery Created pod: gpu-feature-discovery-z4v8q
12m Normal Scheduled pod/gpu-operator-7d66589d9b-rkqrm Successfully assigned gpu-operator/gpu-operator-7d66589d9b-rkqrm to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-7d66589d9b-rkqrm Container image "nvcr.io/nvidia/gpu-operator:v24.6.2" already present on machine
12m Normal Created pod/gpu-operator-7d66589d9b-rkqrm Created container gpu-operator
12m Normal Started pod/gpu-operator-7d66589d9b-rkqrm Started container gpu-operator
12m Normal SuccessfulCreate replicaset/gpu-operator-7d66589d9b Created pod: gpu-operator-7d66589d9b-rkqrm
12m Normal Scheduled pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m Normal Created pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Created container gc
12m Normal Started pod/gpu-operator-node-feature-discovery-gc-7478549676-zzr66 Started container gc
12m Normal SuccessfulCreate replicaset/gpu-operator-node-feature-discovery-gc-7478549676 Created pod: gpu-operator-node-feature-discovery-gc-7478549676-zzr66
12m Normal ScalingReplicaSet deployment/gpu-operator-node-feature-discovery-gc Scaled up replica set gpu-operator-node-feature-discovery-gc-7478549676 to 1
12m Normal Scheduled pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m Normal Created pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Created container master
12m Normal Started pod/gpu-operator-node-feature-discovery-master-67769784f5-7pqvj Started container master
12m Normal SuccessfulCreate replicaset/gpu-operator-node-feature-discovery-master-67769784f5 Created pod: gpu-operator-node-feature-discovery-master-67769784f5-7pqvj
12m Normal ScalingReplicaSet deployment/gpu-operator-node-feature-discovery-master Scaled up replica set gpu-operator-node-feature-discovery-master-67769784f5 to 1
12m Normal Scheduled pod/gpu-operator-node-feature-discovery-worker-6bwfw Successfully assigned gpu-operator/gpu-operator-node-feature-discovery-worker-6bwfw to de9e0472.secctr.com
12m Normal Pulled pod/gpu-operator-node-feature-discovery-worker-6bwfw Container image "registry.k8s.io/nfd/node-feature-discovery:v0.16.3" already present on machine
12m Normal Created pod/gpu-operator-node-feature-discovery-worker-6bwfw Created container worker
12m Normal Started pod/gpu-operator-node-feature-discovery-worker-6bwfw Started container worker
12m Normal SuccessfulCreate daemonset/gpu-operator-node-feature-discovery-worker Created pod: gpu-operator-node-feature-discovery-worker-6bwfw
12m Normal ScalingReplicaSet deployment/gpu-operator Scaled up replica set gpu-operator-7d66589d9b to 1
12m Normal Scheduled pod/nvidia-container-toolkit-daemonset-lkqz9 Successfully assigned gpu-operator/nvidia-container-toolkit-daemonset-lkqz9 to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-container-toolkit-daemonset-lkqz9 Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-container-toolkit-daemonset-lkqz9 Created container driver-validation
12m Normal Started pod/nvidia-container-toolkit-daemonset-lkqz9 Started container driver-validation
12m Normal SuccessfulCreate daemonset/nvidia-container-toolkit-daemonset Created pod: nvidia-container-toolkit-daemonset-lkqz9
12m Normal Scheduled pod/nvidia-dcgm-exporter-lxjqv Successfully assigned gpu-operator/nvidia-dcgm-exporter-lxjqv to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-dcgm-exporter-lxjqv Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-dcgm-exporter-lxjqv Created container toolkit-validation
12m Normal Started pod/nvidia-dcgm-exporter-lxjqv Started container toolkit-validation
12m Normal SuccessfulCreate daemonset/nvidia-dcgm-exporter Created pod: nvidia-dcgm-exporter-lxjqv
12m Normal Scheduled pod/nvidia-device-plugin-daemonset-5tncc Successfully assigned gpu-operator/nvidia-device-plugin-daemonset-5tncc to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-device-plugin-daemonset-5tncc Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-device-plugin-daemonset-5tncc Created container toolkit-validation
12m Normal Started pod/nvidia-device-plugin-daemonset-5tncc Started container toolkit-validation
12m Normal SuccessfulCreate daemonset/nvidia-device-plugin-daemonset Created pod: nvidia-device-plugin-daemonset-5tncc
12m Normal Scheduled pod/nvidia-driver-daemonset-vpvv6 Successfully assigned gpu-operator/nvidia-driver-daemonset-vpvv6 to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-driver-daemonset-vpvv6 Container image "nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.10" already present on machine
12m Normal Created pod/nvidia-driver-daemonset-vpvv6 Created container k8s-driver-manager
12m Normal Started pod/nvidia-driver-daemonset-vpvv6 Started container k8s-driver-manager
12m Normal Killing pod/nvidia-driver-daemonset-vpvv6 Stopping container k8s-driver-manager
12m Normal SuccessfulCreate daemonset/nvidia-driver-daemonset Created pod: nvidia-driver-daemonset-vpvv6
12m Normal SuccessfulDelete daemonset/nvidia-driver-daemonset Deleted pod: nvidia-driver-daemonset-vpvv6
12m Normal Scheduled pod/nvidia-operator-validator-jctzl Successfully assigned gpu-operator/nvidia-operator-validator-jctzl to de9e0472.secctr.com
12m Normal Pulled pod/nvidia-operator-validator-jctzl Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v24.6.2" already present on machine
12m Normal Created pod/nvidia-operator-validator-jctzl Created container driver-validation
12m Normal Started pod/nvidia-operator-validator-jctzl Started container driver-validation
12m Normal SuccessfulCreate daemonset/nvidia-operator-validator Created pod: nvidia-operator-validator-jctzl
information of /run/nvidia/
# ls /run/nvidia/
driver mps toolkit validations
# ls /run/nvidia/driver/
lib
# ls /run/nvidia/driver/lib/
firmware
# ls /run/nvidia/driver/lib/firmware/
# ls /run/nvidia/mps/
# ls /run/nvidia/toolkit/
# ls /run/nvidia/validations/
#
question: how to set up /run/nvidia/driver/ with nvidia-smi and so on?
@vanloswang check that the daemonset/nvidia-driver-daemonset pods are up and running. These pods handle the driver compilation and installation. If the pods are not up and running, check their logs and fix any errors.
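For example (a sketch using the namespace and resource names from the events above; the label selector and container name are the usual GPU Operator defaults and may differ in your deployment):
# kubectl get pods -n gpu-operator -l app=nvidia-driver-daemonset -o wide
# kubectl describe daemonset/nvidia-driver-daemonset -n gpu-operator
# kubectl logs -n gpu-operator daemonset/nvidia-driver-daemonset -c nvidia-driver-ctr --tail=100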
A successful driver compilation and installation log should end with the following lines:
Parsing kernel module parameters...
Loading ipmi and i2c_core kernel modules...
Loading NVIDIA driver kernel modules...
+ modprobe nvidia
+ modprobe nvidia-uvm
+ modprobe nvidia-modeset
+ set +o xtrace -o nounset
Starting NVIDIA persistence daemon...
Mounting NVIDIA driver rootfs...
Check SELinux status
SELinux is enabled
Change device files security context for selinux compatibility
Done, now waiting for signal
Once the nvidia-driver-daemonset pods are up and running, all the remaining NVIDIA-related pods should find the executables they need and come up.
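To confirm, you can check from the node that the driver rootfs is populated and that nvidia-smi is reachable through it, and then that the operator pods come up and the GPU resource is advertised (a verification sketch, assuming the GPU Operator's default paths and the node name from the events above; /usr/bin/nvidia-smi is the usual location inside the driver container):
# ls /run/nvidia/driver/usr/bin/nvidia-smi
# chroot /run/nvidia/driver nvidia-smi
# kubectl get pods -n gpu-operator
# kubectl describe node de9e0472.secctr.com | grep -i nvidia.com/gpu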