nvidia-peermem-ctr: /usr/local/bin/nvidia-driver: line 769: RHEL_VERSION: unbound variable #609

Open

takeshi-yoshimura opened this issue Nov 14, 2023 · 2 comments

takeshi-yoshimura commented Nov 14, 2023

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHCOS 4.13
  • Kernel Version: 4.18.0-372.59.1.el8_6.x86_64
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OCP
  • GPU Operator Version: v23.9.0

2. Issue or feature description

The nvidia-peermem-ctr container in the nvidia-driver-daemonset pod crashed. As shown in the logs below, RHEL_VERSION was not set. I think the container should mount /etc/os-release so that it can resolve RHEL_VERSION like the other containers in the same pod. The failure is at the line DNF_RELEASEVER="${RHEL_VERSION}" in /usr/local/bin/nvidia-driver.
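
Since the script runs with set -eu, referencing RHEL_VERSION before it has been assigned aborts it immediately. A minimal sketch of that shell behavior (the variable names are taken from the log below; the ${RHEL_VERSION:-} fallback is only an illustration of a defensive alternative, not what the driver image actually ships):

#!/usr/bin/env bash
set -eu

# In the real nvidia-driver script, _resolve_rhel_version returns without
# assigning RHEL_VERSION when /host-etc/os-release is not mounted, so the
# variable is still unset when it is referenced.

# Under set -u this line aborts with "RHEL_VERSION: unbound variable":
DNF_RELEASEVER="${RHEL_VERSION}"

# A default expansion would fall back to an empty string instead of failing:
DNF_RELEASEVER="${RHEL_VERSION:-}"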

3. Steps to reproduce the issue

I recently installed the GPU Operator with mostly default settings on a RHCOS 4.13 / OpenShift 4.12 cluster (spec.driver.rdma.enabled=true and spec.driver.rdma.useHostMofed=false).

My workaround was to pin spec.driver.version in my ClusterPolicy to an older driver version (535.104.05) instead of the latest one (535.104.12?):

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    version: 535.104.05
    image: driver
    repository: nvcr.io/nvidia
...

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi

kubectl get po:

NAME                                                   READY   STATUS             RESTARTS          AGE
console-plugin-nvidia-gpu-5df7b85d4-m7knm              1/1     Running            0                 10d
gpu-feature-discovery-74jf4                            1/1     Running            0                 3h45m
gpu-feature-discovery-d9bfz                            1/1     Running            0                 2d19h
gpu-feature-discovery-k6q22                            1/1     Running            0                 37m
gpu-operator-6d95d776d6-bvdng                          1/1     Running            0                 2d20h
grafana-deployment-6b4f9fcc9d-4dkcg                    1/1     Running            0                 10d
grafana-operator-controller-manager-595c7978b9-bq95m   2/2     Running            0                 2d20h
nvidia-container-toolkit-daemonset-6hpr8               1/1     Running            0                 37m
nvidia-container-toolkit-daemonset-d76d8               1/1     Running            0                 2d19h
nvidia-container-toolkit-daemonset-wk65s               1/1     Running            0                 3h45m
nvidia-cuda-validator-f5dvk                            0/1     Completed          0                 37m
nvidia-cuda-validator-mjhhf                            0/1     Completed          0                 3h43m
nvidia-dcgm-2bj8w                                      1/1     Running            0                 3h45m
nvidia-dcgm-dm22g                                      1/1     Running            0                 37m
nvidia-dcgm-exporter-cgc6f                             1/1     Running            0                 2d19h
nvidia-dcgm-exporter-jvscf                             1/1     Running            0                 37m
nvidia-dcgm-exporter-m2jqz                             1/1     Running            0                 3h45m
nvidia-dcgm-p5mds                                      1/1     Running            0                 2d19h
nvidia-device-plugin-daemonset-6jsmf                   1/1     Running            0                 3h45m
nvidia-device-plugin-daemonset-jlkw8                   1/1     Running            0                 37m
nvidia-device-plugin-daemonset-rx7hv                   1/1     Running            0                 2d19h
nvidia-driver-daemonset-412.86.202306132230-0-fkjbj    2/3     CrashLoopBackOff   793 (69s ago)     2d19h
nvidia-driver-daemonset-412.86.202306132230-0-jnt77    2/3     CrashLoopBackOff   792 (4m45s ago)   2d19h
nvidia-driver-daemonset-412.86.202306132230-0-w9mv5    2/3     CrashLoopBackOff   48 (4m48s ago)    3h53m
nvidia-mig-manager-2nqpt                               1/1     Running            0                 3h42m
nvidia-mig-manager-b74w5                               1/1     Running            0                 2d19h
nvidia-mig-manager-hgb5r                               1/1     Running            0                 37m
nvidia-node-status-exporter-xbdc7                      1/1     Running            0                 2d19h
nvidia-node-status-exporter-xg627                      1/1     Running            0                 2d19h
nvidia-node-status-exporter-z56bw                      1/1     Running            0                 3h53m
nvidia-operator-validator-d2wjd                        1/1     Running            0                 2d19h
nvidia-operator-validator-j8cmt                        1/1     Running            0                 3h45m
nvidia-operator-validator-lf8tm                        1/1     Running            0                 37m

oc logs nvidia-driver-daemonset-412.86.202306132230-0 -c nvidia-peermem-ctr

+ set -eu
+ RUN_DIR=/run/nvidia
+ PID_FILE=/run/nvidia/nvidia-driver.pid
+ DRIVER_VERSION=535.104.12
+ KERNEL_UPDATE_HOOK=/run/kernel/postinst.d/update-nvidia-driver
+ NUM_VGPU_DEVICES=0
+ NVIDIA_MODULE_PARAMS=()
+ NVIDIA_UVM_MODULE_PARAMS=()
+ NVIDIA_MODESET_MODULE_PARAMS=()
+ NVIDIA_PEERMEM_MODULE_PARAMS=()
+ TARGETARCH=amd64
+ USE_HOST_MOFED=false
+ DNF_RELEASEVER=
+ OPENSHIFT_VERSION=
+ DRIVER_ARCH=x86_64
+ DRIVER_ARCH=x86_64
+ echo 'DRIVER_ARCH is x86_64'
DRIVER_ARCH is x86_64
+++ dirname -- /usr/local/bin/nvidia-driver
++ cd -- /usr/local/bin
++ pwd
+ SCRIPT_DIR=/usr/local/bin
+ source /usr/local/bin/common.sh
++ GPU_DIRECT_RDMA_ENABLED=false
++ GDS_ENABLED=false
+ '[' 1 -eq 0 ']'
+ command=reload_nvidia_peermem
+ shift
+ case "${command}" in
+ options=
+ '[' 0 -ne 0 ']'
+ eval set -- ''
++ set --
+ ACCEPT_LICENSE=
++ uname -r
+ KERNEL_VERSION=4.18.0-372.59.1.el8_6.x86_64
+ PRIVATE_KEY=
+ PACKAGE_TAG=
+ '[' 0 -ne 0 ']'
+ [[ -z '' ]]
+ _resolve_rhel_version
+ '[' -f /host-etc/os-release ']'
+ return 0
/usr/local/bin/nvidia-driver: line 769: RHEL_VERSION: unbound variable

oc describe po nvidia-driver-daemonset-412.86.202306132230-0

Name:                 nvidia-driver-daemonset-412.86.202306132230-0-fkjbj
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 gdr-test-6p2kc-gdr-gpu-il-worker-3-jqpk7/10.241.128.27
Start Time:           Sat, 11 Nov 2023 13:46:21 +0900
Labels:               app=nvidia-driver-daemonset-412.86.202306132230-0
                      app.kubernetes.io/component=nvidia-driver
                      controller-revision-hash=56f9b89d7c
                      nvidia.com/precompiled=false
                      openshift.driver-toolkit=true
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.130.4.10/23"],"mac_address":"0a:58:0a:82:04:0a","gateway_ips":["10.130.4.1"],"ip_address":"10.130.4.10/23"...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.130.4.10"
                            ],
                            "mac": "0a:58:0a:82:04:0a",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.130.4.10"
                            ],
                            "mac": "0a:58:0a:82:04:0a",
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                      openshift.io/scc: nvidia-driver
Status:               Running
IP:                   10.130.4.10
IPs:
  IP:           10.130.4.10
Controlled By:  DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
  mofed-validation:
    Container ID:  cri-o://643d541ca8e6a969364807d44f564bd1d92a0bacf23fcf73e6a23d17aa3b36e6
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:47:08 +0900
      Finished:     Sat, 11 Nov 2023 13:53:18 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
  k8s-driver-manager:
    Container ID:  cri-o://d6dabf0b91a9bef8048c2c2c6da3dd51008e1ef5f58e607b9164b636f15411b6
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:53:38 +0900
      Finished:     Sat, 11 Nov 2023 13:54:11 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://fcbeb32582dde85e6275cd869abac95ba9df285bf004d5d2a3a763a2465bc82f
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      ocp_dtk_entrypoint
    Args:
      nv-ctr-run-with-dtk
    State:          Running
      Started:      Sat, 11 Nov 2023 13:54:32 +0900
    Ready:          True
    Restart Count:  0
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
      OPENSHIFT_VERSION:        4.12
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
  nvidia-peermem-ctr:
    Container ID:  cri-o://b35eb3fd675b9c5125fd486bd44fe0fdb20eceb7d9a6c14d8511a9d738cb7db0
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 14 Nov 2023 09:11:17 +0900
      Finished:     Tue, 14 Nov 2023 09:11:17 +0900
    Ready:          False
    Restart Count:  793
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
  openshift-driver-toolkit-ctr:
    Container ID:  cri-o://d61b7fd1fb0e713c16464b2db64713cc2dca8a6b047f86f46501ee1317f9f41e
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -xc
    Args:
      until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo  Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
    State:          Running
      Started:      Sat, 11 Nov 2023 13:55:04 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      RHCOS_VERSION:           412.86.202306132230-0
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tsfd2 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  shared-nvidia-driver-toolkit:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-tsfd2:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
                             nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  ----     ------   ----                       ----     -------
  Warning  BackOff  3m23s (x20021 over 2d19h)  kubelet  Back-off restarting failed container


Name:                 nvidia-driver-daemonset-412.86.202306132230-0-jnt77
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 gdr-test-6p2kc-gdr-gpu-il-worker-3-5f7m6/10.241.128.26
Start Time:           Sat, 11 Nov 2023 13:46:21 +0900
Labels:               app=nvidia-driver-daemonset-412.86.202306132230-0
                      app.kubernetes.io/component=nvidia-driver
                      controller-revision-hash=56f9b89d7c
                      nvidia.com/precompiled=false
                      openshift.driver-toolkit=true
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.129.4.15/23"],"mac_address":"0a:58:0a:81:04:0f","gateway_ips":["10.129.4.1"],"ip_address":"10.129.4.15/23"...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.4.15"
                            ],
                            "mac": "0a:58:0a:81:04:0f",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.4.15"
                            ],
                            "mac": "0a:58:0a:81:04:0f",
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                      openshift.io/scc: nvidia-driver
Status:               Running
IP:                   10.129.4.15
IPs:
  IP:           10.129.4.15
Controlled By:  DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
  mofed-validation:
    Container ID:  cri-o://839bada30f72f1352910ddbadd423f696df70d6261612209b6f66747ab3dc0e2
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:47:08 +0900
      Finished:     Sat, 11 Nov 2023 13:52:49 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
  k8s-driver-manager:
    Container ID:  cri-o://d38add89b79b81ed16884186c202b97544da02b38f2788da6bbf996e284da0f7
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 11 Nov 2023 13:53:05 +0900
      Finished:     Sat, 11 Nov 2023 13:53:38 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://cbbcb472eba95611199c2099390911ea2cbd1715b2e8ee44a165fb2b3dffd1dc
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      ocp_dtk_entrypoint
    Args:
      nv-ctr-run-with-dtk
    State:          Running
      Started:      Sat, 11 Nov 2023 13:53:49 +0900
    Ready:          True
    Restart Count:  0
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
      OPENSHIFT_VERSION:        4.12
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
  nvidia-peermem-ctr:
    Container ID:  cri-o://0b03f994e5e99b3b8ac54b3cea06a4df1ce783fc5fb1f41ec31f18933862dd72
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 14 Nov 2023 09:12:42 +0900
      Finished:     Tue, 14 Nov 2023 09:12:42 +0900
    Ready:          False
    Restart Count:  793
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
  openshift-driver-toolkit-ctr:
    Container ID:  cri-o://2d2dcf8bdf1ac22860bb69253caf06106436db8ff151411664f5b113d7cfda02
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -xc
    Args:
      until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo  Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
    State:          Running
      Started:      Sat, 11 Nov 2023 13:54:08 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      RHCOS_VERSION:           412.86.202306132230-0
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-gjfks (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  shared-nvidia-driver-toolkit:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-gjfks:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
                             nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                        From     Message
  ----     ------   ----                       ----     -------
  Normal   Pulled   43m (x786 over 2d19h)      kubelet  Container image "nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad" already present on machine
  Warning  BackOff  3m15s (x20037 over 2d19h)  kubelet  Back-off restarting failed container


Name:                 nvidia-driver-daemonset-412.86.202306132230-0-w9mv5
Namespace:            nvidia-gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      nvidia-driver
Node:                 gdr-test-6p2kc-gdr-gpu-il-worker-3-rjkr9/10.241.128.30
Start Time:           Tue, 14 Nov 2023 05:19:12 +0900
Labels:               app=nvidia-driver-daemonset-412.86.202306132230-0
                      app.kubernetes.io/component=nvidia-driver
                      controller-revision-hash=56f9b89d7c
                      nvidia.com/precompiled=false
                      openshift.driver-toolkit=true
                      pod-template-generation=1
Annotations:          k8s.ovn.org/pod-networks:
                        {"default":{"ip_addresses":["10.129.6.14/23"],"mac_address":"0a:58:0a:81:06:0e","gateway_ips":["10.129.6.1"],"ip_address":"10.129.6.14/23"...
                      k8s.v1.cni.cncf.io/network-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.6.14"
                            ],
                            "mac": "0a:58:0a:81:06:0e",
                            "default": true,
                            "dns": {}
                        }]
                      k8s.v1.cni.cncf.io/networks-status:
                        [{
                            "name": "ovn-kubernetes",
                            "interface": "eth0",
                            "ips": [
                                "10.129.6.14"
                            ],
                            "mac": "0a:58:0a:81:06:0e",
                            "default": true,
                            "dns": {}
                        }]
                      kubectl.kubernetes.io/default-container: nvidia-driver-ctr
                      openshift.io/scc: nvidia-driver
Status:               Running
IP:                   10.129.6.14
IPs:
  IP:           10.129.6.14
Controlled By:  DaemonSet/nvidia-driver-daemonset-412.86.202306132230-0
Init Containers:
  mofed-validation:
    Container ID:  cri-o://a2a13f79f84105dd86aae8c33a52c1ba28d94fc65bf0734ec8744751de7a4577
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:c3fc8ab2d39d970e3d1a1b0ef16b06792d23cc87be68ed4927c7384ddd1f43cb
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:47a658fa7102d99a5dd9fe05f2a5b872deab266138e7955a14ba59e33095738d
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Nov 2023 05:20:01 +0900
      Finished:     Tue, 14 Nov 2023 05:25:51 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      WITH_WAIT:                true
      COMPONENT:                mofed
      NODE_NAME:                 (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:   void
      GPU_DIRECT_RDMA_ENABLED:  true
    Mounts:
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
  k8s-driver-manager:
    Container ID:  cri-o://91a1bdb7341f0b22bdd25696ee1f7e009513b245b0e434c169b278d7fb3df675
    Image:         nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:a360ed5b1335436ef61cd601fa776e6d03f15f76aeaa8d88bd1506edd93843dc
    Image ID:      nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:          <none>
    Host Port:     <none>
    Command:
      driver-manager
    Args:
      uninstall_driver
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Nov 2023 05:26:06 +0900
      Finished:     Tue, 14 Nov 2023 05:26:38 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      NODE_NAME:                    (v1:spec.nodeName)
      NVIDIA_VISIBLE_DEVICES:      void
      ENABLE_GPU_POD_EVICTION:     true
      ENABLE_AUTO_DRAIN:           true
      DRAIN_USE_FORCE:             false
      DRAIN_POD_SELECTOR_LABEL:    
      DRAIN_TIMEOUT_SECONDS:       0s
      DRAIN_DELETE_EMPTYDIR_DATA:  false
      OPERATOR_NAMESPACE:          nvidia-gpu-operator (v1:metadata.namespace)
    Mounts:
      /host from host-root (ro)
      /run/nvidia from run-nvidia (rw)
      /sys from host-sys (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
Containers:
  nvidia-driver-ctr:
    Container ID:  cri-o://afa3c3302c3632b1dbe012c4cbd98c72bf427798731dfbb7de96a3e6f834dde2
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      ocp_dtk_entrypoint
    Args:
      nv-ctr-run-with-dtk
    State:          Running
      Started:      Tue, 14 Nov 2023 05:26:50 +0900
    Ready:          True
    Restart Count:  0
    Startup:        exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
      GPU_DIRECT_RDMA_ENABLED:  true
      OPENSHIFT_VERSION:        4.12
    Mounts:
      /dev/log from dev-log (rw)
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /run/nvidia from run-nvidia (rw)
      /run/nvidia-topologyd from run-nvidia-topologyd (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
  nvidia-peermem-ctr:
    Container ID:  cri-o://9840e4da6a6c0ade2d89706174fc4cb653a80f4224c66ece44baae4dd5675521
    Image:         nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad
    Image ID:      nvcr.io/nvidia/driver@sha256:00d2137e198eeb72dd972494e2a651e1f67556fcb1f5a93650868f5b2115de8d
    Port:          <none>
    Host Port:     <none>
    Command:
      nvidia-driver
    Args:
      reload_nvidia_peermem
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 14 Nov 2023 09:12:50 +0900
      Finished:     Tue, 14 Nov 2023 09:12:50 +0900
    Ready:          False
    Restart Count:  49
    Liveness:       exec [sh -c nvidia-driver probe_nvidia_peermem] delay=30s timeout=10s period=30s #success=1 #failure=1
    Startup:        exec [sh -c nvidia-driver probe_nvidia_peermem] delay=10s timeout=10s period=10s #success=1 #failure=120
    Environment:    <none>
    Mounts:
      /dev/log from dev-log (ro)
      /run/mellanox/drivers from run-mellanox-drivers (rw)
      /run/nvidia from run-nvidia (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
  openshift-driver-toolkit-ctr:
    Container ID:  cri-o://ebff88db5258e776aa02d8176fee4c780a311686fb0cf3d8b7c5f93e4e4edb70
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:76662154f549f1edde1b61aeebee11b5e23ea3c4809551532c2edcd6ad1993db
    Port:          <none>
    Host Port:     <none>
    Command:
      bash
      -xc
    Args:
      until [ -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]; do echo  Waiting for nvidia-driver-ctr container to prepare the shared directory ...; sleep 10; done; exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
    State:          Running
      Started:      Tue, 14 Nov 2023 05:27:20 +0900
    Ready:          True
    Restart Count:  0
    Environment:
      RHCOS_VERSION:           412.86.202306132230-0
      NVIDIA_VISIBLE_DEVICES:  void
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /mnt/shared-nvidia-driver-toolkit from shared-nvidia-driver-toolkit (rw)
      /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
      /var/log from var-log (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-tcd85 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  run-nvidia:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  var-log:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  dev-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:  
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  run-nvidia-topologyd:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia-topologyd
    HostPathType:  DirectoryOrCreate
  mlnx-ofed-usr-src:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers/usr/src
    HostPathType:  DirectoryOrCreate
  run-mellanox-drivers:
    Type:          HostPath (bare host directory volume)
    Path:          /run/mellanox/drivers
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  host-root:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:  
  host-sys:
    Type:          HostPath (bare host directory volume)
    Path:          /sys
    HostPathType:  Directory
  shared-nvidia-driver-toolkit:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  kube-api-access-tcd85:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=412.86.202306132230-0
                             nvidia.com/gpu.deploy.driver=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulled   155m (x20 over 3h48m)   kubelet  Container image "nvcr.io/nvidia/driver@sha256:7c2df95df9ed4d16ad3b3c84079ccdd161f3639527ac1d90b106217f9f0a3aad" already present on machine
  Warning  BackOff  43s (x1136 over 3h47m)  kubelet  Back-off restarting failed container

In the output above, the nvidia-peermem-ctr container should have had a mount for /host-etc/os-release, but unfortunately it did not.
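
For reference, a hypothetical sketch of the missing mount, mirroring what nvidia-driver-ctr and openshift-driver-toolkit-ctr already declare in the same pod (the host-os-release hostPath volume is already defined in the pod spec, so only a container-level volumeMount entry like this would be needed):

volumeMounts:
- name: host-os-release
  mountPath: /host-etc/os-release
  readOnly: true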

shivamerla (Contributor) commented

@takeshi-yoshimura we are aware of this issue and are fixing it as part of the v23.9.1 release later this month. As a workaround, you can edit the nvidia-driver-daemonset and add an env var RHEL_VERSION="" to the nvidia-peermem-ctr container.
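
For example, one way to apply that edit non-interactively (a sketch only: the daemonset name and namespace are taken from the output above, and the operator may revert manual edits to the daemonset when it reconciles):

oc set env daemonset/nvidia-driver-daemonset-412.86.202306132230-0 \
  -n nvidia-gpu-operator -c nvidia-peermem-ctr RHEL_VERSION=""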

takeshi-yoshimura (Author) commented

Sounds good. Thanks!
