
error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory #591

Open
ppetko opened this issue Sep 28, 2023 · 16 comments


@ppetko

ppetko commented Sep 28, 2023

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Red Hat Enterprise Linux CoreOS release 4.11
  • Kernel Version: Linux 4.18.0-372.46.1.el8_6.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP 4.11
  • GPU Operator Version: 23.6.1 provided by NVIDIA Corporation

2. Issue or feature description

We can't configure vGPUs using the NVIDIA operator following the docs here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/openshift-virtualization.html

3. Steps to reproduce the issue

  1. Install the NVIDIA operator and create a ClusterPolicy with the following parameters for the vGPUs:
sandboxWorkloads.enabled=true
vgpuManager.enabled=true
vgpuManager.repository=<path to private repository>
vgpuManager.image=vgpu-manager
vgpuManager.version=<driver version>
vgpuManager.imagePullSecrets={<name of image pull secret>}

This is our cluster policy

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia

4. Debug info

4.1 When we specify the label nvidia.com/vgpu.config=A100-1-5C on each node
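For reference, the label is applied per node with something like the following (the node name is a placeholder):

oc label node <gpu-node> nvidia.com/vgpu.config=A100-1-5C --overwrite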

oc logs -f nvidia-vgpu-device-manager-69wm6 
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0928 14:49:52.314862       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-09-28T14:49:52Z" level=info msg="Updating to vGPU config: A100-1-5CME"
time="2023-09-28T14:49:52Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-28T14:49:52Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-28T14:49:52Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-28T14:49:52Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-28T14:49:52Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-28T14:49:52Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-28T14:49:52Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-28T14:50:23Z" level=debug msg="Parsing config file..."
time="2023-09-28T14:50:23Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-28T14:50:23Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-28T14:50:23Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-28T14:50:23Z" level=error msg="ERROR: unable to apply config 'A100-1-5CME': exit status 1"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"

4.2 When we don't specify any GPU labels and let the NVIDIA operator handle the selection

oc logs -f nvidia-vgpu-device-manager-hmqjt -c vgpu-manager-validation 
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
oc logs -f nvidia-vgpu-device-manager-q8khn  -c vgpu-manager-validation 
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C
@shivamerla
Contributor

@ppetko can you check the logs of the vgpu-manager pod to make sure it installed successfully?
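
If it helps, assuming the usual container name from this operator version, something like the following should pull those logs (a sketch, the pod name is a placeholder):

oc get pods -n nvidia-gpu-operator | grep vgpu-manager
oc logs <nvidia-vgpu-manager-pod> -n nvidia-gpu-operator -c nvidia-vgpu-manager-ctr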

@ppetko
Author

ppetko commented Oct 2, 2023

Hi @shivamerla,

It looks like it failed.

oc logs -f nvidia-vgpu-device-manager-69wm6
Defaulted container "nvidia-vgpu-device-manager" out of: nvidia-vgpu-device-manager, vgpu-manager-validation (init)
W0928 14:49:52.314862       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2023-09-28T14:49:52Z" level=info msg="Updating to vGPU config: A100-1-5CME"
time="2023-09-28T14:49:52Z" level=info msg="Asserting that the requested configuration is present in the configuration file"
time="2023-09-28T14:49:52Z" level=info msg="Selected vGPU device configuration is valid"
time="2023-09-28T14:49:52Z" level=info msg="Checking if the selected vGPU device configuration is currently applied or not"
time="2023-09-28T14:49:52Z" level=fatal msg="Assertion failure: selected configuration not currently applied"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/gpu.deploy.sandbox-validator' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/gpu.deploy.sandbox-validator=true'"
time="2023-09-28T14:49:52Z" level=info msg="Getting current value of 'nvidia.com/vgpu.config.state' node label"
time="2023-09-28T14:49:52Z" level=info msg="Current value of 'nvidia.com/vgpu.config.state=failed'"
time="2023-09-28T14:49:52Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'pending'"
time="2023-09-28T14:49:52Z" level=info msg="Shutting down all GPU operands in Kubernetes by disabling their component-specific nodeSelector labels"
time="2023-09-28T14:49:52Z" level=info msg="Waiting for sandbox-device-plugin to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for sandbox-validator to shutdown"
time="2023-09-28T14:50:23Z" level=info msg="Applying the selected vGPU device configuration to the node"
time="2023-09-28T14:50:23Z" level=debug msg="Parsing config file..."
time="2023-09-28T14:50:23Z" level=debug msg="Selecting specific vGPU config..."
time="2023-09-28T14:50:23Z" level=debug msg="Checking current vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=info msg="Applying vGPU device configuration..."
time="2023-09-28T14:50:23Z" level=debug msg="Walking VGPUConfig for (devices=all)"
time="2023-09-28T14:50:23Z" level=debug msg="  GPU 0: 0x20B510DE"
time="2023-09-28T14:50:23Z" level=fatal msg="error getting vGPU config: error getting all vGPU devices: unable to read MDEV devices directory: open /sys/bus/mdev/devices: no such file or directory"
time="2023-09-28T14:50:23Z" level=info msg="Changing the 'nvidia.com/vgpu.config.state' node label to 'failed'"
time="2023-09-28T14:50:23Z" level=error msg="ERROR: unable to apply config 'A100-1-5CME': exit status 1"
time="2023-09-28T14:50:23Z" level=info msg="Waiting for change to 'nvidia.com/vgpu.config' label"
^C

@cdesiniotis
Contributor

@ppetko can you get logs from the vgpu-manager pod, not the vgpu-device-manager?

@ppetko
Author

ppetko commented Oct 3, 2023

@cdesiniotis there is no such pod

oc get pods 
NAME                                           READY   STATUS    RESTARTS   AGE
gpu-operator-fbb6ffcc8-gzddt                   1/1     Running   0          6d23h
nvidia-sandbox-device-plugin-daemonset-s5v5b   1/1     Running   0          4d23h
nvidia-sandbox-validator-9tmn8                 1/1     Running   0          4d23h
nvidia-vfio-manager-5j6wq                      1/1     Running   0          4d23h
nvidia-vgpu-device-manager-69wm6               1/1     Running   0          4d23h
nvidia-vgpu-device-manager-w82ds               1/1     Running   0          4d23h

This is the cluster policy I'm using

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    enabled: true
    serviceMonitor:
      enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: licensing-config
      nlsEnabled: true
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: vm-vgpu
    enabled: true
  gds:
    enabled: false
  vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia

@cdesiniotis
Contributor

Is the vGPU Manager already installed on the host (e.g. does running nvidia-smi on the host return anything)?

Can you also describe your GPU nodes? In particular, I am interested in the value of the node label nvidia.com/gpu.deploy.vgpu-manager.

@ppetko
Author

ppetko commented Oct 3, 2023

According to the docs, the vGPU Manager should be deployed by the NVIDIA operator. I built a container image for the vGPU Manager and referenced it in the ClusterPolicy CR.

oc describe node gpu4 | grep vgpu-manager
                    nvidia.com/gpu.deploy.vgpu-manager=true

These are all of the nvidia labels

oc describe node gpu4 | grep nvidia.com
                    nvidia.com/gpu.deploy.cc-manager=true
                    nvidia.com/gpu.deploy.nvsm=
                    nvidia.com/gpu.deploy.sandbox-device-plugin=paused-for-vgpu-change
                    nvidia.com/gpu.deploy.sandbox-validator=paused-for-vgpu-change
                    nvidia.com/gpu.deploy.vgpu-device-manager=true
                    nvidia.com/gpu.deploy.vgpu-manager=true
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.workload.config=vm-vgpu
                    nvidia.com/mig.config=all-disabled
                    nvidia.com/mig.config.state=success
                    nvidia.com/vgpu.config=A100-2-10C
                    **nvidia.com/vgpu.config.state=failed**
  nvidia.com/A100:                 0
  nvidia.com/gpu:                  0
  nvidia.com/A100:                 0
  nvidia.com/gpu:                  0
  nvidia.com/A100                 1             1
  nvidia.com/gpu                  0             0

@shivamerla
Contributor

Can you run oc get ds -n nvidia-gpu-operator and describe the vgpu-manager daemonset?

@ppetko
Author

ppetko commented Oct 3, 2023

oc get ds -n nvidia-gpu-operator 
NAME                                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
gpu-feature-discovery                           0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      90s
nvidia-container-toolkit-daemonset              0         0         0       0            0           nvidia.com/gpu.deploy.container-toolkit=true                                                                          90s
nvidia-dcgm                                     0         0         0       0            0           nvidia.com/gpu.deploy.dcgm=true                                                                                       90s
nvidia-dcgm-exporter                            0         0         0       0            0           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                              90s
nvidia-device-plugin-daemonset                  0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true                                                                              90s
nvidia-driver-daemonset-411.86.202303060052-0   0         0         0       0            0           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202303060052-0,nvidia.com/gpu.deploy.driver=true   90s
nvidia-mig-manager                              0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                                90s
nvidia-node-status-exporter                     0         0         0       0            0           nvidia.com/gpu.deploy.node-status-exporter=true                                                                       90s
nvidia-operator-validator                       0         0         0       0            0           nvidia.com/gpu.deploy.operator-validator=true                                                                         90s
nvidia-sandbox-device-plugin-daemonset          1         1         1       1            1           nvidia.com/gpu.deploy.sandbox-device-plugin=true                                                                      90s
nvidia-sandbox-validator                        1         1         1       1            1           nvidia.com/gpu.deploy.sandbox-validator=true                                                                          90s
nvidia-vfio-manager                             1         1         1       1            1           nvidia.com/gpu.deploy.vfio-manager=true                                                                               90s
nvidia-vgpu-device-manager                      2         2         2       2            2           nvidia.com/gpu.deploy.vgpu-device-manager=true                                                                        90s

It looks like I don't have the daemonset for the vgpu-manager, which explains why I don't see any pods. I have specified the label nvidia.com/vgpu.config=A100-2-10C, though I'm not sure it's the correct one. If I leave it blank, I get the following:

oc logs -f nvidia-vgpu-device-manager-hmqjt -c vgpu-manager-validation 
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
waiting for NVIDIA vGPU Manager to be setup
^C

I have opened a case on the NVIDIA forums, but it hasn't gotten much traction: https://forums.developer.nvidia.com/t/rror-getting-vgpu-config-error-getting-all-vgpu-devices-unable-to-read-mdev-devices-directory-open-sys-bus-mdev-devices-no-such-file-or-directory/267696

This is the output of all resources in the namespace

oc get all 
NAME                                               READY   STATUS    RESTARTS   AGE
pod/gpu-operator-fbb6ffcc8-gzddt                   1/1     Running   0          7d2h
pod/nvidia-sandbox-device-plugin-daemonset-62rbg   1/1     Running   0          6m29s
pod/nvidia-sandbox-validator-s9zsr                 1/1     Running   0          6m29s
pod/nvidia-vfio-manager-wjx99                      1/1     Running   0          7m5s
pod/nvidia-vgpu-device-manager-g2xsd               1/1     Running   0          7m5s
pod/nvidia-vgpu-device-manager-tzpcf               1/1     Running   0          7m5s

NAME                                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/gpu-operator                  ClusterIP   172.30.214.74   <none>        8080/TCP   7m5s
service/nvidia-dcgm-exporter          ClusterIP   172.30.37.127   <none>        9400/TCP   7m5s
service/nvidia-node-status-exporter   ClusterIP   172.30.62.146   <none>        8000/TCP   7m5s

NAME                                                           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                                                                                         AGE
daemonset.apps/gpu-feature-discovery                           0         0         0       0            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true                                                                      7m5s
daemonset.apps/nvidia-container-toolkit-daemonset              0         0         0       0            0           nvidia.com/gpu.deploy.container-toolkit=true                                                                          7m5s
daemonset.apps/nvidia-dcgm                                     0         0         0       0            0           nvidia.com/gpu.deploy.dcgm=true                                                                                       7m5s
daemonset.apps/nvidia-dcgm-exporter                            0         0         0       0            0           nvidia.com/gpu.deploy.dcgm-exporter=true                                                                              7m5s
daemonset.apps/nvidia-device-plugin-daemonset                  0         0         0       0            0           nvidia.com/gpu.deploy.device-plugin=true                                                                              7m5s
daemonset.apps/nvidia-driver-daemonset-411.86.202303060052-0   0         0         0       0            0           feature.node.kubernetes.io/system-os_release.OSTREE_VERSION=411.86.202303060052-0,nvidia.com/gpu.deploy.driver=true   7m5s
daemonset.apps/nvidia-mig-manager                              0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true                                                                                7m5s
daemonset.apps/nvidia-node-status-exporter                     0         0         0       0            0           nvidia.com/gpu.deploy.node-status-exporter=true                                                                       7m5s
daemonset.apps/nvidia-operator-validator                       0         0         0       0            0           nvidia.com/gpu.deploy.operator-validator=true                                                                         7m5s
daemonset.apps/nvidia-sandbox-device-plugin-daemonset          1         1         1       1            1           nvidia.com/gpu.deploy.sandbox-device-plugin=true                                                                      7m5s
daemonset.apps/nvidia-sandbox-validator                        1         1         1       1            1           nvidia.com/gpu.deploy.sandbox-validator=true                                                                          7m5s
daemonset.apps/nvidia-vfio-manager                             1         1         1       1            1           nvidia.com/gpu.deploy.vfio-manager=true                                                                               7m5s
daemonset.apps/nvidia-vgpu-device-manager                      2         2         2       2            2           nvidia.com/gpu.deploy.vgpu-device-manager=true                                                                        7m5s

NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/gpu-operator   1/1     1            1           7d2h

NAME                                     DESIRED   CURRENT   READY   AGE
replicaset.apps/gpu-operator-fbb6ffcc8   1         1         1       7d2h

@shivamerla
Contributor

This doesn't seem right. If the node is labelled nvidia.com/gpu.workload.config=vm-vgpu, then we deploy both "vgpu-manager" and "vgpu-device-manager". Here we see vfio-manager getting deployed, which happens only when the workload config is vm-passthrough. If you can share the operator logs, we can check why the right operands are not getting deployed.
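
For reference, the per-node workload type is driven by that label, and the operator logs should show why operands were skipped (a sketch, the node name is a placeholder):

oc label node <gpu-node> nvidia.com/gpu.workload.config=vm-vgpu --overwrite
oc logs deployment/gpu-operator -n nvidia-gpu-operator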

@shivamerla
Contributor

shivamerla commented Oct 3, 2023

Ah, the section below is wrong.

 vgpuManager:
    driverManager:
      image: vgpu-manager
      repository: default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing
      version: 535.104.06-rhcos4.11
    enabled: true

This should be

vgpuManager:
  enabled: true
  repository: "default-route-openshift-image-registry.apps.ocp4.poc.site/pp-testing"
  image: vgpu-manager
  version: "535.104.06-rhcos4.11"
  imagePullPolicy: IfNotPresent
  imagePullSecrets: []
  env: []
  resources: {}
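
On a live cluster, the quickest fix is probably to edit the CR and move those fields up one level (a sketch):

oc edit clusterpolicy gpu-cluster-policy
# move image/repository/version from spec.vgpuManager.driverManager up to spec.vgpuManager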

@ppetko
Author

ppetko commented Oct 3, 2023

Hm, interesting - this YAML was generated by the ClusterPolicy install through the UI.

Look at the logs below... Let me redeploy with the correct YAML file.


{"level":"error","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Failed to apply transformation","Daemonset":"nvidia-vgpu-manager-daemonset","resource":"nvidia-vgpu-manager-daemonset","error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"info","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Could not pre-process","DaemonSet":"nvidia-vgpu-manager-daemonset","Namespace":"nvidia-gpu-operator","Error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"62d09b2d-b745-4df4-bf74-dda2fd3c7cf2","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 411.86.202303060052-0: failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Failed to apply transformation","Daemonset":"nvidia-vgpu-manager-daemonset","resource":"nvidia-vgpu-manager-daemonset","error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"info","ts":"2023-10-03T18:10:01Z","logger":"controllers.ClusterPolicy","msg":"Could not pre-process","DaemonSet":"nvidia-vgpu-manager-daemonset","Namespace":"nvidia-gpu-operator","Error":"failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}
{"level":"error","ts":"2023-10-03T18:10:01Z","msg":"Reconciler error","controller":"clusterpolicy-controller","object":{"name":"gpu-cluster-policy"},"namespace":"","name":"gpu-cluster-policy","reconcileID":"62d09b2d-b745-4df4-bf74-dda2fd3c7cf2","error":"failed to handle OpenShift Driver Toolkit Daemonset for version 411.86.202303060052-0: failed to transform vGPU Manager container: Empty image path provided through both ClusterPolicy CR and ENV VGPU_MANAGER_IMAGE"}

@ppetko
Author

ppetko commented Oct 3, 2023

A little heads-up in the docs would be nice that once you deploy the ClusterPolicy, the operator will roll the cluster and restart each node. I see two new machine configs were applied and the cluster is trying to update. The problem is that it's stuck on a node that doesn't have a GPU. I have already loaded the kernel parameters for the GPUs using a machine config that targets only the nodes that contain a GPU.

What exactly are the machine configs trying to configure? Are there any docs on this process?

The kernel modules are already loaded

oc debug node/gpu1  -- chroot /host lspci -nnk -d 10de:  
Starting pod/gpu1ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
0000:31:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102GL [RTX A5000] [10de:2231] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:147e]
	Kernel driver in use: nvidia
	Kernel modules: nouveau

oc debug node/gpu3 -- chroot /host lspci -nnk -d 10de:  
Starting pod/gpu3ocp4pocsite-debug ...
To use host binaries, run `chroot /host`
1b:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:1533]
	Kernel driver in use: nvidia
	Kernel modules: nouveau
1c:00.0 3D controller [0302]: NVIDIA Corporation GA100 [A100 PCIe 80GB] [10de:20b5] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:1533]
	Kernel driver in use: nvidia
	Kernel modules: nouveau


Output of the mcp

oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-bd4920bc82fa2273f8e79e3c851cba39   True      False      False      3              3                   3                     0                      194d
worker   rendered-worker-d242c91395c7a350afeeaab80b133966   False     True       True       5              3                   3                     1                      194d

On the bright side, I think the deployment is fixed. I confirmed that the ClusterPolicy UI in version 23.6.1 provided by NVIDIA Corporation generates an incorrect ClusterPolicy.

oc get pods 
NAME                                                        READY   STATUS    RESTARTS   AGE
gpu-operator-fbb6ffcc8-qdd5g                                1/1     Running   0          48m
nvidia-sandbox-device-plugin-daemonset-66nzb                1/1     Running   0          40m
nvidia-sandbox-validator-v76vz                              1/1     Running   0          40m
nvidia-vfio-manager-lxqzb                                   1/1     Running   0          41m
nvidia-vgpu-device-manager-44sxj                            1/1     Running   0          14m
nvidia-vgpu-device-manager-qpdlx                            1/1     Running   0          14m
nvidia-vgpu-manager-daemonset-411.86.202303060052-0-k52dq   2/2     Running   0          14m
nvidia-vgpu-manager-daemonset-411.86.202303060052-0-pfxgp   2/2     Running   0          14m
oc logs -f nvidia-vgpu-manager-daemonset-411.86.202303060052-0-k52dq 
Defaulted container "nvidia-vgpu-manager-ctr" out of: nvidia-vgpu-manager-ctr, openshift-driver-toolkit-ctr, k8s-driver-manager (init)
+ [[ '' == \t\r\u\e ]]
+ [[ ! -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ]]
+ cp -r /usr/local/bin/ocp_dtk_entrypoint /usr/local/bin/nvidia-driver /driver /mnt/shared-nvidia-driver-toolkit/
+ env
+ sed 's/=/="/'
+ sed 's/$/"/'
+ touch /mnt/shared-nvidia-driver-toolkit/dir_prepared
+ set +x
Tue Oct  3 19:35:25 UTC 2023 Waiting for openshift-driver-toolkit-ctr container to start ...
Tue Oct  3 19:35:40 UTC 2023 openshift-driver-toolkit-ctr started.
+ sleep infinity
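
With the vGPU Manager pods up, the mdev bus should now exist on the GPU nodes; a quick sanity check would be something like (node and pod names are placeholders):

oc debug node/<gpu-node> -- chroot /host ls /sys/bus/mdev/devices
oc exec <nvidia-vgpu-manager-pod> -c nvidia-vgpu-manager-ctr -- nvidia-smi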

@shivamerla
Contributor

@ppetko AFAIK, we don't update MachineConfig at all from our code. What is the actual change that is being applied through MachineConfig? Maybe some other operator (OSV?) triggered that?

@ppetko
Author

ppetko commented Oct 4, 2023

From what I can see, as soon as we applied the correct ClusterPolicy CR, two new machine configs were created. But the configurations don't look related to the GPUs, so I'm not sure what caused this.

oc get mc 
NAME                                                    GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                               624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
00-worker                                               624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
01-master-container-runtime                             624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
01-master-kubelet                                       624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
01-worker-container-runtime                             624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
01-worker-kubelet                                       624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
100-worker-iommu                                                                                   3.2.0             194d
100-worker-vfiopci                                                                                 3.2.0             194d
50-masters-chrony-configuration                                                                    3.1.0             195d
50-workers-chrony-configuration                                                                    3.1.0             195d
99-assisted-installer-master-ssh                                                                   3.1.0             195d
99-master-generated-crio-add-inheritable-capabilities                                              3.2.0             195d
99-master-generated-registries                          624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
99-master-ssh                                                                                      3.2.0             195d
99-worker-generated-crio-add-inheritable-capabilities                                              3.2.0             195d
99-worker-generated-registries                          624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
99-worker-ssh                                                                                      3.2.0             195d
rendered-master-4601510310247f17c4b2ee3ada9ca54f        624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
rendered-master-bd4920bc82fa2273f8e79e3c851cba39        624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             19h
rendered-worker-06a98033c5f02d42ff75208c7b1db70c        624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             19h
rendered-worker-20a3cea1f4b3d262015faf2610a652e1        624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             195d
rendered-worker-a915ba541d8df2a6741b2f8507ea3928        624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             194d
rendered-worker-d242c91395c7a350afeeaab80b133966        624a49edf1d0eeca83d70c58faae25516fa25e20   3.2.0             194d
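
To see what actually changed, the new and old rendered worker configs from the listing above can be dumped and diffed (a sketch):

oc get mc rendered-worker-06a98033c5f02d42ff75208c7b1db70c -o yaml > new-worker.yaml
oc get mc rendered-worker-d242c91395c7a350afeeaab80b133966 -o yaml > old-worker.yaml
diff old-worker.yaml new-worker.yaml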

Now the worker machine config pool is in degraded state.

oc get mcp 
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-bd4920bc82fa2273f8e79e3c851cba39   True      False      False      3              3                   3                     0                      195d
worker   rendered-worker-d242c91395c7a350afeeaab80b133966   False     True       True       5              3                   3                     1                      195d

I will create a smaller cluster with GPU nodes only and then I will attempt the installation again. Thank you.

@shivamerla
Contributor

@fabiendupont any idea why the machineconfig got updated in this case?

@fabiendupont
Contributor

I don't see an obvious reason. It could be that the MachineConfigPool node selector uses labels created by the NVIDIA GPU Operator.

@ppetko, can you describe the MachineConfigPool?
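
For reference, something like:

oc describe mcp worker
oc get mcp worker -o yaml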
