
InvalidImageName when specifying the driver version in the spec.driver.version property #585

Open
koflerm opened this issue Sep 23, 2023 · 2 comments

Comments

koflerm commented Sep 23, 2023

1. Quick Debug Information

  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP 4.11.43
  • GPU Operator Version: 23.6.1

2. Issue or feature description

When specifying a driver version in the ClusterPolicy object via the spec.driver.version property, the driver-daemonset pods crash with the error "InvalidImageName". The cause is the image reference used for the driver container, which ends up in the form "image: '/:525.60.13-rhcos4.11'": the registry, repository, and image name are missing from the reference. The image value should instead be "nvcr.io/nvidia/driver:525.60.13-rhcos4.11".

3. Steps to reproduce the issue

  • Set the spec.driver.version property of the ClusterPolicy CR.

4. Information to attach (optional if deemed irrelevant)

  • [x] ClusterPolicy YAML
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: nvidia-dcgm-exporter-custom-config
    enabled: true
    serviceMonitor:
      enabled: true
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    version: 525.60.13
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
  mig:
    strategy: mixed
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    env:
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: 'false'
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: 'true'
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: notReady

koflerm commented Oct 18, 2023

@shivamerla are you able to help here?

@shivamerla
Contributor

@koflerm Please also specify driver.repository=nvcr.io/nvidia and driver.image=driver. All three fields (repository, image, and version) are required in the ClusterPolicy when overriding the driver version.
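
For example, the driver section with all three fields set might look like the following (a sketch that combines the repository and image values from this comment with the version from the attached ClusterPolicy):

spec:
  driver:
    enabled: true
    repository: nvcr.io/nvidia   # registry/org portion of the image reference
    image: driver                # image name
    version: 525.60.13           # driver version; the operator appends the OS tag

With all three fields present, the operator can assemble a complete image reference such as nvcr.io/nvidia/driver:525.60.13-rhcos4.11 instead of the malformed "/:525.60.13-rhcos4.11".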
