
InvalidImageName when specifying the driver version in the spec.driver.version property #585

Open
koflerm opened this issue Sep 23, 2023 · 2 comments

Comments

koflerm commented Sep 23, 2023

1. Quick Debug Information

  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): OCP 4.11.43
  • GPU Operator Version: 23.6.1

2. Issue or feature description

When specifying a driver version in the ClusterPolicy object via the spec.driver.version property, the driver-daemonset pods crash with the error "InvalidImageName". The cause is the image reference used for the driver container, which ends up in the form "image: '/:525.60.13-rhcos4.11'": the registry, repository, and image name are missing from the reference. The image value should instead be "nvcr.io/nvidia/driver:525.60.13-rhcos4.11".

3. Steps to reproduce the issue

  • Set the spec.driver.version property of the ClusterPolicy CR.

4. Information to attach (optional if deemed irrelevant)

  • [x] ClusterPolicy YAML
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: nvidia-dcgm-exporter-custom-config
    enabled: true
    serviceMonitor:
      enabled: true
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    version: 525.60.13
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
    env:
      - name: DEVICE_LIST_STRATEGY
        value: volume-mounts
  mig:
    strategy: mixed
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    env:
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_ENVVAR_WHEN_UNPRIVILEGED
        value: 'false'
      - name: ACCEPT_NVIDIA_VISIBLE_DEVICES_AS_VOLUME_MOUNTS
        value: 'true'
    installDir: /usr/local/nvidia
status:
  namespace: nvidia-gpu-operator
  state: notReady

koflerm commented Oct 18, 2023

@shivamerla are you able to help here?

@shivamerla
Contributor

@koflerm Please also specify driver.repository=nvcr.io/nvidia and driver.image=driver. All three fields (repository, image, and version) are required in the ClusterPolicy when overriding the driver version.
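
For example, the driver section with all three fields set might look like the following (a sketch that combines the repository and image values from this comment with the version from the attached ClusterPolicy):

spec:
  driver:
    enabled: true
    repository: nvcr.io/nvidia   # registry/org portion of the image reference
    image: driver                # image name
    version: 525.60.13           # driver version; the operator appends the OS tag

With all three fields present, the operator can assemble a complete image reference such as nvcr.io/nvidia/driver:525.60.13-rhcos4.11 instead of the malformed "/:525.60.13-rhcos4.11".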
