Unable to run pod on g5.48xlarge instance, other g5 instances work well #634

Open
@arpitsharma-vw

Description

1. Quick Debug Information

  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): Openshift v4.13.23
  • GPU Operator Version: 23.9.1, gpu-operator-certified.v1.11.1

2. Issue or feature description

We have an OpenShift cluster where we have installed the NVIDIA GPU Operator. When we run any pod on a g5.48xlarge machine, we get the following error:

Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected

The same pod works well on other machines such as g5.4xlarge and g5.12xlarge. We started seeing this behaviour only recently; earlier, the same pod ran fine on a g5.48xlarge instance.
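
Since the allocation error is returned by the device plugin's GetPreferredAllocation call, the device-plugin logs on the affected node are worth checking first. A minimal sketch, assuming the default GPU Operator daemonset names and the nvidia-gpu-operator namespace; the pod name is a placeholder:

# Locate the device-plugin pod scheduled on the g5.48xlarge node
kubectl get pods -n nvidia-gpu-operator -o wide | grep nvidia-device-plugin

# Look for the same "nvml: Unknown Error" in its logs
kubectl logs -n nvidia-gpu-operator <nvidia-device-plugin-pod-on-that-node> --all-containers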

We also see that the pod from nvidia-dcgm-exporter is failing with the following errors:

(combined from similar events): Error: container create failed: time="2023-12-13T11:17:30Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: unknown error\n"
Error: container create failed: time="2023-12-13T10:29:12Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out\n"

3. Steps to reproduce the issue

Assign a pod to a g5.48xlarge node: scheduling works, but the pod fails to run with the allocation error above (a reproducer sketch follows).
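
A minimal reproducer sketch (the image tag and node-selector value are illustrative; any pod requesting nvidia.com/gpu that lands on the g5.48xlarge node hits the same allocation error):

# Schedule a single-GPU pod onto a g5.48xlarge node via the instance-type label
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-repro
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: g5.48xlarge
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # illustrative image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF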

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log
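
For reference, the same commands with the namespace used in this cluster (nvidia-gpu-operator, per the ClusterPolicy status below); pod names are placeholders, and since the runtime here is CRI-O the node-level runtime logs come from the crio unit rather than containerd:

kubectl get pods -n nvidia-gpu-operator
kubectl get ds -n nvidia-gpu-operator
kubectl describe pod -n nvidia-gpu-operator <POD_NAME>
kubectl logs -n nvidia-gpu-operator <POD_NAME> --all-containers
kubectl exec <DRIVER_POD_NAME> -n nvidia-gpu-operator -c nvidia-driver-ctr -- nvidia-smi
journalctl -u crio > crio.log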

Logs from the nvidia-dcgm-exporter pod:

time="2023-12-13T10:25:04Z" level=info msg="Starting dcgm-exporter"
time="2023-12-13T10:25:04Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2023-12-13T10:25:04Z" level=info msg="DCGM successfully initialized!"
time="2023-12-13T10:25:05Z" level=info msg="Collecting DCP Metrics"
time="2023-12-13T10:25:05Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-12-13T10:25:05Z" level=info msg="Initializing system entities of type: GPU"
time="2023-12-13T10:25:30Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"

Logs from GPU feature discovery pod:

I1213 10:24:42.754239       1 main.go:122] Starting OS watcher.
I1213 10:24:42.754459       1 main.go:127] Loading configuration.
I1213 10:24:42.754781       1 main.go:139] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1213 10:24:42.755227       1 factory.go:48] Detected NVML platform: found NVML library
I1213 10:24:42.755282       1 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1213 10:24:42.755294       1 factory.go:64] Using NVML manager
I1213 10:24:42.755301       1 main.go:144] Start running
I1213 10:24:43.018503       1 main.go:187] Creating Labels
2023/12/13 10:24:43 Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I1213 10:24:43.018687       1 main.go:197] Sleeping for 60000000000
I1213 10:29:12.978389       1 main.go:119] Exiting
E1213 10:29:12.978748       1 main.go:95] error creating NVML labeler: error creating mig capability labeler: error getting mig capability: error getting MIG mode: Unknown Error

GPU cluster policy:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2023-10-02T12:22:11Z'
  generation: 1
  managedFields:
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          'f:gds':
            .: {}
            'f:enabled': {}
          'f:vgpuManager':
            .: {}
            'f:enabled': {}
          'f:vfioManager':
            .: {}
            'f:enabled': {}
          'f:daemonsets':
            .: {}
            'f:rollingUpdate':
              .: {}
              'f:maxUnavailable': {}
            'f:updateStrategy': {}
          'f:sandboxWorkloads':
            .: {}
            'f:defaultWorkload': {}
            'f:enabled': {}
          'f:nodeStatusExporter':
            .: {}
            'f:enabled': {}
          'f:toolkit':
            .: {}
            'f:enabled': {}
            'f:installDir': {}
          'f:vgpuDeviceManager':
            .: {}
            'f:enabled': {}
          .: {}
          'f:gfd':
            .: {}
            'f:enabled': {}
          'f:migManager':
            .: {}
            'f:enabled': {}
          'f:mig':
            .: {}
            'f:strategy': {}
          'f:operator':
            .: {}
            'f:defaultRuntime': {}
            'f:initContainer': {}
            'f:runtimeClass': {}
            'f:use_ocp_driver_toolkit': {}
          'f:dcgm':
            .: {}
            'f:enabled': {}
          'f:dcgmExporter':
            .: {}
            'f:config':
              .: {}
              'f:name': {}
            'f:enabled': {}
            'f:serviceMonitor':
              .: {}
              'f:enabled': {}
          'f:sandboxDevicePlugin':
            .: {}
            'f:enabled': {}
          'f:driver':
            .: {}
            'f:certConfig':
              .: {}
              'f:name': {}
            'f:enabled': {}
            'f:kernelModuleConfig':
              .: {}
              'f:name': {}
            'f:licensingConfig':
              .: {}
              'f:configMapName': {}
              'f:nlsEnabled': {}
            'f:repoConfig':
              .: {}
              'f:configMapName': {}
            'f:upgradePolicy':
              .: {}
              'f:autoUpgrade': {}
              'f:drain':
                .: {}
                'f:deleteEmptyDir': {}
                'f:enable': {}
                'f:force': {}
                'f:timeoutSeconds': {}
              'f:maxParallelUpgrades': {}
              'f:maxUnavailable': {}
              'f:podDeletion':
                .: {}
                'f:deleteEmptyDir': {}
                'f:force': {}
                'f:timeoutSeconds': {}
              'f:waitForCompletion':
                .: {}
                'f:timeoutSeconds': {}
            'f:virtualTopology':
              .: {}
              'f:config': {}
          'f:devicePlugin':
            .: {}
            'f:config':
              .: {}
              'f:default': {}
              'f:name': {}
            'f:enabled': {}
          'f:validator':
            .: {}
            'f:plugin':
              .: {}
              'f:env': {}
      manager: Mozilla
      operation: Update
      time: '2023-10-02T12:22:11Z'
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          .: {}
          'f:namespace': {}
      manager: Go-http-client
      operation: Update
      subresource: status
      time: '2023-12-11T14:01:59Z'
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions': {}
          'f:state': {}
      manager: gpu-operator
      operation: Update
      subresource: status
      time: '2023-12-13T10:29:14Z'
  name: gpu-cluster-policy
  resourceVersion: '1243373036'
  uid: 1e79d1d1-cfc8-493f-bad0-4a94fa0a2da7
spec:
  vgpuDeviceManager:
    enabled: true
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: console-plugin-nvidia-gpu
    enabled: true
    serviceMonitor:
      enabled: true
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
status:
  conditions:
    - lastTransitionTime: '2023-12-13T10:25:37Z'
      message: ''
      reason: Error
      status: 'False'
      type: Ready
    - lastTransitionTime: '2023-12-13T10:25:37Z'
      message: >-
        ClusterPolicy is not ready, states not ready: [state-dcgm-exporter
        gpu-feature-discovery]
      reason: OperandNotReady
      status: 'True'
      type: Error
  namespace: nvidia-gpu-operator
  state: notReady
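
Given the status above reports state-dcgm-exporter and gpu-feature-discovery as not ready, the following sketch (namespace and ClusterPolicy name taken from the manifest above) can help confirm whether the failing operand pods are all on the g5.48xlarge node:

# Show operand pods together with the nodes they run on
kubectl get pods -n nvidia-gpu-operator -o wide

# Current ClusterPolicy state as reported by the operator
kubectl get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}'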

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
