1. Quick Debug Information
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): CRI-O
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): OpenShift v4.13.23
- GPU Operator Version: 23.9.1, gpu-operator-certified.v1.11.1
2. Issue or feature description
We have an OpenShift cluster with the NVIDIA GPU Operator installed. When we run any pod on a g5.48xlarge machine, we get the following error:
Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: nvml: Unknown Error, which is unexpected
The same pod works fine on other instance types such as g5.4xlarge and g5.12xlarge. We only started seeing this behaviour recently; previously the same pod ran on g5.48xlarge instances without problems.
We also see that the pod from nvidia-dcgm-exporter is failing with the following errors:
(combined from similar events): Error: container create failed: time="2023-12-13T11:17:30Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: detection error: nvml error: unknown error\n"
Error: container create failed: time="2023-12-13T10:29:12Z" level=error msg="runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli.real: initialization error: driver rpc error: timed out\n"
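As a first check on the affected node, nvidia-smi can be run inside the driver container there (a sketch; the namespace nvidia-gpu-operator is taken from the ClusterPolicy status further below, and DRIVER_POD_NAME is a placeholder):
# list the operator pods together with the node each one runs on
oc get pods -n nvidia-gpu-operator -o wide | grep nvidia-driver
# run nvidia-smi inside the driver container on the g5.48xlarge node
oc exec -n nvidia-gpu-operator DRIVER_POD_NAME -c nvidia-driver-ctr -- nvidia-smi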
3. Steps to reproduce the issue
Schedule any pod on a g5.48xlarge instance: the pod is assigned to the node, but it never runs (a minimal manifest is sketched below).
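A pod of roughly this shape exercises the failing allocation path (a sketch, not our exact workload; the pod name and CUDA image are illustrative placeholders, and the nodeSelector uses the standard instance-type node label to pin it to a g5.48xlarge node):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                    # illustrative name
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: g5.48xlarge   # pin to the affected instance type
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubi8   # illustrative CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1                         # request a single GPU from the device plugin
On g5.4xlarge and g5.12xlarge nodes a pod like this starts normally; on g5.48xlarge it fails with the GetPreferredAllocation error quoted above.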
4. Information to attach (optional if deemed irrelevant)
- kubernetes pods status:
kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container:
kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- Container runtime (CRI-O) logs (concrete commands for this cluster are sketched after this list)
journalctl -u crio > crio.log
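For this cluster the placeholders resolve as follows (a sketch; the namespace nvidia-gpu-operator comes from the ClusterPolicy status below, pod and node names are placeholders, and the runtime is CRI-O rather than containerd):
# operator pods and the nodes they are scheduled on
oc get pods -n nvidia-gpu-operator -o wide
# daemonset status
oc get ds -n nvidia-gpu-operator
# logs from a failing pod, all containers
oc logs -n nvidia-gpu-operator POD_NAME --all-containers
# CRI-O logs from the affected node
oc debug node/NODE_NAME -- chroot /host journalctl -u crio --no-pager > crio.log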
Logs from the nvidia-dcgm-exporter pod:
time="2023-12-13T10:25:04Z" level=info msg="Starting dcgm-exporter"
time="2023-12-13T10:25:04Z" level=info msg="Attemping to connect to remote hostengine at localhost:5555"
time="2023-12-13T10:25:04Z" level=info msg="DCGM successfully initialized!"
time="2023-12-13T10:25:05Z" level=info msg="Collecting DCP Metrics"
time="2023-12-13T10:25:05Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcgm-metrics.csv"
time="2023-12-13T10:25:05Z" level=info msg="Initializing system entities of type: GPU"
time="2023-12-13T10:25:30Z" level=fatal msg="Failed to watch metrics: Error watching fields: The third-party Profiling module returned an unrecoverable error"
Logs from the GPU Feature Discovery pod:
I1213 10:24:42.754239 1 main.go:122] Starting OS watcher.
I1213 10:24:42.754459 1 main.go:127] Loading configuration.
I1213 10:24:42.754781 1 main.go:139] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "single",
    "failOnInitError": true,
    "gdsEnabled": null,
    "mofedEnabled": null,
    "gfd": {
      "oneshot": false,
      "noTimestamp": false,
      "sleepInterval": "1m0s",
      "outputFile": "/etc/kubernetes/node-feature-discovery/features.d/gfd",
      "machineTypeFile": "/sys/class/dmi/id/product_name"
    }
  },
  "resources": {
    "gpus": null
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I1213 10:24:42.755227 1 factory.go:48] Detected NVML platform: found NVML library
I1213 10:24:42.755282 1 factory.go:48] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I1213 10:24:42.755294 1 factory.go:64] Using NVML manager
I1213 10:24:42.755301 1 main.go:144] Start running
I1213 10:24:43.018503 1 main.go:187] Creating Labels
2023/12/13 10:24:43 Writing labels to output file /etc/kubernetes/node-feature-discovery/features.d/gfd
I1213 10:24:43.018687 1 main.go:197] Sleeping for 60000000000
I1213 10:29:12.978389 1 main.go:119] Exiting
E1213 10:29:12.978748 1 main.go:95] error creating NVML labeler: error creating mig capability labeler: error getting mig capability: error getting MIG mode: Unknown Error
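The GFD error above comes specifically from the MIG-mode query through NVML. The same query can be issued with nvidia-smi from the driver container on that node to see whether NVML fails for all calls or only some (a sketch; DRIVER_POD_NAME is a placeholder and the query fields are standard nvidia-smi ones):
oc exec -n nvidia-gpu-operator DRIVER_POD_NAME -c nvidia-driver-ctr -- nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv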
GPU ClusterPolicy:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  creationTimestamp: '2023-10-02T12:22:11Z'
  generation: 1
  managedFields:
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:spec':
          'f:gds':
            .: {}
            'f:enabled': {}
          'f:vgpuManager':
            .: {}
            'f:enabled': {}
          'f:vfioManager':
            .: {}
            'f:enabled': {}
          'f:daemonsets':
            .: {}
            'f:rollingUpdate':
              .: {}
              'f:maxUnavailable': {}
            'f:updateStrategy': {}
          'f:sandboxWorkloads':
            .: {}
            'f:defaultWorkload': {}
            'f:enabled': {}
          'f:nodeStatusExporter':
            .: {}
            'f:enabled': {}
          'f:toolkit':
            .: {}
            'f:enabled': {}
            'f:installDir': {}
          'f:vgpuDeviceManager':
            .: {}
            'f:enabled': {}
          .: {}
          'f:gfd':
            .: {}
            'f:enabled': {}
          'f:migManager':
            .: {}
            'f:enabled': {}
          'f:mig':
            .: {}
            'f:strategy': {}
          'f:operator':
            .: {}
            'f:defaultRuntime': {}
            'f:initContainer': {}
            'f:runtimeClass': {}
            'f:use_ocp_driver_toolkit': {}
          'f:dcgm':
            .: {}
            'f:enabled': {}
          'f:dcgmExporter':
            .: {}
            'f:config':
              .: {}
              'f:name': {}
            'f:enabled': {}
            'f:serviceMonitor':
              .: {}
              'f:enabled': {}
          'f:sandboxDevicePlugin':
            .: {}
            'f:enabled': {}
          'f:driver':
            .: {}
            'f:certConfig':
              .: {}
              'f:name': {}
            'f:enabled': {}
            'f:kernelModuleConfig':
              .: {}
              'f:name': {}
            'f:licensingConfig':
              .: {}
              'f:configMapName': {}
              'f:nlsEnabled': {}
            'f:repoConfig':
              .: {}
              'f:configMapName': {}
            'f:upgradePolicy':
              .: {}
              'f:autoUpgrade': {}
              'f:drain':
                .: {}
                'f:deleteEmptyDir': {}
                'f:enable': {}
                'f:force': {}
                'f:timeoutSeconds': {}
              'f:maxParallelUpgrades': {}
              'f:maxUnavailable': {}
              'f:podDeletion':
                .: {}
                'f:deleteEmptyDir': {}
                'f:force': {}
                'f:timeoutSeconds': {}
              'f:waitForCompletion':
                .: {}
                'f:timeoutSeconds': {}
            'f:virtualTopology':
              .: {}
              'f:config': {}
          'f:devicePlugin':
            .: {}
            'f:config':
              .: {}
              'f:default': {}
              'f:name': {}
            'f:enabled': {}
          'f:validator':
            .: {}
            'f:plugin':
              .: {}
              'f:env': {}
      manager: Mozilla
      operation: Update
      time: '2023-10-02T12:22:11Z'
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          .: {}
          'f:namespace': {}
      manager: Go-http-client
      operation: Update
      subresource: status
      time: '2023-12-11T14:01:59Z'
    - apiVersion: nvidia.com/v1
      fieldsType: FieldsV1
      fieldsV1:
        'f:status':
          'f:conditions': {}
          'f:state': {}
      manager: gpu-operator
      operation: Update
      subresource: status
      time: '2023-12-13T10:29:14Z'
  name: gpu-cluster-policy
  resourceVersion: '1243373036'
  uid: 1e79d1d1-cfc8-493f-bad0-4a94fa0a2da7
spec:
  vgpuDeviceManager:
    enabled: true
  migManager:
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: console-plugin-nvidia-gpu
    enabled: true
    serviceMonitor:
      enabled: true
  driver:
    certConfig:
      name: ''
    enabled: true
    kernelModuleConfig:
      name: ''
    licensingConfig:
      configMapName: ''
      nlsEnabled: false
    repoConfig:
      configMapName: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    virtualTopology:
      config: ''
  devicePlugin:
    config:
      default: ''
      name: ''
    enabled: true
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    enabled: true
    installDir: /usr/local/nvidia
status:
  conditions:
    - lastTransitionTime: '2023-12-13T10:25:37Z'
      message: ''
      reason: Error
      status: 'False'
      type: Ready
    - lastTransitionTime: '2023-12-13T10:25:37Z'
      message: >-
        ClusterPolicy is not ready, states not ready: [state-dcgm-exporter
        gpu-feature-discovery]
      reason: OperandNotReady
      status: 'True'
      type: Error
  namespace: nvidia-gpu-operator
  state: notReady
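The notReady state reported above can be re-checked at any time with (a sketch; ClusterPolicy is cluster-scoped and gpu-cluster-policy is the resource name from the dump above):
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.status.state}{"\n"}'
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'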
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]