gpu-operator install fails with driver pod error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms'
1. Quick Debug Information
2. Issue or feature description
I am trying to install the GPU Operator using Helm. During the install, the driver pod (nvidia-driver-daemonset-fwcvl) fails with the error below. Pod logs (the initial portion is omitted; only the error lines are shown):

Updating the package cache...
yum -q makecache
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
FATAL: failed to reach RHEL package repositories. Ensure that the cluster can access the proper networks.
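Since the failing step is yum -q makecache inside the driver container, a quick sanity check is whether the RHEL repositories are reachable from the node at all. A minimal sketch (assumes SSH access to the affected worker, lab-worker-4, and that it uses subscription-manager for entitlements):

# Run on the affected worker node (lab-worker-4)
subscription-manager status        # is the node registered and entitled?
sudo dnf -q makecache              # can the host itself refresh rhel-8-for-x86_64-appstream-rpms?
curl -vI https://cdn.redhat.com    # basic reachability/proxy check (a 403 without client certs still proves connectivity)

If the host fails the same way, the problem is likely network or entitlement related rather than anything GPU Operator specific; if the host succeeds, the driver container may be missing the entitlement certificates or proxy settings it needs.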
kubernetes pods status:
kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-zqm9h                                       0/1     Init:0/1           0                86m
gpu-operator-1700756391-node-feature-discovery-gc-5c546559bfmj2   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-master-79796bzcb   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-6ddld       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-8c2k4       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-nzd7b       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-x8nx9       1/1     Running            0                93m
gpu-operator-68d85f45d-v97fz                                      1/1     Running            0                93m
nvidia-container-toolkit-daemonset-kqmtx                          0/1     Init:0/1           0                86m
nvidia-dcgm-exporter-5ncg7                                        0/1     Init:0/1           0                86m
nvidia-device-plugin-daemonset-qmvhc                              0/1     Init:0/1           0                86m
nvidia-driver-daemonset-fwcvl                                     0/1     CrashLoopBackOff   19 (3m20s ago)   87m
nvidia-operator-validator-vcztn                                   0/1     Init:0/4           0                86m
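The remaining pods stuck in Init:0/1 (and the operator-validator in Init:0/4) will stay that way until the driver container becomes ready, so the driver container logs are the ones to focus on. A sketch for pulling them, including the previous crashed attempt and the k8s-driver-manager init container:

kubectl logs -n gpu-operator nvidia-driver-daemonset-fwcvl -c nvidia-driver-ctr --previous
kubectl logs -n gpu-operator nvidia-driver-daemonset-fwcvl -c k8s-driver-manager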
kubernetes daemonset status:
kubectl get ds -n OPERATOR_NAMESPACE
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                                   1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   94m
gpu-operator-1700756391-node-feature-discovery-worker   4         4         4       4            4           <none>                                             94m
nvidia-container-toolkit-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       94m
nvidia-dcgm-exporter                                    1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           94m
nvidia-device-plugin-daemonset                          1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           94m
nvidia-driver-daemonset                                 1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  94m
nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             94m
nvidia-operator-validator                               1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      94m
If a pod/ds is in an error state or pending state:
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
k describe po nvidia-driver-daemonset-fwcvl
Name: nvidia-driver-daemonset-fwcvl
Namespace: gpu-operator
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: nvidia-driver
Node: lab-worker-4/172.21.1.70
Start Time: Thu, 23 Nov 2023 11:26:21 -0500
Labels: app=nvidia-driver-daemonset
app.kubernetes.io/component=nvidia-driver
app.kubernetes.io/managed-by=gpu-operator
controller-revision-hash=5954d75477
helm.sh/chart=gpu-operator-v23.9.0
nvidia.com/precompiled=false
pod-template-generation=1
Annotations: cni.projectcalico.org/containerID: 14eb92fe162f5d1ddcf0d32343f0815ae1325dfca8eb88354d979f7cbc335c5d
cni.projectcalico.org/podIP: 192.168.148.114/32
cni.projectcalico.org/podIPs: 192.168.148.114/32
kubectl.kubernetes.io/default-container: nvidia-driver-ctr
Status: Running
IP: 192.168.148.114
IPs:
IP: 192.168.148.114
Controlled By: DaemonSet/nvidia-driver-daemonset
Init Containers:
k8s-driver-manager:
Container ID: cri-o://b15e393c5603042c1938c49f132a706332ba76bb21dab6ea2d50a0fe2a0cf3b3
Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.4
Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
Port: <none>
Host Port: <none>
Command:
driver-manager
Args:
uninstall_driver
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 23 Nov 2023 11:26:22 -0500
Finished: Thu, 23 Nov 2023 11:26:54 -0500
Ready: True
Restart Count: 0
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_VISIBLE_DEVICES: void
ENABLE_GPU_POD_EVICTION: true
ENABLE_AUTO_DRAIN: false
DRAIN_USE_FORCE: false
DRAIN_POD_SELECTOR_LABEL:
DRAIN_TIMEOUT_SECONDS: 0s
DRAIN_DELETE_EMPTYDIR_DATA: false
OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
Mounts:
/host from host-root (ro)
/run/nvidia from run-nvidia (rw)
/sys from host-sys (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Containers:
nvidia-driver-ctr:
Container ID: cri-o://8139fed89018b0c4382884f44dfa1f7146711824baf3029b9b8b416e4e91c9f5
Image: nvcr.io/nvidia/driver:525.125.06-rhel8.6
Image ID: nvcr.io/nvidia/driver@sha256:b58167d31d34784cd7c425961234d67c5e2d22eb4a5312681d0337dae812f746
Port: <none>
Host Port: <none>
Command:
nvidia-driver
Args:
init
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 23 Nov 2023 12:49:50 -0500
Finished: Thu, 23 Nov 2023 12:50:24 -0500
Ready: False
Restart Count: 19
Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
Environment: <none>
Mounts:
/dev/log from dev-log (rw)
/host-etc/os-release from host-os-release (ro)
/run/mellanox/drivers from run-mellanox-drivers (rw)
/run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
/run/nvidia from run-nvidia (rw)
/run/nvidia-topologyd from run-nvidia-topologyd (rw)
/var/log from var-log (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
run-nvidia:
Type: HostPath (bare host directory volume)
Path: /run/nvidia
HostPathType: DirectoryOrCreate
var-log:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
dev-log:
Type: HostPath (bare host directory volume)
Path: /dev/log
HostPathType:
host-os-release:
Type: HostPath (bare host directory volume)
Path: /etc/os-release
HostPathType:
run-nvidia-topologyd:
Type: HostPath (bare host directory volume)
Path: /run/nvidia-topologyd
HostPathType: DirectoryOrCreate
mlnx-ofed-usr-src:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers/usr/src
HostPathType: DirectoryOrCreate
run-mellanox-drivers:
Type: HostPath (bare host directory volume)
Path: /run/mellanox/drivers
HostPathType: DirectoryOrCreate
run-nvidia-validations:
Type: HostPath (bare host directory volume)
Path: /run/nvidia/validations
HostPathType: DirectoryOrCreate
host-root:
Type: HostPath (bare host directory volume)
Path: /
HostPathType:
host-sys:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
kube-api-access-qphz2:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: nvidia.com/gpu.deploy.driver=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
nvidia.com/gpu:NoSchedule op=Exists
Events:
Type     Reason   Age                    From     Message
----     ------   ----                   ----     -------
Warning  BackOff  3m53s (x350 over 87m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)
Any help on this issue would be very much appreciated.
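For reference, when worker nodes can only reach an internal mirror or Satellite rather than the default Red Hat CDN, the chart can mount a custom repository definition into the driver container. A minimal sketch, assuming a hypothetical internal mirror URL (mirror.example.com is a placeholder) and the driver.repoConfig.configMapName value described in the GPU Operator documentation; check the values.yaml shipped with gpu-operator-v23.9.0 for the exact key:

# custom-repo.repo -- hypothetical internal mirror for the failing repo
[rhel-8-for-x86_64-appstream-rpms]
name=RHEL 8 AppStream (internal mirror)
baseurl=https://mirror.example.com/rhel8/x86_64/appstream
enabled=1
gpgcheck=0

# Create the ConfigMap and point the driver container at it
kubectl create configmap repo-config -n gpu-operator --from-file=custom-repo.repo
helm upgrade --install gpu-operator nvidia/gpu-operator -n gpu-operator \
    --set driver.repoConfig.configMapName=repo-config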