
gpu-operator install fails with driver pod errors 'Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms'' #616

Open

aneesh786 opened this issue Nov 23, 2023 · 1 comment

@aneesh786
1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHEL8.6
  • Kernel Version: 4.18.0-372.9.1.el8.x86_64
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): cri-o://1.26.4
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s 1.27.1
  • GPU Operator Version: 23.9.x

2. Issue or feature description

I am trying to install the GPU Operator using Helm. During the install, the driver pod (nvidia-driver-daemonset-fwcvl) fails with the error below. I have omitted the initial portion of the pod logs and included only the error lines.
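For reference, a typical GPU Operator install via Helm looks roughly like the sketch below. The exact command and values used here are not included in the report, so treat this as an assumption; the release name `gpu-operator-1700756391` seen in the pod list does suggest `--generate-name` was used.

```sh
# Add the NVIDIA Helm repository and install the GPU Operator into the gpu-operator namespace.
# Flags and chart values are assumed defaults, not taken from the original report.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```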

```
+ '[' '' '!=' builtin ']'
Updating the package cache...
+ echo 'Updating the package cache...'
+ yum -q makecache
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
FATAL: failed to reach RHEL package repositories. Ensure that the cluster can access the proper networks.
+ echo 'FATAL: failed to reach RHEL package repositories. ' 'Ensure that the cluster can access the proper networks.'
```
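The fatal message indicates the driver container could not reach the RHEL package repositories from this node. A quick sanity check from the worker node itself could look like the following; these commands are a suggested check assuming the default Red Hat CDN, not part of the original report, so adjust the host if a Satellite server or local mirror is in use.

```sh
# On the worker node (e.g. lab-worker-4): confirm the host's own subscription status and
# repo metadata refresh work, since the driver container needs access to the same RHEL repos.
sudo subscription-manager status
sudo yum -q makecache

# Basic reachability check toward the Red Hat CDN; a proxy or firewall problem shows up here.
curl -sSI https://cdn.redhat.com | head -n 1
```

If the host can refresh the same repos while the pod cannot, pod egress rules, network policies, or proxy configuration for the cluster are the likely culprits.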
  • kubernetes pods status: kubectl get pods -n gpu-operator

```
NAME                                                              READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-zqm9h                                       0/1     Init:0/1           0                86m
gpu-operator-1700756391-node-feature-discovery-gc-5c546559bfmj2   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-master-79796bzcb   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-6ddld       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-8c2k4       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-nzd7b       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-x8nx9       1/1     Running            0                93m
gpu-operator-68d85f45d-v97fz                                      1/1     Running            0                93m
nvidia-container-toolkit-daemonset-kqmtx                          0/1     Init:0/1           0                86m
nvidia-dcgm-exporter-5ncg7                                        0/1     Init:0/1           0                86m
nvidia-device-plugin-daemonset-qmvhc                              0/1     Init:0/1           0                86m
nvidia-driver-daemonset-fwcvl                                     0/1     CrashLoopBackOff   19 (3m20s ago)   87m
nvidia-operator-validator-vcztn                                   0/1     Init:0/4           0                86m
```

  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE

```
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                                   1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   94m
gpu-operator-1700756391-node-feature-discovery-worker   4         4         4       4            4                                                              94m
nvidia-container-toolkit-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       94m
nvidia-dcgm-exporter                                    1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           94m
nvidia-device-plugin-daemonset                          1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           94m
nvidia-driver-daemonset                                 1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  94m
nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             94m
nvidia-operator-validator                               1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      94m
```

  • If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

```
k describe po nvidia-driver-daemonset-fwcvl
    Name: nvidia-driver-daemonset-fwcvl
    Namespace: gpu-operator
    Priority: 2000001000
    Priority Class Name: system-node-critical
    Service Account: nvidia-driver
    Node: lab-worker-4/172.21.1.70
    Start Time: Thu, 23 Nov 2023 11:26:21 -0500
    Labels: app=nvidia-driver-daemonset
    app.kubernetes.io/component=nvidia-driver
    app.kubernetes.io/managed-by=gpu-operator
    controller-revision-hash=5954d75477
    helm.sh/chart=gpu-operator-v23.9.0
    nvidia.com/precompiled=false
    pod-template-generation=1
    Annotations: cni.projectcalico.org/containerID: 14eb92fe162f5d1ddcf0d32343f0815ae1325dfca8eb88354d979f7cbc335c5d
    cni.projectcalico.org/podIP: 192.168.148.114/32
    cni.projectcalico.org/podIPs: 192.168.148.114/32
    kubectl.kubernetes.io/default-container: nvidia-driver-ctr
    Status: Running
    IP: 192.168.148.114
    IPs:
    IP: 192.168.148.114
    Controlled By: DaemonSet/nvidia-driver-daemonset
    Init Containers:
    k8s-driver-manager:
    Container ID: cri-o://b15e393c5603042c1938c49f132a706332ba76bb21dab6ea2d50a0fe2a0cf3b3
    Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.4
    Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:
    Host Port:
    Command:
    driver-manager
    Args:
    uninstall_driver
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Thu, 23 Nov 2023 11:26:22 -0500
    Finished: Thu, 23 Nov 2023 11:26:54 -0500
    Ready: True
    Restart Count: 0
    Environment:
    NODE_NAME: (v1:spec.nodeName)
    NVIDIA_VISIBLE_DEVICES: void
    ENABLE_GPU_POD_EVICTION: true
    ENABLE_AUTO_DRAIN: false
    DRAIN_USE_FORCE: false
    DRAIN_POD_SELECTOR_LABEL:
    DRAIN_TIMEOUT_SECONDS: 0s
    DRAIN_DELETE_EMPTYDIR_DATA: false
    OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
    Mounts:
    /host from host-root (ro)
    /run/nvidia from run-nvidia (rw)
    /sys from host-sys (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
    Containers:
    nvidia-driver-ctr:
    Container ID: cri-o://8139fed89018b0c4382884f44dfa1f7146711824baf3029b9b8b416e4e91c9f5
    Image: nvcr.io/nvidia/driver:525.125.06-rhel8.6
    Image ID: nvcr.io/nvidia/driver@sha256:b58167d31d34784cd7c425961234d67c5e2d22eb4a5312681d0337dae812f746
    Port:
    Host Port:
    Command:
    nvidia-driver
    Args:
    init
    State: Waiting
    Reason: CrashLoopBackOff
    Last State: Terminated
    Reason: Error
    Exit Code: 1
    Started: Thu, 23 Nov 2023 12:49:50 -0500
    Finished: Thu, 23 Nov 2023 12:50:24 -0500
    Ready: False
    Restart Count: 19
    Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
    Mounts:
    /dev/log from dev-log (rw)
    /host-etc/os-release from host-os-release (ro)
    /run/mellanox/drivers from run-mellanox-drivers (rw)
    /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
    /run/nvidia from run-nvidia (rw)
    /run/nvidia-topologyd from run-nvidia-topologyd (rw)
    /var/log from var-log (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
    Conditions:
    Type Status
    Initialized True
    Ready False
    ContainersReady False
    PodScheduled True
    Volumes:
    run-nvidia:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia
    HostPathType: DirectoryOrCreate
    var-log:
    Type: HostPath (bare host directory volume)
    Path: /var/log
    HostPathType:
    dev-log:
    Type: HostPath (bare host directory volume)
    Path: /dev/log
    HostPathType:
    host-os-release:
    Type: HostPath (bare host directory volume)
    Path: /etc/os-release
    HostPathType:
    run-nvidia-topologyd:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia-topologyd
    HostPathType: DirectoryOrCreate
    mlnx-ofed-usr-src:
    Type: HostPath (bare host directory volume)
    Path: /run/mellanox/drivers/usr/src
    HostPathType: DirectoryOrCreate
    run-mellanox-drivers:
    Type: HostPath (bare host directory volume)
    Path: /run/mellanox/drivers
    HostPathType: DirectoryOrCreate
    run-nvidia-validations:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia/validations
    HostPathType: DirectoryOrCreate
    host-root:
    Type: HostPath (bare host directory volume)
    Path: /
    HostPathType:
    host-sys:
    Type: HostPath (bare host directory volume)
    Path: /sys
    HostPathType: Directory
    kube-api-access-qphz2:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI: true
    QoS Class: BestEffort
    Node-Selectors: nvidia.com/gpu.deploy.driver=true
    Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
    node.kubernetes.io/memory-pressure:NoSchedule op=Exists
    node.kubernetes.io/not-ready:NoExecute op=Exists
    node.kubernetes.io/pid-pressure:NoSchedule op=Exists
    node.kubernetes.io/unreachable:NoExecute op=Exists
    node.kubernetes.io/unschedulable:NoSchedule op=Exists
    nvidia.com/gpu:NoSchedule op=Exists
    Events:
    Type     Reason   Age                    From     Message
    ----     ------   ----                   ----     -------
    Warning  BackOff  3m53s (x350 over 87m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)
```

Any help with this issue would be very much appreciated.

@aneesh786 (Author)
Help!!!
