
gpu-operator install fails with driver pod errors 'Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms'' #616

Open

aneesh786 opened this issue Nov 23, 2023 · 1 comment

@aneesh786
1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): RHEL8.6
  • Kernel Version: 4.18.0-372.9.1.el8.x86_64
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): cri-o://1.26.4
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s 1.27.1
  • GPU Operator Version: 23.9.x

2. Issue or feature description

I am trying to install the GPU Operator using Helm. During the install, the driver pod (nvidia-driver-daemonset-fwcvl) fails with the error below. I have omitted the initial portion of the pod logs and included only the error lines.
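For reference, a typical GPU Operator install via Helm looks roughly like the sketch below. The exact command and values used here are not included in the report, so treat this as an assumption; the release name `gpu-operator-1700756391` seen in the pod list does suggest `--generate-name` was used.

```sh
# Add the NVIDIA Helm repository and install the GPU Operator into the gpu-operator namespace.
# Flags and chart values are assumed defaults, not taken from the original report.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
```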

```
+ '[' '' '!=' builtin ']'
Updating the package cache...
+ echo 'Updating the package cache...'
+ yum -q makecache
Error: Failed to download metadata for repo 'rhel-8-for-x86_64-appstream-rpms': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
FATAL: failed to reach RHEL package repositories. Ensure that the cluster can access the proper networks.
+ echo 'FATAL: failed to reach RHEL package repositories. ' 'Ensure that the cluster can access the proper networks.'
```
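The fatal message indicates the driver container could not reach the RHEL package repositories from this node. A quick sanity check from the worker node itself could look like the following; these commands are a suggested check assuming the default Red Hat CDN, not part of the original report, so adjust the host if a Satellite server or local mirror is in use.

```sh
# On the worker node (e.g. lab-worker-4): confirm the host's own subscription status and
# repo metadata refresh work, since the driver container needs access to the same RHEL repos.
sudo subscription-manager status
sudo yum -q makecache

# Basic reachability check toward the Red Hat CDN; a proxy or firewall problem shows up here.
curl -sSI https://cdn.redhat.com | head -n 1
```

If the host can refresh the same repos while the pod cannot, pod egress rules, network policies, or proxy configuration for the cluster are the likely culprits.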
  • kubernetes pods status: kubectl get pods -n gpu-operator

```
NAME                                                              READY   STATUS             RESTARTS         AGE
gpu-feature-discovery-zqm9h                                       0/1     Init:0/1           0                86m
gpu-operator-1700756391-node-feature-discovery-gc-5c546559bfmj2   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-master-79796bzcb   1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-6ddld       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-8c2k4       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-nzd7b       1/1     Running            0                93m
gpu-operator-1700756391-node-feature-discovery-worker-x8nx9       1/1     Running            0                93m
gpu-operator-68d85f45d-v97fz                                      1/1     Running            0                93m
nvidia-container-toolkit-daemonset-kqmtx                          0/1     Init:0/1           0                86m
nvidia-dcgm-exporter-5ncg7                                        0/1     Init:0/1           0                86m
nvidia-device-plugin-daemonset-qmvhc                              0/1     Init:0/1           0                86m
nvidia-driver-daemonset-fwcvl                                     0/1     CrashLoopBackOff   19 (3m20s ago)   87m
nvidia-operator-validator-vcztn                                   0/1     Init:0/4           0                86m
```

  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE

```
NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
gpu-feature-discovery                                   1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   94m
gpu-operator-1700756391-node-feature-discovery-worker   4         4         4       4            4                                                              94m
nvidia-container-toolkit-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       94m
nvidia-dcgm-exporter                                    1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           94m
nvidia-device-plugin-daemonset                          1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           94m
nvidia-driver-daemonset                                 1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  94m
nvidia-mig-manager                                      0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             94m
nvidia-operator-validator                               1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      94m
```

  • If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME

```
k describe po nvidia-driver-daemonset-fwcvl
    Name: nvidia-driver-daemonset-fwcvl
    Namespace: gpu-operator
    Priority: 2000001000
    Priority Class Name: system-node-critical
    Service Account: nvidia-driver
    Node: lab-worker-4/172.21.1.70
    Start Time: Thu, 23 Nov 2023 11:26:21 -0500
    Labels: app=nvidia-driver-daemonset
    app.kubernetes.io/component=nvidia-driver
    app.kubernetes.io/managed-by=gpu-operator
    controller-revision-hash=5954d75477
    helm.sh/chart=gpu-operator-v23.9.0
    nvidia.com/precompiled=false
    pod-template-generation=1
    Annotations: cni.projectcalico.org/containerID: 14eb92fe162f5d1ddcf0d32343f0815ae1325dfca8eb88354d979f7cbc335c5d
    cni.projectcalico.org/podIP: 192.168.148.114/32
    cni.projectcalico.org/podIPs: 192.168.148.114/32
    kubectl.kubernetes.io/default-container: nvidia-driver-ctr
    Status: Running
    IP: 192.168.148.114
    IPs:
    IP: 192.168.148.114
    Controlled By: DaemonSet/nvidia-driver-daemonset
    Init Containers:
    k8s-driver-manager:
    Container ID: cri-o://b15e393c5603042c1938c49f132a706332ba76bb21dab6ea2d50a0fe2a0cf3b3
    Image: nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.6.4
    Image ID: nvcr.io/nvidia/cloud-native/k8s-driver-manager@sha256:5ca81f4f7e55f7b304dbbb7aaa235fca2656789145e4b34f47a7ab7079704dc7
    Port:
    Host Port:
    Command:
    driver-manager
    Args:
    uninstall_driver
    State: Terminated
    Reason: Completed
    Exit Code: 0
    Started: Thu, 23 Nov 2023 11:26:22 -0500
    Finished: Thu, 23 Nov 2023 11:26:54 -0500
    Ready: True
    Restart Count: 0
    Environment:
    NODE_NAME: (v1:spec.nodeName)
    NVIDIA_VISIBLE_DEVICES: void
    ENABLE_GPU_POD_EVICTION: true
    ENABLE_AUTO_DRAIN: false
    DRAIN_USE_FORCE: false
    DRAIN_POD_SELECTOR_LABEL:
    DRAIN_TIMEOUT_SECONDS: 0s
    DRAIN_DELETE_EMPTYDIR_DATA: false
    OPERATOR_NAMESPACE: gpu-operator (v1:metadata.namespace)
    Mounts:
    /host from host-root (ro)
    /run/nvidia from run-nvidia (rw)
    /sys from host-sys (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
    Containers:
    nvidia-driver-ctr:
    Container ID: cri-o://8139fed89018b0c4382884f44dfa1f7146711824baf3029b9b8b416e4e91c9f5
    Image: nvcr.io/nvidia/driver:525.125.06-rhel8.6
    Image ID: nvcr.io/nvidia/driver@sha256:b58167d31d34784cd7c425961234d67c5e2d22eb4a5312681d0337dae812f746
    Port:
    Host Port:
    Command:
    nvidia-driver
    Args:
    init
    State: Waiting
    Reason: CrashLoopBackOff
    Last State: Terminated
    Reason: Error
    Exit Code: 1
    Started: Thu, 23 Nov 2023 12:49:50 -0500
    Finished: Thu, 23 Nov 2023 12:50:24 -0500
    Ready: False
    Restart Count: 19
    Startup: exec [sh -c nvidia-smi && touch /run/nvidia/validations/.driver-ctr-ready] delay=60s timeout=60s period=10s #success=1 #failure=120
    Environment:
    Mounts:
    /dev/log from dev-log (rw)
    /host-etc/os-release from host-os-release (ro)
    /run/mellanox/drivers from run-mellanox-drivers (rw)
    /run/mellanox/drivers/usr/src from mlnx-ofed-usr-src (rw)
    /run/nvidia from run-nvidia (rw)
    /run/nvidia-topologyd from run-nvidia-topologyd (rw)
    /var/log from var-log (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-qphz2 (ro)
    Conditions:
    Type Status
    Initialized True
    Ready False
    ContainersReady False
    PodScheduled True
    Volumes:
    run-nvidia:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia
    HostPathType: DirectoryOrCreate
    var-log:
    Type: HostPath (bare host directory volume)
    Path: /var/log
    HostPathType:
    dev-log:
    Type: HostPath (bare host directory volume)
    Path: /dev/log
    HostPathType:
    host-os-release:
    Type: HostPath (bare host directory volume)
    Path: /etc/os-release
    HostPathType:
    run-nvidia-topologyd:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia-topologyd
    HostPathType: DirectoryOrCreate
    mlnx-ofed-usr-src:
    Type: HostPath (bare host directory volume)
    Path: /run/mellanox/drivers/usr/src
    HostPathType: DirectoryOrCreate
    run-mellanox-drivers:
    Type: HostPath (bare host directory volume)
    Path: /run/mellanox/drivers
    HostPathType: DirectoryOrCreate
    run-nvidia-validations:
    Type: HostPath (bare host directory volume)
    Path: /run/nvidia/validations
    HostPathType: DirectoryOrCreate
    host-root:
    Type: HostPath (bare host directory volume)
    Path: /
    HostPathType:
    host-sys:
    Type: HostPath (bare host directory volume)
    Path: /sys
    HostPathType: Directory
    kube-api-access-qphz2:
    Type: Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds: 3607
    ConfigMapName: kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI: true
    QoS Class: BestEffort
    Node-Selectors: nvidia.com/gpu.deploy.driver=true
    Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
    node.kubernetes.io/memory-pressure:NoSchedule op=Exists
    node.kubernetes.io/not-ready:NoExecute op=Exists
    node.kubernetes.io/pid-pressure:NoSchedule op=Exists
    node.kubernetes.io/unreachable:NoExecute op=Exists
    node.kubernetes.io/unschedulable:NoSchedule op=Exists
    nvidia.com/gpu:NoSchedule op=Exists
    Events:
    Type     Reason   Age                    From     Message
    ----     ------   ----                   ----     -------
    Warning  BackOff  3m53s (x350 over 87m)  kubelet  Back-off restarting failed container nvidia-driver-ctr in pod nvidia-driver-daemonset-fwcvl_gpu-operator(1ab5bc39-dd70-411f-9592-a6b5b69ff723)
```

Any help with this issue would be very much appreciated.

@aneesh786 (Author)
Help!!!
