In GPU operator v23.6.1 driver Images are missing for rhel8.4 and rhel8.5 #595

Open
moditanisha22 opened this issue Oct 8, 2023 · 2 comments

@moditanisha22

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL 8.4 / RHEL 8.5
  • Kernel Version: 4.18.0-305.el8.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: 23.6.1

2. Issue or feature description

I tried to install GPU Operator v23.6.1 on a Kubernetes cluster whose nodes run RHEL 8.4. By default, the operator tries to pull nvcr.io/nvidia/driver:535.104.05-rhel8.4, and the pull fails because that image is not present in the registry.

3. Steps to reproduce the issue

Install a vanilla Kubernetes cluster on nodes running RHEL 8.4.
Deploy GPU Operator v23.6.1 using Helm.

4. Information to attach (optional if deemed irrelevant)

Normal Started 7m29s kubelet Started container k8s-driver-manager
Warning Failed 5m59s (x4 over 6m56s) kubelet Error: ImagePullBackOff
Normal Pulling 5m46s (x4 over 7m23s) kubelet Pulling image "nvcr.io/nvidia/driver:535.104.05-rhel8.4"
Warning Failed 5m45s (x4 over 7m22s) kubelet Failed to pull image "nvcr.io/nvidia/driver:535.104.05-rhel8.4": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.104.05-rhel8.4": failed to resolve reference "nvcr.io/nvidia/driver:535.104.05-rhel8.4": nvcr.io/nvidia/driver:535.104.05-rhel8.4: not found
Warning Failed 5m45s (x4 over 7m22s) kubelet Error: ErrImagePull
Normal Scheduled 5m33s default-scheduler Successfully assigned gpu-operator/nvidia-driver-daemonset-mgm2f to xyz.net
Normal BackOff 2m55s (x15 over 6m56s) kubelet Back-off pulling image "nvcr.io/nvidia/driver:535.104.05-rhel8.4"

@shivamerla
Contributor

@moditanisha22 We tested only RHEL 8.6/8.7/8.8, so only those tags were published. On RHEL 8.4, you can work around this by specifying an image digest instead of a version, i.e. by setting --set driver.version=sha256:3382e254056f28831767bc6729bc2594353a5ff2a28fe9f2d94396c597bb23d8. That value is the manifest list digest of the rhel8.6 image, as the output below shows:

$docker regctl manifest get nvcr.io/nvidia/driver:535.104.05-rhel8.6
Name:        nvcr.io/nvidia/driver:535.104.05-rhel8.6
MediaType:   application/vnd.docker.distribution.manifest.list.v2+json
Digest:      sha256:3382e254056f28831767bc6729bc2594353a5ff2a28fe9f2d94396c597bb23d8
             
Manifests:   
             
  Name:      nvcr.io/nvidia/driver:535.104.05-rhel8.6@sha256:a3939fa9a518d05b1e77a50ad20a6896d4505ba6350a0a2a61f14d799dcacfe0
  Digest:    sha256:a3939fa9a518d05b1e77a50ad20a6896d4505ba6350a0a2a61f14d799dcacfe0
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  Platform:  linux/amd64
             
  Name:      nvcr.io/nvidia/driver:535.104.05-rhel8.6@sha256:5dca1baf8f58ede49dc64c2f8a97a939dfa2c3841e7e1cbf300bd9751753c1a5
  Digest:    sha256:5dca1baf8f58ede49dc64c2f8a97a939dfa2c3841e7e1cbf300bd9751753c1a5
  MediaType: application/vnd.docker.distribution.manifest.v2+json
  Platform:  linux/arm64
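
To illustrate why a digest sidesteps the missing tag, here is a minimal sketch of how the driver image reference appears to be assembled. It assumes the operator appends "-<os tag>" to a plain version but uses a "sha256:" digest unchanged; this is inferred from the pull errors and the workaround in this thread, not taken from the operator's source. The function name driver_image_ref is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: mimics how the driver image reference seems to be built.
# Assumption (inferred from this thread, not from the operator source):
#   - plain version  -> <repo>/<image>:<version>-<os_tag>
#   - sha256 digest  -> <repo>/<image>@<digest>   (no OS suffix appended)
driver_image_ref() {
  repo=$1
  image=$2
  version=$3
  os_tag=$4
  case "$version" in
    sha256:*) printf '%s/%s@%s\n' "$repo" "$image" "$version" ;;
    *)        printf '%s/%s:%s-%s\n' "$repo" "$image" "$version" "$os_tag" ;;
  esac
}

# Default behaviour on a RHEL 8.4 node: resolves to a tag that was never published.
driver_image_ref nvcr.io/nvidia driver 535.104.05 rhel8.4

# Workaround from this thread: pin the digest of the rhel8.6 manifest list.
driver_image_ref nvcr.io/nvidia driver \
  sha256:3382e254056f28831767bc6729bc2594353a5ff2a28fe9f2d94396c597bb23d8 rhel8.4
```

With Helm, the digest is passed through the same driver.version value mentioned above; the release name, namespace, and chart reference depend on your own deployment. Note that the pinned image was built for RHEL 8.6, so whether it works on an 8.4 kernel is not something this thread confirms.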

@lengrongfu
Contributor

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): RHEL 8.4 / RHEL 8.5
  • Kernel Version: 4.18.0-305.el8.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: 23.6.1

2. Issue or feature description

I tried to install GPU Operator v23.6.1 on a Kubernetes cluster whose nodes run RHEL 8.4. By default, the operator tries to pull nvcr.io/nvidia/driver:535.104.05-rhel8.4, and the pull fails because that image is not present in the registry.

3. Steps to reproduce the issue

Install a vanilla Kubernetes cluster on nodes running RHEL 8.4.
Deploy GPU Operator v23.6.1 using Helm.

4. Information to attach (optional if deemed irrelevant)

Normal Started 7m29s kubelet Started container k8s-driver-manager
Warning Failed 5m59s (x4 over 6m56s) kubelet Error: ImagePullBackOff
Normal Pulling 5m46s (x4 over 7m23s) kubelet Pulling image "nvcr.io/nvidia/driver:535.104.05-rhel8.4"
Warning Failed 5m45s (x4 over 7m22s) kubelet Failed to pull image "nvcr.io/nvidia/driver:535.104.05-rhel8.4": rpc error: code = NotFound desc = failed to pull and unpack image "nvcr.io/nvidia/driver:535.104.05-rhel8.4": failed to resolve reference "nvcr.io/nvidia/driver:535.104.05-rhel8.4": nvcr.io/nvidia/driver:535.104.05-rhel8.4: not found
Warning Failed 5m45s (x4 over 7m22s) kubelet Error: ErrImagePull
Normal Scheduled 5m33s default-scheduler Successfully assigned gpu-operator/nvidia-driver-daemonset-mgm2f to xyz.net
Normal BackOff 2m55s (x15 over 6m56s) kubelet Back-off pulling image "nvcr.io/nvidia/driver:535.104.05-rhel8.4"

You can search the tags page below. The latest driver tag published for RHEL 8.4 is 525.105.17-rhel8.4.
https://catalog.ngc.nvidia.com/orgs/nvidia/containers/driver/tags
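
If you would rather check from a terminal than search the page, the lookup can be sketched as a small filter over a list of tags. This is only a sketch: latest_tag_for_os is a hypothetical helper, it relies on version sorting via sort -V (available in GNU coreutils), and the sample input is illustrative rather than a real registry listing (the 515.0.0 tag in particular is made up).

```shell
#!/bin/sh
# Hypothetical helper: read image tags on stdin (one per line) and print the
# highest-versioned tag that ends in "-<os_tag>".
latest_tag_for_os() {
  os_suffix="-$1"
  # Keep only tags with the exact suffix (plain string compare, so the dots
  # in "rhel8.4" are not treated as regex wildcards), then version-sort.
  awk -v s="$os_suffix" \
    'length($0) >= length(s) && substr($0, length($0) - length(s) + 1) == s' |
    sort -V | tail -n 1
}

# Illustrative input only; 515.0.0-rhel8.4 is a made-up tag for the demo.
printf '%s\n' \
  535.104.05-rhel8.6 \
  515.0.0-rhel8.4 \
  525.105.17-rhel8.4 |
  latest_tag_for_os rhel8.4
```

Against the real registry, the tag list could come from a tool such as regctl (for example its tag ls subcommand) piped into the same filter; the resulting version would then be set as driver.version when deploying.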
