GPU Operator on disconnected Openshift 4.15 #1165

Open
bdf-catalyse opened this issue Dec 11, 2024 · 0 comments

Hello,
We installed GPU Operator v24.9.0, provided through the Red Hat Marketplace OLM catalog, in our disconnected OpenShift 4.15.28 cluster.
nvidia-driver-ctr reports a complete installation and a quick vector-add application terminated correctly, so I don't think this issue is blocking.
However, we have a few questions regarding nvidia-driver-ctr:

Default UBI Repositories / UBI Base images

The nvidia-driver-ctr container uses ubi8 repositories, while the openshift-driver-toolkit-ctr container uses ubi9 repositories. This is due to their underlying base image versions.

$  oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-2gcjc  -- cat /etc/redhat-release
Red Hat Enterprise Linux release 8.10 (Ootpa)

[openshift@bdf-build8-bastion entitlement]$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp  -- cat /etc/yum.repos.d/ubi.repo
[ubi-8-baseos-rpms]
name = Red Hat Universal Base Image 8 (RPMs) - BaseOS
baseurl = https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi8/8/$basearch/baseos/os
enabled = 1
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
gpgcheck = 1
....

$  oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-2gcjc  -- cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[openshift@bdf-build8-bastion entitlement]$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp  -- cat /etc/yum.repos.d/ubi.repo
[ubi-9-baseos]
name = Red Hat Universal Base Image 9 (RPMs) - BaseOS
baseurl = https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/$basearch/baseos/os
enabled = 1
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
gpgcheck = 1
... 
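
To compare both containers in one pass, the two release checks above can be wrapped in a small helper. This is only a sketch: `rhel_major` and `compare_ctr_releases` are hypothetical helper names, not part of the GPU Operator, and the container names are the ones used in the commands above.

```shell
# Hypothetical helpers; not shipped with the GPU Operator.

# Extract the major version from an /etc/redhat-release line,
# e.g. "Red Hat Enterprise Linux release 8.10 (Ootpa)" gives 8.
rhel_major() {
  printf '%s\n' "$1" | sed -n 's/.* release \([0-9][0-9]*\)\..*/\1/p'
}

# Print the RHEL major version seen inside each container of a driver pod.
# $1 = namespace, $2 = pod name
# The body runs in a subshell so its variables stay local.
compare_ctr_releases() (
  ns=$1
  pod=$2
  for ctr in nvidia-driver-ctr openshift-driver-toolkit-ctr; do
    release=$(oc -n "$ns" exec -c "$ctr" "$pod" -- cat /etc/redhat-release)
    printf '%s: RHEL %s\n' "$ctr" "$(rhel_major "$release")"
  done
)

# Usage (requires a logged-in oc session):
#   compare_ctr_releases ocp-nvidia-gpu-operator <driver-pod-name>
```

With the release strings shown above, this would report RHEL 8 for nvidia-driver-ctr and RHEL 9 for openshift-driver-toolkit-ctr.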

Our questions here:

  • Could this difference in UBI repositories break compatibility for the kernel modules?
  • Will a new version move to UBI 9 soon?

Entitlement Key Mounting

As we are using Satellite to provide repositories in our disconnected environment, we need to entitle pods to access the different repositories.

$ oc get cm -n ocp-nvidia-gpu-operator yum-repos -o yaml
apiVersion: v1
data:
  redhat.repo: |
    [rhel-9-for-x86_64-baseos-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
    baseurl = https://<9.2 baseos repo>
    enabled = 1
    gpgcheck = 0
    repo_gpgcheck = 0
    sslverify = 0
    module_hotfixes=true
    sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
    sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem


    [rhel-9-for-x86_64-appstream-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - AppStream (RPMs)
    baseurl = https://<9.2 appstream repo>
    enabled = 1
    gpgcheck = 0
    sslverify = 0
    repo_gpgcheck = 0
    module_hotfixes=true
    sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
    sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem

kind: ConfigMap
metadata:
  name: yum-repos
  namespace: ocp-nvidia-gpu-operator


$ oc get clusterpolicies.nvidia.com gpu-cluster-policy -o json | jq '.spec.driver.repoConfig'

{
  "configMapName": "yum-repos"
}
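
For anyone reproducing this setup, a configuration like the one above could be applied roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the repo definitions are assumed to be saved in a local file that is stored under the `redhat.repo` key, matching the ConfigMap shown above.

```shell
# Hypothetical helpers; names and argument order are illustrative only.

# Create (or update) the repo ConfigMap from a local .repo file.
# $1 = namespace, $2 = ConfigMap name, $3 = path to the local .repo file
apply_repo_configmap() {
  oc create configmap "$2" -n "$1" --from-file=redhat.repo="$3" \
    --dry-run=client -o yaml | oc apply -f -
}

# Point the driver at that ConfigMap through spec.driver.repoConfig.
# $1 = ClusterPolicy name, $2 = ConfigMap name
set_driver_repo_config() {
  oc patch clusterpolicies.nvidia.com "$1" --type merge \
    -p "{\"spec\":{\"driver\":{\"repoConfig\":{\"configMapName\":\"$2\"}}}}"
}

# Usage (requires a logged-in oc session):
#   apply_repo_configmap ocp-nvidia-gpu-operator yum-repos ./redhat.repo
#   set_driver_repo_config gpu-cluster-policy yum-repos
```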

The relevant ConfigMap (yum-repos) is mounted only to the nvidia-driver-ctr container, not the openshift-driver-toolkit-ctr container.

$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp  -- cat /etc/yum.repos.d/redhat.repo
#
# Certificate-Based Repositories
# Managed by (rhsm) subscription-manager
#
# *** This file is auto-generated.  Changes made here will be over-written. ***
# *** Use "subscription-manager repo-override --help" if you wish to make changes. ***
#
# If this file is empty and this system is subscribed consider
# a "yum repolist" to refresh available repos
#

$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp  -- cat /etc/yum.repos.d/redhat.repo
    [rhel-9-for-x86_64-baseos-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
    baseurl = https://<9.2 baseos repo>
    enabled = 1
    gpgcheck = 0
    repo_gpgcheck = 0
    sslverify = 0
    module_hotfixes=true
    sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
    sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem
.....
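
The same observation can be checked from the pod spec instead of from inside the containers. The sketch below lists the volume mounts of each container; the helper names are hypothetical, and the volume name the operator uses for the repo ConfigMap is not assumed here, so it has to be read from the listing first.

```shell
# Hypothetical helpers; not shipped with the GPU Operator.

# List the volumes mounted by each container of a pod, one container per line,
# in the form "<container>: <volume> <volume> ...".
# $1 = namespace, $2 = pod name
list_ctr_mounts() {
  oc -n "$1" get pod "$2" -o jsonpath='{range .spec.containers[*]}{.name}{":"}{range .volumeMounts[*]}{" "}{.name}{end}{"\n"}{end}'
}

# Succeed only if the named container mounts the named volume (exact match).
# $1 = namespace, $2 = pod name, $3 = container name, $4 = volume name
ctr_mounts_volume() {
  list_ctr_mounts "$1" "$2" | awk -v c="$3:" -v v="$4" '
    $1 == c { for (i = 2; i <= NF; i++) if ($i == v) found = 1 }
    END { exit !found }'
}

# Usage (requires a logged-in oc session):
#   list_ctr_mounts ocp-nvidia-gpu-operator <driver-pod-name>
#   ctr_mounts_volume ocp-nvidia-gpu-operator <driver-pod-name> \
#     openshift-driver-toolkit-ctr <volume-name> || echo "not mounted"
```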

Our questions here:

  • Is there a way to inject the yum-repos ConfigMap into the openshift-driver-toolkit-ctr container as well?
  • If not, should we consider building a custom ImageStream for the GPU Operator that includes a modified driver-toolkit image with the necessary repositories pre-configured? Currently, the Operator injects a sidecar with openshift/istag/driver-toolkit:${RHCOS_VERSION}.

openshift-driver-toolkit-ctr Container Logs

The openshift-driver-toolkit-ctr container logs show attempts to enable additional repositories:

$ oc -n ocp-nvidia-gpu-operator logs -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp  | grep dnf
+ ln -s /usr/bin/true /mnt/shared-nvidia-driver-toolkit/bin/dnf --force
+ dnf install -q -y elfutils-libelf.x86_64 elfutils-libelf-devel.x86_64
+ dnf config-manager --set-enabled rhocp-4.15-for-rhel-8-x86_64-rpms
+ dnf makecache --releasever=9.2
+ dnf config-manager --set-enabled rhel-8-for-x86_64-baseos-eus-rpms
+ dnf makecache --releasever=9.2
+ dnf makecache --releasever=9.2
+ dnf -q -y --releasever=9.2 install kernel-headers-5.14.0-284.79.1.el9_2.x86_64 kernel-devel-5.14.0-284.79.1.el9_2.x86_64
+ dnf -q -y --releasever=9.2 install kernel-core-5.14.0-284.79.1.el9_2.x86_64
+ dnf install -q -y --releasever=9.2 gcc-

Question:

What specific repositories are truly needed for the openshift-driver-toolkit-ctr container to function correctly?
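
To make the question concrete, the repository ids and release versions that the container requests can be pulled out of its logs. This is a sketch with hypothetical helper names; it only parses the `dnf` trace lines in the format shown above.

```shell
# Hypothetical helpers; they read container logs on stdin.

# Print each repository id passed to "dnf config-manager --set-enabled".
enabled_repos() {
  sed -n 's/^+ dnf config-manager --set-enabled \([^ ]*\).*$/\1/p'
}

# Print the distinct --releasever values passed to dnf.
dnf_releasevers() {
  sed -n 's/.*--releasever=\([^ ]*\).*/\1/p' | sort -u
}

# Usage (requires a logged-in oc session):
#   oc -n ocp-nvidia-gpu-operator logs -c openshift-driver-toolkit-ctr <driver-pod-name> | enabled_repos
#   oc -n ocp-nvidia-gpu-operator logs -c openshift-driver-toolkit-ctr <driver-pod-name> | dnf_releasevers
```

On the log excerpt above, the first helper prints the two `rhel-8` repository ids and the second prints `9.2`, which makes it easy to see that RHEL 8 repository ids are being enabled alongside a RHEL 9 release version.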

Need for openshift-driver-toolkit-ctr container

Since nvidia-driver-ctr reports a complete installation despite the problems described above, do we really need the openshift-driver-toolkit-ctr sidecar container?

Thank you in advance for your help!
