Hello,
we installed GPU Operator v24.9.0, provided through the Red Hat Marketplace OLM catalog, in our disconnected OpenShift 4.15.28 cluster.
nvidia-driver-ctr reports a complete installation, and a quick vector-add application (along the lines of the sketch below) ran successfully, so I don't think this issue is blocking.
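For context, the check was roughly the following pod; the image name is a placeholder for whatever CUDA vector-add sample image is mirrored into the disconnected registry:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: <mirrored vector-add sample image>
    resources:
      limits:
        nvidia.com/gpu: 1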
However, we have a few questions regarding nvidia-driver-ctr:
Default UBI Repositories / UBI Base images
The nvidia-driver-ctr container uses ubi8 repositories, while the openshift-driver-toolkit-ctr container uses ubi9 repositories. This is due to their underlying base image versions.
$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-2gcjc -- cat /etc/redhat-release
Red Hat Enterprise Linux release 8.10 (Ootpa)
[openshift@bdf-build8-bastion entitlement]$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/ubi.repo
[ubi-8-baseos-rpms]
name = Red Hat Universal Base Image 8 (RPMs) - BaseOS
baseurl = https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi8/8/$basearch/baseos/os
enabled = 1
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
gpgcheck = 1
....
$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-2gcjc -- cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)
[openshift@bdf-build8-bastion entitlement]$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/ubi.repo
[ubi-9-baseos]
name = Red Hat Universal Base Image 9 (RPMs) - BaseOS
baseurl = https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/$basearch/baseos/os
enabled = 1
gpgkey = file:///etc/pki/rpm-gpg/RPM-GPG-KEY-redhat-release
gpgcheck = 1
...
Our questions here:
Does this difference in UBI repositories potentially break compatibility for the kernel modules?
Will a new version use UBI9 soon?
Entitlement Key Mounting
As we are using Satellite to provide repositories in our disconnected environment, we need to entitle pods to access the different repositories.
$ oc get cm -n ocp-nvidia-gpu-operator yum-repos -o yaml
apiVersion: v1
data:
  redhat.repo: |
    [rhel-9-for-x86_64-baseos-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
    baseurl = https://<9.2 baseos repo>
    enabled = 1
    gpgcheck = 0
    repo_gpgcheck = 0
    sslverify = 0
    module_hotfixes=true
    sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
    sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem
    [rhel-9-for-x86_64-appstream-rpms]
    name = Red Hat Enterprise Linux 9 for x86_64 - AppStream (RPMs)
    baseurl = https://<9.2 appstream repo>
    enabled = 1
    gpgcheck = 0
    sslverify = 0
    repo_gpgcheck = 0
    module_hotfixes=true
    sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
    sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem
kind: ConfigMap
metadata:
  name: yum-repos
  namespace: ocp-nvidia-gpu-operator
$ oc get clusterpolicies.nvidia.com gpu-cluster-policy -o json | jq '.spec.driver.repoConfig'
{
  "configMapName": "yum-repos"
}
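For reference, a minimal sketch of how such a ConfigMap can be created and referenced from the ClusterPolicy, assuming a local redhat.repo file with the contents shown above (object and key names as in our setup):
$ oc -n ocp-nvidia-gpu-operator create configmap yum-repos --from-file=redhat.repo=./redhat.repo
$ oc patch clusterpolicies.nvidia.com gpu-cluster-policy --type merge -p '{"spec":{"driver":{"repoConfig":{"configMapName":"yum-repos"}}}}'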
The relevant ConfigMap (yum-repos) is mounted only to the nvidia-driver-ctr container, not the openshift-driver-toolkit-ctr container.
$ oc -n ocp-nvidia-gpu-operator exec -c openshift-driver-toolkit-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/redhat.repo
#
# Certificate-Based Repositories
# Managed by (rhsm) subscription-manager
#
# *** This file is auto-generated. Changes made here will be over-written. ***
# *** Use "subscription-manager repo-override --help" if you wish to make changes. ***
#
# If this file is empty and this system is subscribed consider
# a "yum repolist" to refresh available repos
#
$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- cat /etc/yum.repos.d/redhat.repo
[rhel-9-for-x86_64-baseos-rpms]
name = Red Hat Enterprise Linux 9 for x86_64 - BaseOS (RPMs)
baseurl = https://<9.2 baseos repo>
enabled = 1
gpgcheck = 0
repo_gpgcheck = 0
sslverify = 0
module_hotfixes=true
sslclientkey = /run/secrets/etc-pki-entitlement/<entitlement_key_name>-key.pem
sslclientcert = /run/secrets/etc-pki-entitlement/<entitlement_key_name>.pem
.....
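For completeness, the entitlement certificates referenced by sslclientkey/sslclientcert can be checked the same way (the exact file names depend on the entitlement key used):
$ oc -n ocp-nvidia-gpu-operator exec -c nvidia-driver-ctr nvidia-driver-daemonset-415.92.202408100433-0-kcpkp -- ls /run/secrets/etc-pki-entitlement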
Our questions here:
Is there a way to inject the yum-repos ConfigMap into the openshift-driver-toolkit-ctr container as well?
If not, should we consider building a custom ImageStream for the GPU Operator that includes a modified driver-toolkit image with the necessary repositories pre-configured? Currently, the Operator injects a sidecar with openshift/istag/driver-toolkit:${RHCOS_VERSION}.
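If we go that route, a rough, hypothetical sketch of such a layered image could look like the following; the base image reference and file names are assumptions on our side, not something documented by the Operator:
FROM image-registry.openshift-image-registry.svc:5000/openshift/driver-toolkit:<RHCOS_VERSION>
# bake the Satellite repository definitions into the toolkit image (hypothetical)
COPY redhat.repo /etc/yum.repos.d/redhat.repo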
openshift-driver-toolkit-ctr Container Logs
The openshift-driver-toolkit-ctr container logs show attempts to enable additional repositories:
Question:
What specific repositories are actually needed for the openshift-driver-toolkit-ctr container to function correctly?
Need for openshift-driver-toolkit-ctr container
Since nvidia-driver-ctr reports a complete installation despite the problems described above, do we really need this openshift-driver-toolkit-ctr sidecar container?
Thank you in advance for your help!