
Conversation

@empovit (Contributor) commented Jul 28, 2025

In disconnected environments, `dnf install` cannot be used without mirroring RPM repositories. As the vGPU manager now requires the lspci command:

  • Include the pciutils RPMs in the container image
  • If the lspci command is not found, install it from the bundled RPMs (a sketch of this fallback follows below)

Signed-off-by: Vitaliy Emporopulo <[email protected]>
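
For illustration only, a minimal sketch of the runtime fallback described above, assuming the RPMs are shipped under /driver/rpms/pciutils as in this change (the exact path is whatever the image bakes in):

#!/bin/bash
# Only fall back to the bundled RPMs when lspci is not already available
if ! command -v lspci >/dev/null 2>&1; then
    echo "lspci not found, installing pciutils from bundled RPMs"
    rpm -ivh /driver/rpms/pciutils/*.rpm
fi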

copy-pr-bot bot commented Jul 28, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@SDBrett commented Jul 29, 2025

This update uses a UBI8 image to download the pciutils RPMs, while the driver toolkit container image is built on UBI9.

Could this introduce a potential issue when trying to install them on the driver toolkit container?

RUN mkdir -p /driver/rpms/pciutils
WORKDIR /driver/rpms/pciutils

RUN dnf download --resolve pciutils && dnf clean all
Contributor

Why not just install the rpm package instead of downloading it and running it later?

Do we want to use pciutils in the DTK container? If so, we should consider baking pciutils into DTK
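
For reference, the direct-install alternative being suggested is just a build-time line in the Dockerfile of whichever image is actually meant to run lspci (illustrative sketch, not tied to a specific base image):

# Install pciutils from the configured repositories at build time instead of bundling RPMs
RUN dnf install -y pciutils && dnf clean all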

Contributor

Yes, currently lspci is needed at runtime by the sriov-manage script that we invoke in the DTK container. Vitaly asked the DTK team if they could include the pciutils package in the DTK image, but they said no.

@tariq1890 (Contributor)

It's also probably a good idea to create a rhel9 version of the vgpu-manager container which will source packages from ubi9 repositories. gpu-operator will no longer support RHEL8-based OCP clusters going forward.
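
A rhel9 variant along those lines might, as a rough sketch, mirror the existing rhel8 Dockerfile but start from a UBI9-based CUDA image and install pciutils directly (the exact base image tag is an assumption):

FROM nvcr.io/nvidia/cuda:12.9.1-base-ubi9

# pciutils comes straight from the ubi9 repositories, no RPM bundling or shared-dir hand-off needed
RUN dnf install -y pciutils && dnf clean all

ARG DRIVER_VERSION
ENV DRIVER_VERSION=$DRIVER_VERSION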

@SDBrett commented Jul 29, 2025 via email

@empovit (Contributor, Author) commented Jul 30, 2025

> It means the DTK and vgpu-manager images will have to be updated manually during every cluster upgrade.

The driver toolkit image has to match the cluster's kernel version, and is automatically managed for each OpenShift version. As the purpose of DTK is building drivers, the team is saying that lspci does not belong in the image.

I like @tariq1890 's idea to bump the base image.

Other options I can think of:

  1. Multi-stage build with UBI 9 (not DTK). This option will require re-building vgpu-manager when a cluster (and DTK) goes from RHEL 9 to RHEL 10. Unfortunately, the same issue exists even if the vgpu-manager itself becomes UBI 9.
  2. Have a tiny image for just lspci, or have lspci in the vgpu-manager image and pass the data (not the binary) between containers.
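
For what it's worth, option 2 could look roughly like this on the vgpu-manager side, writing the lspci output into the shared directory that the DTK entrypoint already uses (the file name is made up for illustration):

# vgpu-manager container: capture NVIDIA PCI data and share it with the DTK container
lspci -D -d 10de: > "${DRIVER_TOOLKIT_SHARED_DIR}/lspci-nvidia.txt"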

@SDBrett commented Jul 31, 2025

> It means the DTK and vgpu-manager images will have to be updated manually during every cluster upgrade. The driver toolkit image has to match the cluster's kernel version, and is automatically managed for each OpenShift version. As the purpose of DTK is building drivers, the team is saying that lspci does not belong in the image.
>
> I like @tariq1890 's idea to bump the base image.
>
> Other options I can think of:
>
>   1. Multi-stage build with UBI 9 (not DTK). This option will require re-building vgpu-manager when a cluster (and DTK) goes from RHEL 9 to RHEL 10. Unfortunately, the same issue exists even if the vgpu-manager itself becomes UBI 9.
>   2. Have a tiny image for just lspci, or have lspci in the vgpu-manager image and pass the data (not the binary) between containers.

The multistage build is how I got things working for the customer. Below is the diff of those changes.

diff --git a/vgpu-manager/rhel8/Dockerfile b/vgpu-manager/rhel8/Dockerfile
index da9c12f..76f68a0 100644
--- a/vgpu-manager/rhel8/Dockerfile
+++ b/vgpu-manager/rhel8/Dockerfile
@@ -1,3 +1,8 @@
+FROM registry.redhat.io/ubi9/ubi:9.6 AS ubi9
+RUN mkdir -p /rpms/pciutils
+WORKDIR /rpms/pciutils
+RUN dnf download --resolve pciutils
+
 FROM nvcr.io/nvidia/cuda:12.9.1-base-ubi8
 
 ARG DRIVER_VERSION
@@ -5,7 +10,9 @@ ENV DRIVER_VERSION=$DRIVER_VERSION
 ARG DRIVER_ARCH=x86_64
 ENV DRIVER_ARCH=$DRIVER_ARCH
 
-RUN mkdir -p /driver
+RUN mkdir -p /driver/rpms
+COPY --from=ubi9 /rpms/ /driver/rpms
+
 WORKDIR /driver
 COPY NVIDIA-Linux-${DRIVER_ARCH}-${DRIVER_VERSION}-vgpu-kvm.run .
 RUN chmod +x NVIDIA-Linux-${DRIVER_ARCH}-${DRIVER_VERSION}-vgpu-kvm.run
diff --git a/vgpu-manager/rhel8/ocp_dtk_entrypoint b/vgpu-manager/rhel8/ocp_dtk_entrypoint
index e5d502f..3a21bd2 100755
--- a/vgpu-manager/rhel8/ocp_dtk_entrypoint
+++ b/vgpu-manager/rhel8/ocp_dtk_entrypoint
@@ -96,6 +96,8 @@ dtk-build-driver() {
        "$DRIVER_TOOLKIT_SHARED_DIR/nvidia-driver" \
        "${DRIVER_TOOLKIT_SHARED_DIR}/bin"
 
+    rpm -ivh ${DRIVER_TOOLKIT_SHARED_DIR}/driver/rpms/pciutils/*.rpm
+
     export PATH="${DRIVER_TOOLKIT_SHARED_DIR}/bin:$PATH";
 
     # ensure lspci is installed, as 'sriov-manage' script requires it

@tariq1890 (Contributor)

> I like @tariq1890 's idea to bump the base image.

Just to clarify, what I meant was adding a new directory for rhel9, the same way we do for the data centre driver.

After thinking about this a bit more. Here is the approach I propose:

i) dnf install -y pciutils at build-time
ii) Refactor the code to ensure that sriov-manage is being run from within the main container and not the DTK.

I believe the above is cleaner, as we don't need to resort to the hack of downloading the rpm, placing it in a shared dir, and installing it in a container that was never meant to have or run it.

Either we implement the above or we convince the DTK developers to include lspci in their container images :)
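
As a rough illustration of (i) and (ii) above (paths and arguments are assumptions, not the final implementation): at build time the vgpu-manager Dockerfile would simply do

# Build-time install in the vgpu-manager image itself
RUN dnf install -y pciutils && dnf clean all

and at runtime the entrypoint would invoke sriov-manage from the main vgpu-manager container rather than from the DTK, e.g.

# Run from the vgpu-manager container, where lspci is now guaranteed to exist
/usr/lib/nvidia/sriov-manage -e ALL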

@empovit empovit changed the title Include lspci in vgpu-manager image [WIP] Include lspci in vgpu-manager image Aug 20, 2025
@empovit empovit marked this pull request as draft August 20, 2025 12:53