
Driver Validation Support on Custom Driver Installation Path #659

Open · Dragoncell opened this issue Jan 20, 2024 · 6 comments

Dragoncell commented Jan 20, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
    COS/Ubuntu
  • Kernel Version:
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
    Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
    GKE
  • GPU Operator Version:
    Any version

2. Issue or feature description

Driver validation is a prerequisite step for the GPU Operator to work properly. Currently, it supports two driver installation locations: the default root (/run/nvidia/driver) or the host root (/). The validator checks the driver by chrooting into the driver installation path and running nvidia-smi.
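
For illustration, a minimal Go sketch of that check (a simplified assumption of the flow, not the actual validator source):

```go
// Minimal sketch of the check described above (a simplified assumption,
// not the actual validator source): chroot into the driver installation
// path and run nvidia-smi.
package main

import (
	"log"
	"os/exec"
)

func main() {
	driverRoot := "/run/nvidia/driver" // or "/" when the driver is preinstalled on the host
	out, err := exec.Command("chroot", driverRoot, "nvidia-smi").CombinedOutput()
	if err != nil {
		log.Fatalf("driver validation failed: %v\n%s", err, out)
	}
	log.Printf("driver validation succeeded:\n%s", out)
}
```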

However, the way GKE installs and manages the GPU driver is not compatible with these assumptions:

  • After driver installation, the Operator asserts that the file /run/nvidia/validations/.driver-ctr-ready exists.
  • The driver is installed in a custom path (/home/kubernetes/bin/nvidia), so the GPU Operator can't discover the driver's libraries unless that path is passed to the Operator.

To make driver validation compatible with GKE, the following areas require changes:

  1. Support a custom driver/library path in the GPU Operator
    When the driver is disabled (driver.enabled=false), the user can set a specific driver installation path, e.g. /home/kubernetes/bin/nvidia, and the GPU Operator then automatically uses this path in its configuration.
    Within the validator logic, when a custom driver path is detected, it could skip the driver-ctr-ready file assertion and change the way it runs nvidia-smi (see the sketch after this list).

  2. Support a custom driver path within the GPU Operator components (e.g. the device plugin). The device plugin already supports a custom root, but the Operator doesn't support passing this parameter to the device plugin, container toolkit, and other components.
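
A hypothetical sketch of item 1, just to make the proposed gating concrete. The NVIDIA_DRIVER_ROOT environment variable name is my placeholder for whatever option the Operator would expose, not an existing setting:

```go
// Hypothetical sketch of item 1 above: only assert the .driver-ctr-ready
// file when the Operator-managed driver container is in use; with a
// user-supplied driver path the assertion is skipped. The NVIDIA_DRIVER_ROOT
// variable name is an assumption, not an existing Operator option.
package main

import (
	"log"
	"os"
)

const (
	defaultDriverRoot = "/run/nvidia/driver"
	driverReadyFile   = "/run/nvidia/validations/.driver-ctr-ready"
)

// driverRoot returns the driver installation path to validate and whether
// the driver-container readiness assertion applies to it.
func driverRoot() (root string, requireReadyFile bool) {
	if custom := os.Getenv("NVIDIA_DRIVER_ROOT"); custom != "" && custom != defaultDriverRoot {
		// Custom path such as /home/kubernetes/bin/nvidia: no driver
		// container is involved, so no readiness file is expected.
		return custom, false
	}
	return defaultDriverRoot, true
}

func main() {
	root, requireReadyFile := driverRoot()
	if requireReadyFile {
		if _, err := os.Stat(driverReadyFile); err != nil {
			log.Fatalf("driver container not ready: %v", err)
		}
	}
	log.Printf("validating driver under %s", root)
	// ...run nvidia-smi against root here (later comments discuss how this
	// differs when the root cannot be chrooted into).
}
```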

3. Steps to reproduce the issue

Deploy the GPU Operator on Ubuntu or COS nodes with the GKE-installed driver; the driver installation check will fail.

@cdesiniotis
Contributor

@Dragoncell thanks for the details here. It makes sense to make hostDriverRoot configurable and to propagate this setting throughout all of our components which depend on it. In fact, there is an open PR for introducing a similar hostRoot option https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960. A natural follow-up to this PR would be to introduce a hostDriverRoot option, as hostRoot may not equal hostDriverRoot (as is the case with GKE).

@cdesiniotis
Contributor

One aspect that I forgot -- the driver installation folder in COS does not represent a "driver root" in the classical Operator sense since we don't have /dev nodes there and one cannot chroot into it. So the enablement here will be more complex than just adding a new hostDriverRoot option.
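
An illustrative helper for the distinction described here (an assumption for discussion, not existing Operator code): a "classical" driver root such as /run/nvidia/driver contains /dev nodes and can be used as a chroot target, while a plain installation folder such as /home/kubernetes/bin/nvidia cannot.

```go
// Illustrative helper (an assumption for discussion, not existing Operator
// code): distinguish a full driver root that can be chrooted into from a
// plain directory of driver files.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isChrootableDriverRoot reports whether root looks like a full driver root
// (has a dev directory) rather than just a directory of driver files.
func isChrootableDriverRoot(root string) bool {
	info, err := os.Stat(filepath.Join(root, "dev"))
	return err == nil && info.IsDir()
}

func main() {
	for _, root := range []string{"/run/nvidia/driver", "/home/kubernetes/bin/nvidia"} {
		fmt.Printf("%s chrootable: %v\n", root, isChrootableDriverRoot(root))
	}
}
```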

@bobbypage

/cc

@neoaggelos
Contributor

neoaggelos commented Jan 26, 2024

Hi folks, indeed we are also interested in this. https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960 gets us halfway there, and I imagine it is close to being merged.

Afterwards, I will jump in and create a PR for a hostDriverRoot as well.

Glad to see more interest in this; hopefully it will help things progress faster.

Also thanks a lot @cdesiniotis for the reviews and suggestions.

@Dragoncell
Author

Dragoncell commented Feb 13, 2024

Thanks for looking into the issue, @cdesiniotis, and thanks for the MR from @neoaggelos: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960

As pointed out by cdesiniotis, our use case is a little different from the custom hostRoot case, as the driver installed on the host doesn't have /dev nodes and can't be chrooted. Therefore, in our case the configuration looks like:

hostRoot = /
hostDriverRoot = /home/kubernetes/bin/nvidia 

I see the required changes as:

  1. Introduce hostDriverRoot as well, similar to the hostRoot MR
  2. Update the validation logic based on the hostDriverRoot (whether it has /dev nodes or not); a rough sketch of both changes follows below:
    a) Replace the chroot-based check if the hostDriverRoot can't be chrooted, i.e. where the validator currently does:
    command := "chroot"

    b) In the hostRoot check, use the hostDriverRoot instead of /usr/bin, i.e. where it currently does:
    if fileInfo, err := os.Lstat("/host/usr/bin/nvidia-smi"); err == nil && fileInfo.Size() != 0 {
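
For concreteness, a rough sketch of what 2a) and 2b) could look like together. The function names, the /host mount point, and the bin/lib64 layout under the driver root are my assumptions, not the actual validator code:

```go
// Rough sketch of changes 2a) and 2b) above; function names, mount points,
// and the bin/lib64 layout under the driver root are assumptions, not the
// actual validator code.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

// validateHostDriver validates a driver preinstalled on the host under
// hostDriverRoot, assuming the host filesystem is mounted at /host.
func validateHostDriver(hostDriverRoot string) error {
	// 2b) Look for nvidia-smi under the hostDriverRoot instead of the
	// hard-coded /host/usr/bin/nvidia-smi.
	smi := filepath.Join("/host", hostDriverRoot, "bin", "nvidia-smi")
	if fileInfo, err := os.Lstat(smi); err != nil || fileInfo.Size() == 0 {
		return fmt.Errorf("nvidia-smi not found under %s: %v", hostDriverRoot, err)
	}

	// 2a) The hostDriverRoot has no /dev nodes, so instead of
	// `chroot <hostDriverRoot> nvidia-smi`, chroot into the host root and
	// run the binary from the driver directory, pointing the dynamic loader
	// at the driver's libraries.
	cmd := exec.Command("chroot", "/host", filepath.Join(hostDriverRoot, "bin", "nvidia-smi"))
	cmd.Env = append(os.Environ(),
		"LD_LIBRARY_PATH="+filepath.Join(hostDriverRoot, "lib64"))
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("nvidia-smi failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	// hostRoot = /, hostDriverRoot = /home/kubernetes/bin/nvidia (GKE)
	if err := validateHostDriver("/home/kubernetes/bin/nvidia"); err != nil {
		log.Fatal(err)
	}
	log.Println("driver validation succeeded")
}
```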

Do you prefer to make these changes in one MR, or keep them separate from the hostRoot one? I'd be glad to hear your thoughts.
