
Driver Validation Support on Custom Driver Installation Path #659

Open · Dragoncell opened this issue Jan 20, 2024 · 6 comments

Dragoncell commented Jan 20, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
    COS/Ubuntu
  • Kernel Version:
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
    Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
    GKE
  • GPU Operator Version:
    Any version

2. Issue or feature description

Driver validation is a prerequisite step for the GPU Operator to work properly. Currently, it supports two driver installation locations: the default root (/run/nvidia/driver) or the host root (/). The validator checks the driver by chrooting into the driver installation path and running nvidia-smi.
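
For illustration, a minimal Go sketch of that check (a simplified assumption of the flow, not the actual validator source):

```go
// Minimal sketch of the check described above (a simplified assumption,
// not the actual validator source): chroot into the driver installation
// path and run nvidia-smi.
package main

import (
	"log"
	"os/exec"
)

func main() {
	driverRoot := "/run/nvidia/driver" // or "/" when the driver is preinstalled on the host
	out, err := exec.Command("chroot", driverRoot, "nvidia-smi").CombinedOutput()
	if err != nil {
		log.Fatalf("driver validation failed: %v\n%s", err, out)
	}
	log.Printf("driver validation succeeded:\n%s", out)
}
```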

However, the way GKE installs and manages the GPU driver is not compatible with these assumptions:

  • After driver installation, the Operator asserts that the file /run/nvidia/validations/.driver-ctr-ready exists.
  • The driver is installed in a custom path (/home/kubernetes/bin/nvidia), so the GPU Operator can't discover the driver's libraries unless that path is passed to the Operator.

To make driver validation compatible with GKE, the following areas require changes:

  1. Support a custom driver/library path in the GPU Operator
    When the driver is disabled (driver.enabled=false), the user can set a specific driver installation path, e.g. /home/kubernetes/bin/nvidia, and the GPU Operator then automatically uses this path in its configuration.
    Within the validator logic, when a custom driver path is detected, it could skip the driver-ctr-ready file assertion and change the way it runs nvidia-smi (see the sketch after this list).

  2. Support a custom driver path within the GPU Operator components (e.g. the device plugin). The device plugin already supports a custom root, but the Operator doesn't support passing this parameter to the device plugin, container toolkit, and other components.
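
A hypothetical sketch of item 1, just to make the proposed gating concrete. The NVIDIA_DRIVER_ROOT environment variable name is my placeholder for whatever option the Operator would expose, not an existing setting:

```go
// Hypothetical sketch of item 1 above: only assert the .driver-ctr-ready
// file when the Operator-managed driver container is in use; with a
// user-supplied driver path the assertion is skipped. The NVIDIA_DRIVER_ROOT
// variable name is an assumption, not an existing Operator option.
package main

import (
	"log"
	"os"
)

const (
	defaultDriverRoot = "/run/nvidia/driver"
	driverReadyFile   = "/run/nvidia/validations/.driver-ctr-ready"
)

// driverRoot returns the driver installation path to validate and whether
// the driver-container readiness assertion applies to it.
func driverRoot() (root string, requireReadyFile bool) {
	if custom := os.Getenv("NVIDIA_DRIVER_ROOT"); custom != "" && custom != defaultDriverRoot {
		// Custom path such as /home/kubernetes/bin/nvidia: no driver
		// container is involved, so no readiness file is expected.
		return custom, false
	}
	return defaultDriverRoot, true
}

func main() {
	root, requireReadyFile := driverRoot()
	if requireReadyFile {
		if _, err := os.Stat(driverReadyFile); err != nil {
			log.Fatalf("driver container not ready: %v", err)
		}
	}
	log.Printf("validating driver under %s", root)
	// ...run nvidia-smi against root here (later comments discuss how this
	// differs when the root cannot be chrooted into).
}
```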

3. Steps to reproduce the issue

Deploy the GPU Operator on Ubuntu or COS nodes with the GKE-installed driver; the driver installation check will fail.

@cdesiniotis
Contributor

@Dragoncell thanks for the details here. It makes sense to make hostDriverRoot configurable and to propagate this setting throughout all of our components which depend on it. In fact, there is an open PR for introducing a similar hostRoot option https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960. A natural follow-up to this PR would be to introduce a hostDriverRoot option, as hostRoot may not equal hostDriverRoot (as is the case with GKE).

@cdesiniotis
Contributor

One aspect that I forgot -- the driver installation folder in COS does not represent a "driver root" in the classical Operator sense since we don't have /dev nodes there and one cannot chroot into it. So the enablement here will be more complex than just adding a new hostDriverRoot option.
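
An illustrative helper for the distinction described here (an assumption for discussion, not existing Operator code): a "classical" driver root such as /run/nvidia/driver contains /dev nodes and can be used as a chroot target, while a plain installation folder such as /home/kubernetes/bin/nvidia cannot.

```go
// Illustrative helper (an assumption for discussion, not existing Operator
// code): distinguish a full driver root that can be chrooted into from a
// plain directory of driver files.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// isChrootableDriverRoot reports whether root looks like a full driver root
// (has a dev directory) rather than just a directory of driver files.
func isChrootableDriverRoot(root string) bool {
	info, err := os.Stat(filepath.Join(root, "dev"))
	return err == nil && info.IsDir()
}

func main() {
	for _, root := range []string{"/run/nvidia/driver", "/home/kubernetes/bin/nvidia"} {
		fmt.Printf("%s chrootable: %v\n", root, isChrootableDriverRoot(root))
	}
}
```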

@bobbypage

/cc

@neoaggelos
Contributor

neoaggelos commented Jan 26, 2024

Hi folks, indeed we are also interested in this. https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960 gets us halfway there, and I imagine it is close to being merged.

Afterwards, I will jump in and create a PR for a hostDriverRoot as well.

Glad to see more interest in this; hopefully it will help things progress faster.

Also thanks a lot @cdesiniotis for the reviews and suggestions.

@Dragoncell
Author

Dragoncell commented Feb 13, 2024

Thanks for looking into the issue, @cdesiniotis, and thanks for the MR from @neoaggelos: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960

As pointed out by cdesiniotis, our use case is a little different from the custom hostRoot case, as the driver installed on the host doesn't have /dev nodes and can't be chrooted. Therefore, in our case the configuration looks like:

hostRoot = /
hostDriverRoot = /home/kubernetes/bin/nvidia 

I see the required changes as:

  1. Introduce hostDriverRoot as well, similar to the hostRoot MR
  2. Update the validation logic based on the hostDriverRoot (whether it has /dev nodes or not); a rough sketch of both changes follows below:
    a) Replace the chroot-based check if the hostDriverRoot can't be chrooted, i.e. where the validator currently does:
    command := "chroot"

    b) In the hostRoot check, use the hostDriverRoot instead of /usr/bin, i.e. where it currently does:
    if fileInfo, err := os.Lstat("/host/usr/bin/nvidia-smi"); err == nil && fileInfo.Size() != 0 {
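
For concreteness, a rough sketch of what 2a) and 2b) could look like together. The function names, the /host mount point, and the bin/lib64 layout under the driver root are my assumptions, not the actual validator code:

```go
// Rough sketch of changes 2a) and 2b) above; function names, mount points,
// and the bin/lib64 layout under the driver root are assumptions, not the
// actual validator code.
package main

import (
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"
)

// validateHostDriver validates a driver preinstalled on the host under
// hostDriverRoot, assuming the host filesystem is mounted at /host.
func validateHostDriver(hostDriverRoot string) error {
	// 2b) Look for nvidia-smi under the hostDriverRoot instead of the
	// hard-coded /host/usr/bin/nvidia-smi.
	smi := filepath.Join("/host", hostDriverRoot, "bin", "nvidia-smi")
	if fileInfo, err := os.Lstat(smi); err != nil || fileInfo.Size() == 0 {
		return fmt.Errorf("nvidia-smi not found under %s: %v", hostDriverRoot, err)
	}

	// 2a) The hostDriverRoot has no /dev nodes, so instead of
	// `chroot <hostDriverRoot> nvidia-smi`, chroot into the host root and
	// run the binary from the driver directory, pointing the dynamic loader
	// at the driver's libraries.
	cmd := exec.Command("chroot", "/host", filepath.Join(hostDriverRoot, "bin", "nvidia-smi"))
	cmd.Env = append(os.Environ(),
		"LD_LIBRARY_PATH="+filepath.Join(hostDriverRoot, "lib64"))
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("nvidia-smi failed: %v\n%s", err, out)
	}
	return nil
}

func main() {
	// hostRoot = /, hostDriverRoot = /home/kubernetes/bin/nvidia (GKE)
	if err := validateHostDriver("/home/kubernetes/bin/nvidia"); err != nil {
		log.Fatal(err)
	}
	log.Println("driver validation succeeded")
}
```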

Do you prefer to make these changes in one MR, or keep them separate from the hostRoot one? I'd be glad to hear your thoughts.
