Driver Validation Support on Custom Driver Installation Path #659
Comments
@Dragoncell thanks for the details here. It makes sense to make |
One aspect that I forgot -- the driver installation folder in COS does not represent a "driver root" in the classical Operator sense since we don't have |
/cc |
Hi folks, we are also interested in this. https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960 gets us halfway there, and I imagine it is close to being merged. Afterwards, I will jump in and create a PR for a Glad to see more interest in this, hopefully it will help things progress faster. Also, thanks a lot @cdesiniotis for the reviews and suggestions. |
Thanks for taking a look at the issue, cdesiniotis, and thanks to neoaggelos for the MR: https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/960 As cdesiniotis pointed out, our use case is a little different from the custom hostRoot, because the driver installed on the host doesn't have /dev nodes and can't be chrooted into. Therefore in our case the configuration looks like
I see the required changes are like:
Would you prefer to make the changes in one MR, or to separate them from the hostRoot change? I'd be glad to hear your thoughts on it.
1. Quick Debug Information
OS: COS/Ubuntu
Container runtime: Containerd
Kubernetes flavor: GKE
GPU Operator version: Any version
2. Issue or feature description
Driver validation is a prerequisite for the GPU Operator to work properly. Currently, it supports two driver installation locations: the default driver root (/run/nvidia/driver) and the host root (/). The validator checks the driver by chrooting into the driver installation path and running nvidia-smi.
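To make the current behaviour concrete, here is a minimal Python sketch of the check described above. It is not the validator's actual code (the GPU Operator is written in Go); the function names are hypothetical, and treating any other root as unsupported is a simplification based on this description.

```python
import subprocess
from typing import List

DEFAULT_DRIVER_ROOT = "/run/nvidia/driver"  # driver installed by the driver container
HOST_ROOT = "/"                             # driver preinstalled on the host


def build_validation_command(driver_root: str) -> List[str]:
    """Return the argv that runs nvidia-smi chrooted into the driver root."""
    # Assumption: only the two roots named in this issue are accepted.
    if driver_root not in (DEFAULT_DRIVER_ROOT, HOST_ROOT):
        raise ValueError(f"unsupported driver root: {driver_root}")
    return ["chroot", driver_root, "nvidia-smi"]


def validate_driver(driver_root: str) -> bool:
    """Run the check. Needs root privileges and an installed driver."""
    result = subprocess.run(build_validation_command(driver_root), check=False)
    return result.returncode == 0
```

With this model, a driver under /home/kubernetes/bin/nvidia is rejected before nvidia-smi is ever run, which is the gap this issue is about.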
However, the way GKE installs and manages the GPU driver is not compatible with what the GPU Operator assumes:
After driver installation, the Operator assumes that the file /run/nvidia/validations/.driver-ctr-ready exists. [assertion]
The driver is installed in a custom path (/home/kubernetes/bin/nvidia), so the GPU Operator can't discover the driver's libraries unless it is told the path.
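The two mismatches above can be expressed as a small pre-flight check. This is an illustrative Python sketch only: the function name and the returned messages are hypothetical, and only the file name and paths come from this issue.

```python
from pathlib import Path
from typing import List

READY_FILE_NAME = ".driver-ctr-ready"
KNOWN_DRIVER_ROOTS = ("/run/nvidia/driver", "/")


def find_preflight_problems(validations_dir: str, driver_install_dir: str) -> List[str]:
    """List the reasons why driver validation would fail on this node."""
    problems = []
    # 1. The Operator expects the ready file to exist after driver installation.
    if not (Path(validations_dir) / READY_FILE_NAME).is_file():
        problems.append("ready file missing")
    # 2. The Operator only looks for the driver under the roots it knows about.
    if driver_install_dir not in KNOWN_DRIVER_ROOTS:
        problems.append("driver installed in an unknown path")
    return problems
```

On a GKE node, calling it with /run/nvidia/validations and /home/kubernetes/bin/nvidia would be expected to report both problems, since GKE's installer does not create the ready file.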
To make driver validation compatible with GKE, the following areas require changes:
Support a custom driver/library path in the GPU Operator.
When the driver is disabled (driver.enabled set to false), the user can set a specific driver installation path, e.g. /home/kubernetes/bin/nvidia, and the GPU Operator then uses this path automatically in its configuration.
Within the validator logic, when a custom driver path is detected, the validator could skip the driver-ctr-ready file assertion. It would also need to change the way it runs nvidia-smi.
Support a custom driver path within GPU Operator components (e.g. the device plugin). The existing device plugin supports a custom root, but the Operator doesn't support passing this parameter to the device plugin, the container toolkit, and other components.
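A possible shape for the proposed validator logic is sketched below in Python. Everything here is an assumption made for illustration: `ValidationPlan` and `plan_validation` are hypothetical names, and the `bin/` and `lib64/` subdirectories under the custom path are an assumed layout that should be checked against the actual GKE installation. Because the custom path cannot be chrooted into, the sketch runs nvidia-smi directly and points the dynamic loader at the driver libraries through LD_LIBRARY_PATH.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ValidationPlan:
    check_ready_file: bool   # whether to assert that .driver-ctr-ready exists
    command: List[str]       # argv used to run nvidia-smi
    env: Dict[str, str] = field(default_factory=dict)


def plan_validation(driver_enabled: bool,
                    custom_driver_path: Optional[str] = None) -> ValidationPlan:
    """Decide how to validate the driver for the given configuration."""
    if not driver_enabled and custom_driver_path:
        root = custom_driver_path.rstrip("/")
        # Custom path: skip the ready-file assertion and do not chroot.
        # Assumed layout: <root>/bin/nvidia-smi and <root>/lib64/*.so
        return ValidationPlan(
            check_ready_file=False,
            command=[f"{root}/bin/nvidia-smi"],
            env={"LD_LIBRARY_PATH": f"{root}/lib64"},
        )
    # Otherwise keep the existing behaviour (simplified): chroot and run.
    root = "/run/nvidia/driver" if driver_enabled else "/"
    return ValidationPlan(check_ready_file=True,
                          command=["chroot", root, "nvidia-smi"])
```

For example, `plan_validation(False, "/home/kubernetes/bin/nvidia")` yields a plan that skips the ready-file check and runs /home/kubernetes/bin/nvidia/bin/nvidia-smi. The same path would still need to be passed on to the device plugin and container toolkit, which this sketch does not cover.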
3. Steps to reproduce the issue
Deploy the GPU Operator on Ubuntu or COS nodes that use the GKE-installed driver; the driver installation check will fail.