repoConfig is not mounted into GDS container #608
Comments
@age9990 We are planning to fix this in v23.9.1 (ETA next week). Meanwhile, if you want to try out the early bits, use the following.
@shivamerla I tried v23.9.1 today, and the repoConfig is still not mounted as an additional volume. I also tried the cert-config, and it is not mounted either.
@age9990 Can you share the pod yaml, the describe output, and the pod logs when you try it with the NVIDIADriver CR?
@tariq1890 Helm values.yaml and driver pod yaml attached.
Can you share the NVIDIADriver CR yaml? You need to make sure that the
@tariq1890 repoConfig is present in both the ClusterPolicy CR and the NVIDIADriver CR. As you can see from the driver pod yaml file, it is mounted in the nvidia-driver-ctr and nvidia-peermem-ctr containers.
Hey @age9990, thanks for bringing this to our notice. We have confirmed that there is a bug in how the GDS container image names are generated. We will publish the fix in the next planned release. In the meantime, can you try this image?
@tariq1890 Thanks for fixing the image name issue. What about the repoConfig volumeMounts issue?
Hi @tariq1890, I see that the fixes for these issues have been merged into the master branch. Can we expect v23.9.2 to be released soon?
Hi @age9990, GPU Operator 24.3.0 has been released and contains a fix for this issue. I am closing this issue, but please re-open it if you still encounter this with 24.3.0.
1. Quick Debug Information
2. Issue or feature description
From the code we can see that the repoConfig is not mounted into the GDS container, so the apt repository cannot be pointed at an on-premise repository, which leaves the container in the CrashLoopBackOff state.
The nvidia-fs-ctr container should contain the following:
https://github.com/NVIDIA/gpu-operator/blob/master/manifests/state-driver/0500_daemonset.yaml
{{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
{{- range .AdditionalConfigs.VolumeMounts }}
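To make the requested behaviour concrete, here is a minimal Go sketch of how a guarded range like the one above could render extra volume mounts into the nvidia-fs-ctr container. The types `VolumeMount`, `AdditionalConfigs`, and `renderData`, the `renderContainer` helper, the fields emitted inside the range, and the `run-nvidia` and `repo-config` mounts are assumptions made for illustration; they are not copied from the operator's source or manifest.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// VolumeMount, AdditionalConfigs and renderData are hypothetical stand-ins
// for the operator's render data; the real types may differ.
type VolumeMount struct {
	Name      string
	MountPath string
	ReadOnly  bool
}

type AdditionalConfigs struct {
	VolumeMounts []VolumeMount
}

type renderData struct {
	AdditionalConfigs *AdditionalConfigs
}

// containerTmpl is an illustrative fragment of a container spec. The guarded
// range emits one volumeMounts entry per additional mount and nothing when
// the list is empty. The fields inside the range are assumed.
const containerTmpl = `- name: nvidia-fs-ctr
  volumeMounts:
  - name: run-nvidia
    mountPath: /run/nvidia
{{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
{{- range .AdditionalConfigs.VolumeMounts }}
  - name: {{ .Name }}
    mountPath: {{ .MountPath }}
    readOnly: {{ .ReadOnly }}
{{- end }}
{{- end }}
`

// renderContainer parses the fragment and executes it against data.
func renderContainer(data renderData) (string, error) {
	tmpl, err := template.New("ctr").Parse(containerTmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tmpl.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Illustrative mount for an on-premise apt repository configuration.
	out, err := renderContainer(renderData{
		AdditionalConfigs: &AdditionalConfigs{
			VolumeMounts: []VolumeMount{
				{Name: "repo-config", MountPath: "/etc/apt/sources.list.d", ReadOnly: true},
			},
		},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Without this block in the nvidia-fs-ctr section of the template, the mounts configured through repoConfig never reach the GDS container, which matches the behaviour reported here.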
In addition, the GDS image name should have the OS information appended, as is done for the NVIDIA driver pod.
The default values.yaml causes an image pull back-off because the image tag is incorrect: it is missing the OS suffix and should be 2.16.1-ubuntu20.04.
gds:
  version: "2.16.1"
In the code, the OS is not used to construct the GDS imagePath:
gpu-operator/internal/state/driver.go
Line 533 in 79fe1cc
The driver image path, by contrast, does reference the OS:
https://github.com/NVIDIA/gpu-operator/blob/master/internal/state/driver.go#L472
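A rough sketch of the tag construction being asked for, assuming the GDS image should follow the same `<version>-<os>` convention as the driver image. `gdsImagePath` is a hypothetical helper, not the operator's actual function, and the repository and image names in `main` are illustrative values; only the version "2.16.1" and the expected tag 2.16.1-ubuntu20.04 come from this report. The digest branch is also an assumption about how pinned images would be handled.

```go
package main

import (
	"fmt"
	"strings"
)

// gdsImagePath is a hypothetical helper that builds a full image reference.
// For a plain version it appends the OS tag, giving "<version>-<os>".
// For a digest it returns "<repo>/<image>@<digest>" with no OS suffix,
// since a digest already identifies one exact image (assumed behaviour).
func gdsImagePath(repository, image, version, osTag string) string {
	if strings.HasPrefix(version, "sha256:") {
		return fmt.Sprintf("%s/%s@%s", repository, image, version)
	}
	return fmt.Sprintf("%s/%s:%s-%s", repository, image, version, osTag)
}

func main() {
	// Repository and image names are illustrative assumptions.
	fmt.Println(gdsImagePath("nvcr.io/nvidia/cloud-native", "nvidia-fs", "2.16.1", "ubuntu20.04"))
}
```

With the OS tag left out of this step, the reference ends in `:2.16.1`, which is the tag that fails to pull with the default values.yaml.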
3. Steps to reproduce the issue
Enable GDS, and the issue is reproduced.
@shivamerla Please help resolve these issues so that GDS can be used properly.