repoConfig is not mounted into GDS container #608

Closed
age9990 opened this issue Nov 13, 2023 · 10 comments

Comments

@age9990

age9990 commented Nov 13, 2023

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu20.04
  • Kernel Version: 5.15.x
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): CRI-O
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: v23.9.0

2. Issue or feature description

  1. From the code we can see that repoConfig is not mounted into the GDS container, so the apt repository cannot be pointed at an on-premises repository, which leaves the container in the CrashLoopBackOff state.
    The nvidia-fs-ctr container should contain the following:
    https://github.com/NVIDIA/gpu-operator/blob/master/manifests/state-driver/0500_daemonset.yaml
    {{- if and .AdditionalConfigs .AdditionalConfigs.VolumeMounts }}
    {{- range .AdditionalConfigs.VolumeMounts }}

  2. In addition, the GDS image name should include the OS info, as is done for the NVIDIA driver pod.
    The default values.yaml causes an image pull backoff because the image tag is incorrect (the OS is missing; it should be 2.16.1-ubuntu20.04):
    gds:
      version: "2.16.1"
    From the code, the OS is not used to construct imagePath.

    func getGDSSpec(spec *nvidiav1alpha1.NVIDIADriverSpec) (*gdsDriverSpec, error) {

    The driver image path, by contrast, does reference the OS:
    https://github.com/NVIDIA/gpu-operator/blob/master/internal/state/driver.go#L472
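
To make the expected tag concrete, here is a minimal sketch of building an image path with an OS suffix. It is not the operator's code: `gdsImagePath`, the registry `registry.example.com/nvidia`, and the image name `nvidia-fs` are placeholders chosen for illustration. Only the version `2.16.1` and the expected tag `2.16.1-ubuntu20.04` come from this report.

```go
package main

import "fmt"

// gdsImagePath is a hypothetical helper (not a gpu-operator function).
// It joins a repository, an image name, and a version, and appends the
// OS identifier to the tag when one is given.
func gdsImagePath(repository, image, version, osTag string) string {
	tag := version
	if osTag != "" {
		// Without this step the tag stays "2.16.1", which is the
		// incorrect tag described in this report.
		tag = version + "-" + osTag
	}
	return fmt.Sprintf("%s/%s:%s", repository, image, tag)
}

func main() {
	// Placeholder registry and image name; only the version and OS
	// values are taken from the issue.
	fmt.Println(gdsImagePath("registry.example.com/nvidia", "nvidia-fs", "2.16.1", "ubuntu20.04"))
	// prints "registry.example.com/nvidia/nvidia-fs:2.16.1-ubuntu20.04"
}
```

Passing an empty `osTag` reproduces the bare `2.16.1` tag that fails to pull.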

3. Steps to reproduce the issue

Enable GDS and the issue reproduces.

@shivamerla Please help to resolve these issues to use GDS properly.

age9990 changed the title from "Not able to run GDS container in air-gapped environment" to "repoConfig is not mounted into GDS container" on Nov 23, 2023
@shivamerla
Contributor

@age9990 We are planning to fix this in v23.9.1 (ETA next week). In the meantime, if you want to try the early bits, use the following:

--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes/gpu-operator/staging
--set operator.version=master-latest-ubi8

@age9990
Author

age9990 commented Dec 11, 2023

@shivamerla I tried v23.9.1 today, and repoConfig is still not mounted as an additional volume. I also tried certConfig, and it is not mounted either.
As for the GDS image tag, the OS info is appended correctly when the NVIDIADriver CRD is not enabled. However, when I enable the NVIDIADriver CRD, the OS info is not appended, which causes an image pull backoff.

@tariq1890
Contributor

@age9990 Can you share the pod YAML, the describe output, and the pod logs from when you try it with the NVIDIADriver CR?

@age9990
Author

age9990 commented Dec 12, 2023

@tariq1890 The Helm values.yaml and the driver pod YAML are attached.
values.txt
driver_pod.txt

@tariq1890
Contributor

Can you share the NVIDIADriver CR YAML? You need to make sure that the repoConfig field is set there, just as it is in the ClusterPolicy CR.

@age9990
Author

age9990 commented Dec 13, 2023

@tariq1890 repoConfig is present in both the ClusterPolicy CR and the NVIDIADriver CR. As you can see from the driver pod YAML file, it is mounted in the nvidia-driver-ctr and nvidia-peermem-ctr containers.
nvidiaDriver_cr.txt

@tariq1890
Contributor

tariq1890 commented Dec 15, 2023

Hey @age9990, thanks for bringing this to our notice. We have confirmed that there is a bug in how the GDS container image names are generated. We will publish the fix in the next planned release.

In the meantime, can you try this image?

--set driver.version=535.129.03
--set operator.repository=registry.gitlab.com/nvidia/kubernetes
--set operator.version=72678615-ubi8

@age9990
Author

age9990 commented Dec 16, 2023

@tariq1890 Thanks for fixing the image name issue. What about the repoConfig volumeMounts issue?
I'm not familiar with Go, but I see that the code used to get gdsContainer differs from the other functions.
There is no '&' in front of the expression, while the others have one.
gdsContainer := obj.Spec.Template.Spec.Containers[i]
https://github.com/NVIDIA/gpu-operator/blob/fd2b1587d5a8a7cd5a3b28afbf2be80d67d0d3d5/controllers/object_controls.go#L2462
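
For readers less familiar with Go, the following minimal sketch shows why a missing '&' can matter. It does not reproduce the operator's code: `Container`, `VolumeMount`, `addMountByValue`, and `addMountByPointer` are hypothetical stand-ins, and the mount name and path are illustrative. Whether this pattern is the actual cause of the missing mount is not confirmed in this thread.

```go
package main

import "fmt"

// Hypothetical stand-ins for the Kubernetes container and volume-mount types.
type VolumeMount struct {
	Name      string
	MountPath string
}

type Container struct {
	Name         string
	VolumeMounts []VolumeMount
}

// addMountByValue copies the slice element (no '&'), so the appended
// mount lives only on the local copy and is discarded on return.
func addMountByValue(containers []Container, name string, m VolumeMount) {
	for i := range containers {
		if containers[i].Name != name {
			continue
		}
		c := containers[i] // copy of the struct
		c.VolumeMounts = append(c.VolumeMounts, m)
	}
}

// addMountByPointer takes the element's address (with '&'), so the
// append updates the container stored in the slice.
func addMountByPointer(containers []Container, name string, m VolumeMount) {
	for i := range containers {
		if containers[i].Name != name {
			continue
		}
		c := &containers[i] // pointer to the element in the slice
		c.VolumeMounts = append(c.VolumeMounts, m)
	}
}

func main() {
	// Illustrative mount; the name and path are assumptions.
	mount := VolumeMount{Name: "repo-config", MountPath: "/etc/apt/sources.list.d"}
	containers := []Container{{Name: "nvidia-driver-ctr"}, {Name: "nvidia-fs-ctr"}}

	addMountByValue(containers, "nvidia-fs-ctr", mount)
	fmt.Println("after value copy:", len(containers[1].VolumeMounts)) // prints 0

	addMountByPointer(containers, "nvidia-fs-ctr", mount)
	fmt.Println("after pointer:", len(containers[1].VolumeMounts)) // prints 1
}
```

A value copy can still work if the modified copy is written back to the slice afterwards (`containers[i] = c`), so the missing '&' alone does not prove a bug.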

@age9990
Author

age9990 commented Jan 4, 2024

Hi @tariq1890, I see that the fixes for these issues have been merged into the master branch. Can we expect v23.9.2 to be released soon?

@cdesiniotis
Contributor

cdesiniotis commented May 2, 2024

Hi @age9990, GPU Operator 24.3.0 has been released and contains a fix for this issue.
https://github.com/NVIDIA/gpu-operator/releases/tag/v24.3.0

I am closing this issue, but please reopen it if you still encounter this with 24.3.0.
