
GPU Operator crash loop due to missing CRDs #602

Closed
6ixfalls opened this issue Oct 24, 2023 · 14 comments

Comments

@6ixfalls

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04.3 LTS
  • Kernel Version: 5.15.0-87-generic x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3S
  • GPU Operator Version: v23.9.0

2. Issue or feature description

After upgrading the gpu-operator to v23.9.0, the gpu-operator pod is stuck in a crash loop. The error "failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource" repeats numerous times in the container logs before the container stops with the error "failed to wait for nvidia-driver-controller caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.NVIDIADriver". This looks like a regression introduced by the new GPU Driver Custom Resource Definition: when that CRD is not deployed, the operator does not function properly.

3. Steps to reproduce the issue

Install gpu-operator v23.6.1, upgrade the Helm chart to v23.9.0, and observe the gpu-operator pod entering a crash loop.
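For reference, a minimal reproduction might look like the following sketch. The release name, namespace, and Helm repo alias are illustrative, and the chart is assumed to come from NVIDIA's public Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator -n gpu-operator --create-namespace --version v23.6.1
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.9.0
kubectl get pods -n gpu-operator                      # gpu-operator pod goes into CrashLoopBackOff
kubectl logs -n gpu-operator deploy/gpu-operator      # shows the nvidia.com/v1alpha1 cache sync errors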

gpu-operator-6fdbc66bd4-k82lb_gpu-operator.log

6ixfalls changed the title from "GPU Operator" to "GPU Operator crash loop due to missing CRDs" on Oct 24, 2023
@shivamerla
Contributor

shivamerla commented Oct 24, 2023

@6ixfalls Thanks for reporting this. Due to a Helm limitation in handling CRDs during upgrade, all CRDs that are part of the chart currently have to be installed/upgraded manually. Below are the options.

  1. Manually install/upgrade the CRDs using the steps here. This doc needs to be updated to also include the kubectl apply -f https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/release-23.9/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml command; we will fix the docs.
  2. Pass the upgrade option --set operator.upgradeCRD=true to have all CRDs upgraded/installed automatically (example below).
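For example, the two options look roughly like this sketch (the release name and namespace are assumptions, adjust them to your install; --disable-openapi-validation is mentioned later in this thread for the hook-based path):

# Option 1: apply the new CRD manually, then upgrade the chart
kubectl apply -f https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/release-23.9/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.9.0

# Option 2: let the chart's upgrade hook install/upgrade all CRDs
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.9.0 --set operator.upgradeCRD=true --disable-openapi-validation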

@6ixfalls
Author

Hi, thank you for the quick resolution. Would it be possible to document this in a more visible location? Currently, the documentation only mentions this Helm limitation at the bottom of the Getting Started document, which could easily be missed. Would it be possible to link to this notice from the release notes, to remind users about the change? Even a separate document would be nice, as there is already a separate document for GPU Driver Upgrades.

@age9990

age9990 commented Oct 26, 2023

@shivamerla Please make the tolerations of the driver CR customizable. They are currently fixed to the toleration below, but in our environment the GPU nodes are tainted with other key-value pairs, so the new driver daemonset is not schedulable onto those GPU nodes (see the example after the snippet).
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
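For example, nodes tainted with a site-specific key (the key and value below are purely illustrative) would need an extra toleration like the following on the driver daemonset, which the fixed list above does not allow:

tolerations:
- key: example.com/dedicated-gpu    # hypothetical custom taint key used in this environment
  operator: Equal
  value: "true"
  effect: NoSchedule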

@oavner

oavner commented Oct 26, 2023

I'm having trouble managing the CRD installation via ArgoCD.
Your documentation recommends installing the CRDs either manually, via OLM, or by using Helm hooks (adding the --disable-openapi-validation option to the Helm installation).
Since I'm not using OpenShift but plain K8s, OLM does not come out of the box for me, which means my only real options are either deploying and maintaining OLM, or somehow making ArgoCD skip OpenAPI validation and installing the CRDs via Helm hooks.
I tried skipping the OpenAPI validation by disabling CRD installation in the Argo Application (spec.helm.skipCrds=true), but I still get schema errors when upgrading the chart version (replacing the CRDs).
I tested an upgrade from both v23.3.2 to v23.6.0 and from v23.6.1 to v23.9.0, and I got errors similar to this one:

error calculating structured merge diff: error building typed value from config resource: .spec.psa: field not declared in schema

Since these errors look like failed OpenAPI validations, I guess disabling CRD installation in the Argo Application won't do the job here.
I don't know of a supported way to natively disable OpenAPI validation in ArgoCD, so currently my only option is deleting the entire GPU operator in order to upgrade some driver versions (for example, driver version 535-104-12 is only supported in chart versions >= v23.6.1).
This means that managing the gpu-operator with ArgoCD will always imply downtime for our application. It is critical to have a zero-downtime upgrade path that is also supported by tools other than the plain Helm binary or a full OLM setup, since lots of production teams want to manage deployments with wrapper tools around Helm (Argo, for example) without coupling to an encapsulated or over-engineered solution such as OLM (or OpenShift).
How do you suggest tackling this? Are you working on a similar solution? Should I create a PR supporting ArgoCD and not only the classic Helm binary? Do you know of other simple solutions to this scenario? Let me know :)
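For context, the Argo side of this looks roughly like the sketch below (repo URL, names, and versions are illustrative; in an ArgoCD Application the flag sits at spec.source.helm.skipCrds):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: gpu-operator
  namespace: argocd
spec:
  project: default
  destination:
    server: https://kubernetes.default.svc
    namespace: gpu-operator
  source:
    repoURL: https://helm.ngc.nvidia.com/nvidia
    chart: gpu-operator
    targetRevision: v23.9.0
    helm:
      skipCrds: true        # skip Helm-managed CRD installation; schema errors still appear on upgrade
  syncPolicy:
    automated: {}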

@shivamerla
Contributor

In reply to @age9990's request above to make the driver CR toleration customizable:

Thanks @age9990 for reporting this; we will fix this in the patch release v23.9.1, which is planned for early November.

@shivamerla
Contributor

In reply to @6ixfalls's request above to document the CRD upgrade steps more visibly:

Yes, we are working on this. So far the steps have been updated, but we will move them to a more visible location.

@shivamerla
Contributor

In reply to @oavner's comment above about managing the CRD installation via ArgoCD:

@oavner CRD handling is currently a pain during upgrades with Helm (or tools built on it). Is it not possible to apply the CRDs manually in your environment before upgrading gpu-operator? We will look into other ways to resolve this dependency (possibly moving the CRDs into a separate chart).

@oavner

oavner commented Oct 29, 2023

Thanks for the quick response @shivamerla!

It is not possible to manually apply the CRDs, since we'd like to deploy the operator to hundreds of environments, so we need a fully automated solution that can be reproduced easily.

We could go with an ephemeral approach at the cost of application downtime, redeploying the entire environment or just the entire operator, but we cannot afford to run a shadow replica of the same environment with an upgraded chart version and then reroute traffic to the upgraded environment to get zero-downtime chart upgrades (the way a simple Deployment object does with maxSurge and maxUnavailable, for example).

We deploy other operators such as the rabbitmq operator, the metallb operator, ingress, etc. by separating CRs and CRDs into different charts, but it's still not really good practice for every controller type; it works for ingress because its CRDs are managed by the administrator in a dedicated "infrastructure chart" while the CRs are self-serviced and used in "applicative charts" (actual company-specific microservices that are separate from the ingress controller chart and managed by a developer).
In this case (the gpu-operator case) the administrator manages both the CRDs and the CRs, so separating them into different charts would solve our problem in a way, but would also create overhead (reasonable overhead, but not yet a best practice).

I actually liked your approach of managing the CRDs via Helm hooks and a Kubernetes Job, so I opened an issue with ArgoCD as well about these OpenAPI validations. I've also seen another approach besides managing CRDs via Helm hooks or separate charts: not breaking APIs. Instead of managing versions of the same CRD, some projects just create new CRDs from scratch. This might affect the Kubernetes API server if old APIs are not deprecated properly, but it might be easier to manage.

Does OLM solve this problem? I know you widely support OpenShift, and I wonder whether OpenShift and OLM are able to perform zero-downtime chart and CRD upgrades. Either separating the CRDs into a different chart or supporting zero-downtime chart upgrades with OLM would be awesome. I'd love to get this feature and would also be glad to help reach this goal because of how critical it is to our customers ❤️

@age9990

age9990 commented Oct 31, 2023

@shivamerla Two issues to report:

  1. When I upgrade to v23.9.0 with nvidiaDriverCRD and deployDefaultCR set to true, even if I set operator.upgradeCRD=true and pass the --disable-openapi-validation flag, the helm upgrade still fails with a "CRD not found" error message.
    I noticed the pre-upgrade hook has not executed at that point, so indeed the new CRD is not installed.
    I can manually install the new CRD nvidiadrivers.nvidia.com to work around this issue (see the sketch after this list); I just hope for a fully automated upgrade process.
  2. With operator.cleanupCRD=true, the CRD nvidiadrivers.nvidia.com is not deleted after uninstalling the chart.
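For reference, the manual workaround from item 1 looks roughly like this sketch (release name and namespace are assumptions, and the value names follow the comment above, so verify them against the chart's values.yaml for your version):

# install the new CRD first, since the chart's pre-upgrade hook has not run at this point
kubectl apply -f https://gitlab.com/nvidia/kubernetes/gpu-operator/-/raw/release-23.9/deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
# then upgrade with the NVIDIADriver CRD feature enabled
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.9.0 \
  --set nvidiaDriverCRD.enabled=true --set nvidiaDriverCRD.deployDefaultCR=true \
  --set operator.upgradeCRD=true --disable-openapi-validation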

@shivamerla
Contributor

In reply to @age9990's two issues above (the helm upgrade failing with "CRD not found" despite operator.upgradeCRD=true, and the CRD not being removed with operator.cleanupCRD=true):

@age9990 We have called out this limitation in the note here. This is a tech-preview feature, so we do not support "upgrade" into it from existing installs, also because, when the default NVIDIADriver CR is enabled, all running driver pods managed by ClusterPolicy will be cleaned up. We want to handle that more gracefully when we GA this feature. Regarding cleanupCRD, it was intentional not to clean up the NVIDIADriver CRD, as there can be CRs created by the user. We don't clean up those CRs, and their driver pods will continue to run; the user has to clean up those CRs manually.
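For completeness, the manual cleanup after uninstalling the chart would look roughly like this (standard kubectl commands; deleting the CRD also removes any remaining CRs and, with them, their driver pods):

# list any user-created NVIDIADriver CRs that are still running drivers
kubectl get nvidiadrivers.nvidia.com
# delete the CRs you no longer want, then optionally the CRD itself
kubectl delete nvidiadrivers.nvidia.com --all
kubectl delete crd nvidiadrivers.nvidia.com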

@shivamerla
Contributor

shivamerla commented Nov 3, 2023

In reply to @oavner's comment above about needing a fully automated, zero-downtime CRD upgrade path with ArgoCD:

@oavner I am not too familiar with ArgoCD, but I learned that it applies rendered manifests directly (i.e., the equivalent of helm template <chart> | kubectl apply -f -). Since every operation is a "sync", it should apply CRD changes automatically. Did you try with operator.upgradeCRD=false and verify that the CRDs are getting updated with every sync?
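For comparison, the render-and-apply flow described here is roughly equivalent to the following (chart reference, release name, and namespace are illustrative):

# render the chart including CRDs and apply the output directly, similar to an ArgoCD sync
helm template gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.9.0 --include-crds | kubectl apply -f -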

@oavner

oavner commented Nov 9, 2023

Thanks @shivamerla. Yes, I did try it with operator.upgradeCRD=false for versions 23.3.2, 23.6.0, 23.6.1, and 23.9.0; they all ended up as unsuccessful upgrades with errors showing "field not declared in schema". It's because we're deploying CRs of CRDs that do not yet exist in the cluster...

@shivamerla
Contributor

In reply to @oavner's report above that operator.upgradeCRD=false still results in "field not declared in schema" errors:

Ah, got it, so schema validation is still an issue, as it runs before the manifests are applied. Since CRDs are included by default with ArgoCD on every sync, I thought this would not be a problem. Let's see if we get any traction on the issue you raised here. We will also evaluate further whether to separate the CRDs into their own chart.

@6ixfalls
Author

As the original issue of this post has been resolved, this issue will be closed; please create a new issue if you run into any further problems.
