GPU Operator crash loop due to missing CRDs #602
Comments
@6ixfalls Thanks for reporting this. Currently, due to a Helm limitation in handling CRDs with
Hi, thank you for the quick resolution. Would it be possible to document this in a more visible location? Currently, the documentation mentions this Helm limitation at the bottom of the Getting Started document, where it could easily be missed. Would it be possible to link to this notice in the release notes, to remind users about the updates? Even a separate document would be nice, as there is already a separate document for GPU Driver Upgrades.
@shivamerla Please make the tolerations of the driver CR customizable. They are currently fixed to the following toleration, but in our environment the GPU nodes are tainted with other key-value pairs, which makes the new driver daemonset unschedulable on those nodes.
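For illustration, a minimal sketch of the scenario described above; the node name, taint key, and namespace below are hypothetical and are not the actual fixed toleration:

```bash
# Hypothetical example: GPU nodes carry a site-specific taint that the
# hard-coded driver toleration does not match, so the driver pods stay Pending.
kubectl taint nodes gpu-node-1 example.com/gpu-pool=training:NoSchedule

# Inspect which tolerations the operator rendered on its daemonsets
# (namespace "gpu-operator" is an assumption).
kubectl -n gpu-operator get daemonsets \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.tolerations}{"\n"}{end}'
```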
I'm having trouble managing the CRD installation via
Since these errors look like failed OpenAPI validations, I guess disabling CRD installation in the Argo application wouldn't do the job here.
Thanks @age9990 for reporting this; we will fix it in the patch release v23.9.1, which is planned for early November.
Yes, we are working on this. So far the steps have been updated, but we will move them to make them more visible.
@oavner CRD handling is currently a pain during upgrades with Helm (or dependent tools). Is it not possible to apply the CRDs manually in your environment before the gpu-operator upgrade? We will look into other ways to resolve this dependency (possibly moving the CRDs into a separate chart).
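As a rough sketch of that manual step, assuming the release and namespace are both named gpu-operator and that the chart bundles its CRDs in a crds/ directory (adjust to your environment):

```bash
# Hypothetical pre-upgrade step: pull the target chart, apply the CRDs it bundles
# so the operator can find nvidia.com/v1alpha1 resources at startup, then upgrade.
helm pull nvidia/gpu-operator --version v23.9.0 --untar
kubectl apply -f gpu-operator/crds/
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.9.0
```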
Thanks for the quick response @shivamerla!! It is not possible for us to manually apply CRDs: we'd like to deploy the operator to hundreds of environments, so we need a fully automated solution that can be reproduced easily. We could go with an ephemeral approach at the cost of application downtime, redeploying the entire environment or just the entire operator, but we cannot afford to run a shadow replica of the same environment with an upgraded chart version and then reroute traffic to it to create a zero-downtime solution for chart upgrades (like running a simple ...).

We deploy other operators, such as the RabbitMQ operator, the MetalLB operator, ingress, etc., by separating CRs and CRDs into different charts, but that is still not really good practice for every controller type. It works for ingress because its CRDs are managed by the administrator in a dedicated "infrastructure chart", while the CRs are self-serviced and used in "applicative charts" (actual company-specific micro-services that are separated from the ingress controller chart and managed by a developer).

I actually loved your approach of managing the CRDs via Helm hooks and a k8s job, so I opened an issue with ArgoCD as well about these OpenAPI validations. I've also encountered another approach besides managing CRDs via Helm hooks or separate charts: not breaking APIs. Instead of managing versions of the same CRD, some projects just create new CRDs from scratch. This might affect the k8s API server if old APIs are not deprecated properly, but it might be easier to manage.

Does OLM solve this problem? I know you widely support OpenShift, and I wonder whether OpenShift and OLM are able to perform zero-downtime chart and CRD upgrades. Either separating the CRDs into a different chart or supporting zero-downtime chart upgrades with OLM would be awesome. I'd just love to get this feature and would also be glad to help reach this goal because of how critical it is to our customers ❤️
@shivamerla Two issues to report
@age9990 we have called out this limitation in the note here. This is a tech preview feature, so we do not support "upgrade" from existing installs. Also because, when we enable
@oavner I am not too familiar with ArgoCD, but I learnt that it applies manifests directly (i.e. with
Thanks @shivamerla, yes, I did try it with
Ah, got it, so schema validation is still an issue, as that runs before the manifests are applied. Since CRDs are included by default with ArgoCD with every
As the original issue of this post has been resolved, this issue will be closed; please create a new issue if you have any further problems.
1. Quick Debug Information
2. Issue or feature description
Upgrading the gpu-operator to v23.9.0 should not leave the gpu-operator pod stuck in a crash loop. The error
failed to get API group resources: unable to retrieve the complete list of server APIs: nvidia.com/v1alpha1: the server could not find the requested resource
repeats numerous times in the container logs before the container stops with the error
failed to wait for nvidia-driver-controller caches to sync: timed out waiting for cache to be synced for Kind *v1alpha1.NVIDIADriver
This looks like a regression from the new GPU Driver Custom Resource Definition, which, when not deployed, causes the operator to not function properly.
3. Steps to reproduce the issue
Install gpu-operator v23.6.1, upgrade the Helm chart to v23.9.0 and observe the gpu-operator pod in a crash loop.
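A rough sketch of these steps, assuming the public NVIDIA Helm repository and a release/namespace both named gpu-operator (adjust to your environment):

```bash
# Install the previous version, then upgrade in place and watch the operator pod.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace --version v23.6.1
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --version v23.9.0
kubectl -n gpu-operator get pods -w   # gpu-operator pod enters CrashLoopBackOff
```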
gpu-operator-6fdbc66bd4-k82lb_gpu-operator.log