Nodes stuck in upgrade #626
Comments
What is the output of
It is at
Sorry for the late response @LarsAC. Can you please run the
Thanks, archive sent.
I have encountered the same problem. Have you solved it?
Unfortunately not, but I have not tried further as I am out of ideas. I will probably install the OS from scratch and try again.
@tariq1890 / Team, did we find a solution for the above issues? We have also encountered the same issue, where the nvidia-gpu-operator pods were stuck in the Init state during an OCP cluster upgrade. We were getting the error below, related to config maps:
and we also received the error below:
We created a ClusterRole and ClusterRoleBinding for nvidia-gpu-operator, which appears to have resolved the issue, but we don't know what caused it or what the exact resolution is. Do you have any idea about this?
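For illustration only, a minimal sketch of the kind of ClusterRole/ClusterRoleBinding described above, assuming the operator's service account is named gpu-operator in the nvidia-gpu-operator namespace and that config-map access was the missing permission (all of these are assumptions; the RBAC actually shipped with the operator's Helm chart/OLM bundle is broader):

```sh
# Hypothetical RBAC objects granting the operator's service account access to config maps.
# Names, namespace, and the resource list are assumptions for illustration only.
kubectl create clusterrole nvidia-gpu-operator-cm \
  --verb=get,list,watch,create,update,patch \
  --resource=configmaps

kubectl create clusterrolebinding nvidia-gpu-operator-cm \
  --clusterrole=nvidia-gpu-operator-cm \
  --serviceaccount=nvidia-gpu-operator:gpu-operator
```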
In a Rancher-provisioned bare-metal cluster I have two GPU nodes that cannot finish the upgrade; their statuses are validation-required and pod-restart-required.
1. Quick Debug Information
driver.enabled=false and toolkit.enabled=false
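For context, these values are typically passed at install/upgrade time along the lines of the sketch below (the release name, namespace, and nvidia Helm repo alias are assumptions, not taken from this issue):

```sh
# Sketch: installing the GPU operator with the host-provided driver and toolkit,
# i.e. driver.enabled=false and toolkit.enabled=false as quoted above.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false
```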
2. Issue or feature description
I had a node running the GPU operator. I then added another GPU node, but the operator did not successfully pick up the new node. I then manually upgraded the driver version and container toolkit. The two nodes now appear as "Ready", but the GPU operator state for the former node is now "validation-required" and for the new node it is "pod-restart-required". I currently have no GPU workloads in my cluster.
Are there any more manual steps to be done to complete the upgrade?
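One way to see where each node sits in the upgrade flow is to look at the per-node upgrade-state label the operator maintains; a sketch follows, assuming the label key is nvidia.com/gpu-driver-upgrade-state (an assumption based on the state names quoted above, so verify it against your operator version):

```sh
# Show each node together with its driver-upgrade state label (assumed key).
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state

# Inspect a stuck node's labels, annotations, and recent events.
kubectl describe node <gpu-node-name>
```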
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n gpu-operator
Logs of the toolkit-validation container, in e.g. pod nvidia-device-plugin-daemonset-jpn9s, have lots of messages saying "waiting for nvidia container stack to be setup". In addition, logs of the driver-validation container in pod nvidia-operator-validator-lc9x6 instead have suspicious messages like this:

On the CLI of the nodes themselves, nvidia-smi reports the card and driver fine:
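For reference, the log excerpts above can be pulled with commands along these lines (the pod names are the ones quoted in this report and will differ in other clusters):

```sh
# Init-container logs from the device-plugin pod showing the
# "waiting for nvidia container stack to be setup" messages.
kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-jpn9s -c toolkit-validation

# Logs from the driver-validation container of the validator pod.
kubectl logs -n gpu-operator nvidia-operator-validator-lc9x6 -c driver-validation
```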