Nodes stuck in upgrade #626

Open
LarsAC opened this issue Dec 4, 2023 · 7 comments

LarsAC commented Dec 4, 2023

In a Rancher-provisioned bare-metal cluster I have two GPU nodes that cannot finish an upgrade; their statuses are validation-required and pod-restart-required.

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-89-generic
  • Container Runtime: Docker 24.0
  • K8s Flavor: 1.24.17 (RKE)
  • GPU Operator Version: 23.9.0
  • Operator installation: helm with driver.enabled=false and toolkit.enabled=false

2. Issue or feature description

I had a node running the GPU operator. I then added another GPU node, but the operator did not successfully pick up the new node. I then manually upgraded the driver version and container toolkit. The two nodes now appear as "Ready", but the GPU operator status for the former node is now "validation-required" and for the new node it is "pod-restart-required". I currently have no GPU workloads in my cluster.

Are there any more manual steps needed to complete the upgrade?
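
For what it is worth, the statuses above look like values of a node label maintained by the operator's upgrade controller. A minimal sketch for inspecting (and, only for a node that is known to be healthy, resetting) that state, assuming the label key is nvidia.com/gpu-driver-upgrade-state:

# Show the per-node upgrade state (label key assumed)
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state

# Only after confirming the driver and toolkit work on the node, reset the
# state so the upgrade controller re-runs the cycle for that node
kubectl label node <node-name> nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite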

4. Information to attach (optional if deemed irrelevant)

  • [x] Kubernetes pods status: kubectl get pods -n gpu-operator
gpu-feature-discovery-ndl8l                                  0/1     Init:0/1   0             4m2s
gpu-operator-58bdb8567f-n2bkv                                1/1     Running    3 (12h ago)   13h
gpu-operator-node-feature-discovery-gc-766df9cf89-796dd      1/1     Running    2 (12h ago)   13h
gpu-operator-node-feature-discovery-master-dcb5c5d74-5hf4m   1/1     Running    3 (12h ago)   13h
gpu-operator-node-feature-discovery-worker-54qql             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-54zx6             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-8pbzw             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-k9npn             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-swr78             1/1     Running    3 (12h ago)   13h
nvidia-dcgm-exporter-44ln8                                   0/1     Init:0/1   0             13h
nvidia-device-plugin-daemonset-jpn9s                         0/1     Init:0/1   0             13h
nvidia-operator-validator-lc9x6                              0/1     Init:0/4   0             13h
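
To see exactly which init container each stuck pod is waiting on, a jsonpath query along these lines can help (pod name taken from the listing above):

kubectl -n gpu-operator get pod nvidia-operator-validator-lc9x6 \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'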

The logs of the toolkit-validation container in, e.g., pod nvidia-device-plugin-daemonset-jpn9s contain many repetitions of
waiting for nvidia container stack to be setup.

In addition, the logs of the driver-validation container in pod nvidia-operator-validator-lc9x6 show suspicious messages like this:

running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
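
Since the operator runs with driver.enabled=false, a quick host-side look at what the validator has written so far may help narrow this down (path taken from the error above):

# On the affected node: which validation marker files exist so far?
ls -la /run/nvidia/validations/ 2>/dev/null || echo "no validations directory yet"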

On the nodes themselves, nvidia-smi reports the card and driver fine:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070        On  | 00000000:1C:00.0 Off |                  N/A |
|  0%   33C    P8              12W / 170W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
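
For completeness, the ClusterPolicy can be checked to confirm the operator really is configured for a host-installed driver and toolkit (the resource name cluster-policy is the Helm chart default; adjust if yours differs):

kubectl get clusterpolicy cluster-policy \
  -o jsonpath='{.spec.driver.enabled}{" "}{.spec.toolkit.enabled}{"\n"}'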

@tariq1890 (Contributor)

What is the output of which nvidia-smi when you run it on your node?

LarsAC commented Dec 4, 2023

It is at /usr/bin/nvidia-smi (on both nodes).
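
In case it is useful, a few host-side checks that seem relevant with toolkit.enabled=false and Docker as the runtime (a sketch; flags assume a reasonably recent Docker and the nvidia-container-toolkit installed on the host):

docker info --format '{{.DefaultRuntime}} {{json .Runtimes}}'   # is "nvidia" registered / the default?
cat /etc/docker/daemon.json                                     # runtime wiring on the host
nvidia-container-cli --version                                  # is the toolkit actually installed?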

tariq1890 commented Dec 15, 2023

Sorry for the late response, @LarsAC. Can you please run the must-gather.sh script and send the generated artifacts over to [email protected]?

LarsAC commented Dec 22, 2023

Thanks, archive sent.

@FanKang2021

I have encountered the same problem. Have you solved it?

LarsAC commented Apr 16, 2024

Unfortunately not, but I have not tried further as I am out of ideas. I will probably reinstall the OS from scratch and try again.

umeshvw commented Aug 12, 2024

@tariq1890 / Team, did we find a solution for the above issue? We have also encountered the same problem, where nvidia-gpu-operator pods were stuck in the Init state during an OCP cluster upgrade. We were getting the errors below.

ConfigMap-related error:
2024-08-02T07:59:48.556383861Z E0802 07:59:48.556378 1 reflector.go:150] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:nvidia-gpu-operator:gpu-operator" cannot list resource "configmaps" in API group "" at the cluster scope

We were also receiving the error below:
==> nvidia-operator-validator-zf87t
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
time="2024-08-02T06:36:20Z" level=info msg="Driver is not pre-installed on the host. Checking driver container status."
time="2024-08-02T06:36:20Z" level=info msg="version: 0fe1e8d, commit: 0fe1e8d"

We created a ClusterRole and ClusterRoleBinding for nvidia-gpu-operator, and it looks like that resolved the issue, but we don't know what caused it or what the exact resolution is. Do you have any idea about this?
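
A minimal sketch of the kind of RBAC objects described above, using kubectl create (the names are hypothetical, and the verb/resource list covers only what the error message itself mentions; the operator may need more):

kubectl create clusterrole gpu-operator-configmap-reader \
  --verb=get,list,watch --resource=configmaps

kubectl create clusterrolebinding gpu-operator-configmap-reader \
  --clusterrole=gpu-operator-configmap-reader \
  --serviceaccount=nvidia-gpu-operator:gpu-operator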
