Nodes stuck in upgrade #626

Open
LarsAC opened this issue Dec 4, 2023 · 7 comments

LarsAC commented Dec 4, 2023

In a Rancher-provisioned bare-metal cluster I have two GPU nodes that cannot finish an upgrade; their statuses are validation-required and pod-restart-required.

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version: 5.15.0-89-generic
  • Container Runtime: Docker 24.0
  • K8s Flavor: 1.24.17 (RKE)
  • GPU Operator Version: 23.9.0
  • Operator installation: helm with driver.enabled=false and toolkit.enabled=false

2. Issue or feature description

I had a node running the GPU operator. I then added another GPU node, but the operator did not successfully pick up the new node. I then manually upgraded the driver version and container toolkit. The two nodes now appear as "Ready", but the GPU operator status for the former node is now "validation-required" and for the new node it is "pod-restart-required". I currently have no GPU workloads in my cluster.

Are there any more manual steps needed to complete the upgrade?
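
For what it is worth, the statuses above look like values of a node label maintained by the operator's upgrade controller. A minimal sketch for inspecting (and, only for a node that is known to be healthy, resetting) that state, assuming the label key is nvidia.com/gpu-driver-upgrade-state:

# Show the per-node upgrade state (label key assumed)
kubectl get nodes -L nvidia.com/gpu-driver-upgrade-state

# Only after confirming the driver and toolkit work on the node, reset the
# state so the upgrade controller re-runs the cycle for that node
kubectl label node <node-name> nvidia.com/gpu-driver-upgrade-state=upgrade-required --overwrite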

4. Information to attach (optional if deemed irrelevant)

  • [x] Kubernetes pods status: kubectl get pods -n gpu-operator
gpu-feature-discovery-ndl8l                                  0/1     Init:0/1   0             4m2s
gpu-operator-58bdb8567f-n2bkv                                1/1     Running    3 (12h ago)   13h
gpu-operator-node-feature-discovery-gc-766df9cf89-796dd      1/1     Running    2 (12h ago)   13h
gpu-operator-node-feature-discovery-master-dcb5c5d74-5hf4m   1/1     Running    3 (12h ago)   13h
gpu-operator-node-feature-discovery-worker-54qql             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-54zx6             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-8pbzw             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-k9npn             1/1     Running    0             13h
gpu-operator-node-feature-discovery-worker-swr78             1/1     Running    3 (12h ago)   13h
nvidia-dcgm-exporter-44ln8                                   0/1     Init:0/1   0             13h
nvidia-device-plugin-daemonset-jpn9s                         0/1     Init:0/1   0             13h
nvidia-operator-validator-lc9x6                              0/1     Init:0/4   0             13h
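
To see exactly which init container each stuck pod is waiting on, a jsonpath query along these lines can help (pod name taken from the listing above):

kubectl -n gpu-operator get pod nvidia-operator-validator-lc9x6 \
  -o jsonpath='{range .status.initContainerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'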

The logs of the toolkit-validation container in, e.g., pod nvidia-device-plugin-daemonset-jpn9s contain many repetitions of
waiting for nvidia container stack to be setup.

In addition, the logs of the driver-validation container in pod nvidia-operator-validator-lc9x6 show suspicious messages like this:

running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
command failed, retrying after 5 seconds
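
Since the operator runs with driver.enabled=false, a quick host-side look at what the validator has written so far may help narrow this down (path taken from the error above):

# On the affected node: which validation marker files exist so far?
ls -la /run/nvidia/validations/ 2>/dev/null || echo "no validations directory yet"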

On the nodes themselves, nvidia-smi reports the card and driver fine:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce GTX 1070        On  | 00000000:1C:00.0 Off |                  N/A |
|  0%   33C    P8              12W / 170W |      1MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
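
For completeness, the ClusterPolicy can be checked to confirm the operator really is configured for a host-installed driver and toolkit (the resource name cluster-policy is the Helm chart default; adjust if yours differs):

kubectl get clusterpolicy cluster-policy \
  -o jsonpath='{.spec.driver.enabled}{" "}{.spec.toolkit.enabled}{"\n"}'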

@tariq1890 (Contributor)

What is the output of which nvidia-smi when you run it on your node?

LarsAC commented Dec 4, 2023

It is at /usr/bin/nvidia-smi (on both nodes).
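
In case it is useful, a few host-side checks that seem relevant with toolkit.enabled=false and Docker as the runtime (a sketch; flags assume a reasonably recent Docker and the nvidia-container-toolkit installed on the host):

docker info --format '{{.DefaultRuntime}} {{json .Runtimes}}'   # is "nvidia" registered / the default?
cat /etc/docker/daemon.json                                     # runtime wiring on the host
nvidia-container-cli --version                                  # is the toolkit actually installed?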

tariq1890 commented Dec 15, 2023

Sorry for the late response, @LarsAC. Can you please run the must-gather.sh script and send the generated artifacts over to [email protected]?

LarsAC commented Dec 22, 2023

Thanks, archive sent.

@FanKang2021

I have encountered the same problem. Have you solved it?

LarsAC commented Apr 16, 2024

Unfortunately not, but I have not tried further as I am out of ideas. I will probably reinstall the OS from scratch and try again.

umeshvw commented Aug 12, 2024

@tariq1890 / Team, did we find a solution for the above issue? We have also encountered the same problem, where nvidia-gpu-operator pods were stuck in the Init state during an OCP cluster upgrade. We were getting the errors below.

ConfigMap-related error:
2024-08-02T07:59:48.556383861Z E0802 07:59:48.556378 1 reflector.go:150] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:106: Failed to watch *v1.ConfigMap: failed to list *v1.ConfigMap: configmaps is forbidden: User "system:serviceaccount:nvidia-gpu-operator:gpu-operator" cannot list resource "configmaps" in API group "" at the cluster scope

We were also receiving the error below:
==> nvidia-operator-validator-zf87t
command failed, retrying after 5 seconds
running command bash with args [-c stat /run/nvidia/validations/.driver-ctr-ready]
stat: cannot statx '/run/nvidia/validations/.driver-ctr-ready': No such file or directory
time="2024-08-02T06:36:20Z" level=info msg="Driver is not pre-installed on the host. Checking driver container status."
time="2024-08-02T06:36:20Z" level=info msg="version: 0fe1e8d, commit: 0fe1e8d"

We created a ClusterRole and ClusterRoleBinding for nvidia-gpu-operator, and it looks like that resolved the issue, but we don't know what caused it or what the exact resolution is. Do you have any idea about this?
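
A minimal sketch of the kind of RBAC objects described above, using kubectl create (the names are hypothetical, and the verb/resource list covers only what the error message itself mentions; the operator may need more):

kubectl create clusterrole gpu-operator-configmap-reader \
  --verb=get,list,watch --resource=configmaps

kubectl create clusterrolebinding gpu-operator-configmap-reader \
  --clusterrole=gpu-operator-configmap-reader \
  --serviceaccount=nvidia-gpu-operator:gpu-operator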
