gpu driver is in init state after rebooting the gpu node #566

Open

alloydm opened this issue Aug 9, 2023 · 3 comments

1. Quick Debug Information

  • OS/Version: RHEL 8.8
  • Kernel Version: 4.18.0-477.15.1.el8_8.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s (v1.24.12)
  • GPU Operator Version: 23.3.2

2. Issue or feature description

I have a Kubernetes cluster with the GPU Operator (23.3.2) installed on a Tesla P4 GPU node. I am running a Kubeflow-based Jupyter notebook that consumes the GPU node; the notebook pod (a statefulset as its replication controller) also has persistent volume claims attached to it.
Whenever the GPU node is rebooted, the driver-daemonset pod gets stuck in the Init stage: the k8s-driver-manager (init container) is stuck evicting the Kubeflow Jupyter notebook pod. Only when we forcefully delete the notebook pod does the driver daemonset go ahead with execution:
kubectl delete pod jupyter-nb --force --grace-period=0
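
After the reboot, the sequence we observe looks roughly like this (pod names are from our cluster; the notebook runs in the admin namespace):

  # driver daemonset pod stays in the Init phase after the node comes back
  kubectl get pods -n gpu-operator

  # the k8s-driver-manager init container logs show it waiting on the notebook pod eviction
  kubectl logs nvidia-driver-daemonset-stmk7 -n gpu-operator -c k8s-driver-manager

  # only after force-deleting the notebook pod does the driver install proceed
  kubectl delete pod jupyter-nb -n admin --force --grace-period=0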

(screenshots of the stuck driver daemonset attached)

I have attached the environment variables that I have set for the k8s-driver-manager container (screenshot attached).

3. Steps to reproduce the issue

  1. Create a Kubernetes cluster with RHEL 8.8 and deploy the GPU Operator (23.3.2) using Helm (see the sketch after this list).
  2. Create a Kubeflow-based Jupyter notebook statefulset on the GPU node, consuming a persistent volume claim and utilising the GPU.
  3. Once the notebook pod is up and running on the GPU node, reboot that GPU node.
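
A sketch of step 1, assuming the standard NVIDIA Helm repo and chart name (double-check the chart version that corresponds to 23.3.2):

  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update
  # install the GPU Operator into its own namespace
  helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --version v23.3.2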

4. Information to attach (optional if deemed irrelevant)

  • GPU pod status: kubectl get po -n gpu-operator
  • driver-daemonset logs: kubectl logs nvidia-driver-daemonset-stmk7 -n gpu-operator -f -c k8s-driver-manager
  • Kubeflow pod status: kubectl get pod -n admin
@shivamerla
Contributor

@alloydm thanks for reporting this. When the nvidia driver modules are not loaded (the reboot scenario), we can avoid evicting user GPU pods. Will address this in the next patch release.

@shivamerla
Contributor

@alloydm there are a couple of ways this can be mitigated (see the Helm sketch after this list).

  1. Enable the driver upgrade controller with driver.upgradePolicy.autoUpgrade=true. In that case the initContainer will not handle GPU pod eviction; the upgrade controller within the operator will. This is triggered only during driver daemonset spec updates, not on host reboot.
  2. Disable "ENABLE_GPU_POD_EVICTION" with the driver manager. With this disabled, on node reboot, since no driver is loaded, we do not attempt GPU pod eviction or node drain. But in cases where the driver container restarts abruptly, it will not evict GPU pods and will be stuck in a crash loop.

We will add a fix to avoid nvdrain during the node reboot case.
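
For reference, the two mitigations expressed as Helm overrides - a sketch only; driver.manager.env is where the driver-manager environment variables are normally set in values.yaml, but verify against your chart version:

  # Mitigation 1: hand GPU pod eviction to the upgrade controller instead of the init container
  helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values \
    --set driver.upgradePolicy.autoUpgrade=true

  # Mitigation 2: disable GPU pod eviction in the k8s-driver-manager
  helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values \
    --set "driver.manager.env[0].name=ENABLE_GPU_POD_EVICTION" \
    --set-string "driver.manager.env[0].value=false"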

@alloydm
Author

alloydm commented Oct 16, 2023

@shivamerla
We don't want to set autoUpgrade to true.
Disabling "ENABLE_GPU_POD_EVICTION" in the driver manager - we tried this, but since the notebook is a statefulset-controlled pod, it goes to the Terminating stage when the node goes down and then stays in Terminating forever.

I am attaching the Kubernetes doc that explains why this happens:
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

We are not hitting this issue during upgrade, as there is an option in the driver env to forcefully delete user GPU pods. Can we have that force-delete env for this (reboot) case too? A rough sketch of the ask is below.
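
To make the request concrete, a hypothetical sketch of what such an option could look like (the env var name below does not exist today; it only illustrates the ask):

  # hypothetical: opt in to force-deleting user GPU pods when the driver manager evicts them
  helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values \
    --set "driver.manager.env[0].name=EVICTION_USE_FORCE_DELETE" \
    --set-string "driver.manager.env[0].value=true"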
