gpu driver is in init state after rebooting the gpu node #566

Open

alloydm opened this issue Aug 9, 2023 · 3 comments

1. Quick Debug Information

  • OS/Version: RHEL 8.8
  • Kernel Version: 4.18.0-477.15.1.el8_8.x86_64
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): k8s (v1.24.12)
  • GPU Operator Version: 23.3.2

2. Issue or feature description

I have a Kubernetes cluster with the GPU Operator (23.3.2) installed on a Tesla P4 GPU node. I am running a Kubeflow-based Jupyter notebook that consumes the GPU node; the notebook pod (a statefulset as its replication controller) also has persistent volume claims attached to it.
Whenever the GPU node is rebooted, the driver-daemonset pod gets stuck in the Init stage: the k8s-driver-manager (init container) is stuck evicting the Kubeflow Jupyter notebook pod. Only when we forcefully delete the notebook pod does the driver daemonset go ahead with execution:
kubectl delete pod jupyter-nb --force --grace-period=0
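
After the reboot, the sequence we observe looks roughly like this (pod names are from our cluster; the notebook runs in the admin namespace):

  # driver daemonset pod stays in the Init phase after the node comes back
  kubectl get pods -n gpu-operator

  # the k8s-driver-manager init container logs show it waiting on the notebook pod eviction
  kubectl logs nvidia-driver-daemonset-stmk7 -n gpu-operator -c k8s-driver-manager

  # only after force-deleting the notebook pod does the driver install proceed
  kubectl delete pod jupyter-nb -n admin --force --grace-period=0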

(screenshots of the stuck driver daemonset attached)

I have attached the environment variables that I have set for the k8s-driver-manager container (screenshot attached).

3. Steps to reproduce the issue

  1. Create a Kubernetes cluster with RHEL 8.8 and deploy the GPU Operator (23.3.2) using Helm (see the sketch after this list).
  2. Create a Kubeflow-based Jupyter notebook statefulset on the GPU node, consuming a persistent volume claim and utilising the GPU.
  3. Once the notebook pod is up and running on the GPU node, reboot that GPU node.
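
A sketch of step 1, assuming the standard NVIDIA Helm repo and chart name (double-check the chart version that corresponds to 23.3.2):

  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update
  # install the GPU Operator into its own namespace
  helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --version v23.3.2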

4. Information to attach (optional if deemed irrelevant)

  • GPU pod status: kubectl get po -n gpu-operator
  • driver-daemonset logs: kubectl logs nvidia-driver-daemonset-stmk7 -n gpu-operator -f -c k8s-driver-manager
  • Kubeflow pod status: kubectl get pod -n admin
@shivamerla
Contributor

@alloydm thanks for reporting this. When the nvidia driver modules are not loaded (the reboot scenario), we can avoid evicting user GPU pods. Will address this in the next patch release.

@shivamerla
Contributor

@alloydm there are a couple of ways this can be mitigated (see the Helm sketch after this list).

  1. Enable the driver upgrade controller with driver.upgradePolicy.autoUpgrade=true. In that case the initContainer will not handle GPU pod eviction; the upgrade controller within the operator will. This is triggered only during driver daemonset spec updates, not on host reboot.
  2. Disable "ENABLE_GPU_POD_EVICTION" with the driver manager. With this disabled, on node reboot, since no driver is loaded, we do not attempt GPU pod eviction or node drain. But in cases where the driver container restarts abruptly, it will not evict GPU pods and will be stuck in a crash loop.

We will add a fix to avoid nvdrain during the node reboot case.
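
For reference, the two mitigations expressed as Helm overrides - a sketch only; driver.manager.env is where the driver-manager environment variables are normally set in values.yaml, but verify against your chart version:

  # Mitigation 1: hand GPU pod eviction to the upgrade controller instead of the init container
  helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values \
    --set driver.upgradePolicy.autoUpgrade=true

  # Mitigation 2: disable GPU pod eviction in the k8s-driver-manager
  helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values \
    --set "driver.manager.env[0].name=ENABLE_GPU_POD_EVICTION" \
    --set-string "driver.manager.env[0].value=false"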

@alloydm
Author

alloydm commented Oct 16, 2023

@shivamerla
We don't want to set autoUpgrade to true.
Disabling "ENABLE_GPU_POD_EVICTION" in the driver manager - we tried this, but since the notebook is a statefulset-controlled pod, it goes to the Terminating stage when the node goes down and then stays in Terminating forever.

I am attaching the Kubernetes doc that explains why this happens:
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

We are not hitting this issue during upgrade, as there is an option in the driver env to forcefully delete user GPU pods. Can we have that force-delete env for this (reboot) case too? A rough sketch of the ask is below.
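
To make the request concrete, a hypothetical sketch of what such an option could look like (the env var name below does not exist today; it only illustrates the ask):

  # hypothetical: opt in to force-deleting user GPU pods when the driver manager evicts them
  helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator \
    --reuse-values \
    --set "driver.manager.env[0].name=EVICTION_USE_FORCE_DELETE" \
    --set-string "driver.manager.env[0].value=true"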
