gpu-feature-discovery, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset and nvidia-driver-daemonset are not removed after a GPU node is drained off from the cluster #584
Comments
@shivamerla @cdesiniotis Please advise on this.
@shnigam2 Can you share your GPU node YAML manifest?
@tariq1890 Please find the manifest of the GPU node while all NVIDIA pods are in the Running state:
How are you draining these nodes? Please ensure …
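For reference, a typical drain sequence for a GPU node is sketched below; the node name is a placeholder, not one taken from this issue.

```bash
# Hypothetical node name, for illustration only.
NODE=ip-10-0-0-1.ec2.internal

# Stop new pods from landing on the node, then evict the existing ones.
kubectl cordon "$NODE"
kubectl drain "$NODE" \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=60 \
  --timeout=5m
```

Note that drain deliberately skips DaemonSet-managed pods; those are expected to be garbage-collected by the control plane once the Node object itself is deleted.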
@tariq1890 We terminate the backing EC2 instance directly. Up to Kubernetes 1.24 that removed all of these NVIDIA pods, but on Kubernetes 1.26 these 4 pods still show as Running even though the underlying instance is already gone. Is there any parameter we need to pass for Kubernetes 1.26?
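One way to confirm the symptom is to check that the Node object is gone while the operand pods still reference it; a minimal sketch, assuming the operands run in the gpu-operator namespace (adjust if yours differs):

```bash
# Hypothetical node name, for illustration only.
NODE=ip-10-0-0-1.ec2.internal

# After the EC2 instance is terminated, the Node object should be gone.
kubectl get node "$NODE"

# Pods still bound to the now non-existent node; these are the ones the
# pod garbage collector in kube-controller-manager should clean up.
kubectl get pods -n gpu-operator -o wide --field-selector spec.nodeName="$NODE"
```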
@tariq1890 @cdesiniotis @shivamerla Please let me know how to fix this issue. The DaemonSet pods are not scaled down when the cluster autoscaler terminates a node. Node removal should ideally remove all NVIDIA DaemonSet pods, which is not happening in our case.
@shivamerla Could you please help us understand the cause of this behavior? We are using Flatcar as the worker node OS.
@shivamerla @tariq1890 @cdesiniotis Could you please help us fix this behaviour? Because of it, pods are unnecessarily shown in the namespace even though the node they ran on has already been scaled down.
@shnigam2 Can you provide logs from the Kubernetes controller-manager pod so we can check for errors while cleaning up these pods? Are you using images from a private registry (i.e. using pullSecrets)?
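For example, on a kubeadm-style cluster where the controller manager runs as a static pod labelled component=kube-controller-manager (an assumption; adjust the selector for your distribution), the relevant log lines could be pulled with something like:

```bash
# Look for pod garbage-collection or eviction errors around the NVIDIA pods.
kubectl -n kube-system logs -l component=kube-controller-manager --tail=500 \
  | grep -iE 'garbage|orphan|nvidia|imagepullsecret' || true
```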
@shivamerla Yes, we are using a private registry. Please find the controller-manager logs for errors:
@shivamerla Can you please check and help with this?
@shnigam2 We have a known issue which will be fixed in the next patch, v23.9.1 (later this month). The problem is that we are adding duplicate pullSecrets to the spec. You can avoid this by not specifying the pullSecret in …
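To check whether you are hitting that duplicate-pullSecrets issue, you can inspect the imagePullSecrets rendered into one of the affected DaemonSets; a small sketch, assuming the default gpu-operator namespace:

```bash
# If the same secret name appears more than once here, the DaemonSet spec
# is carrying the duplicate pullSecrets described above.
kubectl -n gpu-operator get daemonset nvidia-driver-daemonset \
  -o jsonpath='{.spec.template.spec.imagePullSecrets}{"\n"}'
```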
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
2. Issue or feature description
gpu-feature-discovery, nvidia-container-toolkit-daemonset, nvidia-device-plugin-daemonset and nvidia-driver-daemonset are not removed after the GPU node is drained off from the cluster. The description of these pods shows:
Logs of k8s-driver-manager before terminating the GPU node:
Values which we are passing to Helm:
Please let us know how to control this pod eviction when a GPU node is scaled down, as these pods still show as Running even after the GPU node has been removed from the cluster.
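Until a fix is in place, one possible interim cleanup is to force-delete operand pods that are still bound to nodes which no longer exist. This is only a sketch, assuming the operands live in the gpu-operator namespace; review its output before running it against a real cluster.

```bash
#!/usr/bin/env bash
# Force-delete GPU operator operand pods whose node no longer exists.
NS=gpu-operator   # assumed namespace; adjust to where the operands run

# Current node names, one per line.
nodes=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

# Each pod together with the node it is bound to, one pair per line.
kubectl get pods -n "$NS" \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.nodeName}{"\n"}{end}' |
while read -r pod node; do
  # Skip pods that have not been scheduled to a node yet.
  [ -z "$node" ] && continue
  if ! grep -qFx "$node" <<<"$nodes"; then
    echo "Force-deleting orphaned pod ${pod} (node ${node} no longer exists)"
    kubectl delete pod -n "$NS" "$pod" --force --grace-period=0
  fi
done
```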