I am primarily interested in this because, as I reported in #573 (comment), we have seen increased memory usage and instability on the NFD master as the number of provisioned and de-provisioned GPU nodes grows, which ultimately causes workload failures with an "unhealthy nvidia/gpu" error. (My current hypothesis is that the master enters a relabelling iteration on a node, removes the older labels and sets gpu.present=false, then dies (OOM-killed) before it gets to relabel the node correctly. That in turn kills the device plugin, because its node selector uses the gpu.deploy.device-plugin label that has just been removed, making every nvidia.com/gpu device unhealthy.)
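To illustrate why removing that label is so disruptive, here is a minimal, hypothetical sketch of the kind of nodeSelector a device-plugin DaemonSet carries (the nvidia.com/gpu.deploy.device-plugin key below follows the GPU Operator convention and is only illustrative, not copied from this issue):

```yaml
# Illustrative excerpt of a device-plugin DaemonSet spec (not from any
# particular chart). If NFD drops the deploy label during a failed
# relabelling pass, this selector stops matching, the device-plugin pod is
# evicted, and nvidia.com/gpu devices on the node are reported unhealthy.
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.deploy.device-plugin: "true"
```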
v0.14.0 was released last week, which includes garbage collection of NodeFeature objects for removed nodes - the Topology GC has been renamed and extended.
PR: kubernetes-sigs/node-feature-discovery#1305
Chart values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.14.0/docs/deployment/helm.md#garbage-collector-parameters
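For anyone else hitting this, a hedged values.yaml sketch for enabling the collector; the parameter names are taken from the garbage-collector table linked above, so please double-check them against the v0.14.0 docs before applying:

```yaml
# Assumed gc.* parameters from the v0.14.0 Helm docs linked above.
gc:
  enable: true   # deploy the nfd-gc component
  interval: 1h   # how often NodeFeature objects of removed nodes are cleaned up
```

This is applied with the usual helm upgrade --install of the node-feature-discovery chart.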