Description
NFD v0.14.0 was released last week. It includes garbage collection of NodeFeature objects for removed nodes: the Topology GC has been renamed and extended to cover them.
PR: kubernetes-sigs/node-feature-discovery#1305
Chart values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.14.0/docs/deployment/helm.md#garbage-collector-parameters
I am primarily interested in this because, as I reported in #573 (comment), we have seen increased memory usage and instability in the NFD master as the number of provisioned and de-provisioned GPU nodes grows, which ultimately causes workloads to fail with an "unhealthy nvidia/gpu" error. My current hypothesis is that the NFD master starts a relabelling iteration on a node, removes the older labels, and marks gpu.present=false, but is OOM-killed before it can label the node correctly. Because the device plugin has a node selector on the gpu.deploy.device-plugin label, which has now been removed, the device plugin is killed, and every nvidia.com/gpu device becomes unhealthy.