Description
NFD v0.14.0 was released last week. It includes garbage collection of NodeFeature objects for removed nodes (the Topology GC has been renamed and extended).
PR: kubernetes-sigs/node-feature-discovery#1305
Chart values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.14.0/docs/deployment/helm.md#garbage-collector-parameters
I am primarily interested in this because, as I reported in #573 (comment), we have seen increased memory usage and instability on the NFD master as the number of provisioned and de-provisioned GPU nodes grows. This ultimately causes workloads to fail with an "unhealthy nvidia/gpu" error. (My current hypothesis is that the master enters a relabelling iteration on a node, removes the older labels and marks gpu.present=false, then is OOM-killed before it can label the node correctly. That in turn kills the device plugin, because its node selector requires the gpu.deploy.device-plugin label, which has now been removed, and so any nvidia.com/gpu device becomes unhealthy.)
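To make the hypothesis concrete, here is a toy Python model of the sequence I suspect. It is only a sketch of my reasoning, not NFD's actual code: the `relabel` and `device_plugin_schedulable` functions, the `OOMKilled` exception, and the "true"/"false" label values are all hypothetical, and the label names are written as they appear above.

```python
# Toy model of the hypothesised failure; this is NOT how NFD is implemented.

class OOMKilled(Exception):
    """Stands in for the NFD master being OOM-killed mid-iteration."""


def relabel(node_labels, discovered, die_after_cleanup=False):
    # Step 1 (assumed): drop the labels from the previous iteration
    # and mark the GPU as absent.
    node_labels.pop("gpu.deploy.device-plugin", None)
    node_labels["gpu.present"] = "false"
    if die_after_cleanup:
        raise OOMKilled("killed before the new labels were written")
    # Step 2 (assumed): write the freshly discovered labels.
    node_labels.update(discovered)


def device_plugin_schedulable(node_labels):
    # The device plugin's node selector requires this label to be set.
    return node_labels.get("gpu.deploy.device-plugin") == "true"


labels = {"gpu.present": "true", "gpu.deploy.device-plugin": "true"}
discovered = dict(labels)  # what a complete iteration would write back

try:
    relabel(labels, discovered, die_after_cleanup=True)
except OOMKilled:
    pass  # the node is left with only the half-applied label set

print(labels)
print(device_plugin_schedulable(labels))
```

In this model the interrupted iteration leaves the node without the gpu.deploy.device-plugin label, so the device plugin can no longer be scheduled there. Calling `relabel(labels, discovered)` again without the simulated kill restores both labels, which would match the node recovering once the master survives a full iteration.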