I am primarily interested in this because, as I reported in #573 (comment), we have seen increased memory usage and instability on the NFD master as the number of provisioned and de-provisioned GPU nodes grows, which ultimately causes workload failures with an "unhealthy nvidia/gpu" error. (My current hypothesis is that the master enters a relabelling iteration on a node, removes the older labels and sets gpu.present=false, then dies (OOM-killed) before it gets to relabel the node correctly. That in turn kills the device plugin, because its node selector uses the gpu.deploy.device-plugin label that has just been removed, making every nvidia.com/gpu device unhealthy.)
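To illustrate why removing that label is so disruptive, here is a minimal, hypothetical sketch of the kind of nodeSelector a device-plugin DaemonSet carries (the nvidia.com/gpu.deploy.device-plugin key below follows the GPU Operator convention and is only illustrative, not copied from this issue):

```yaml
# Illustrative excerpt of a device-plugin DaemonSet spec (not from any
# particular chart). If NFD drops the deploy label during a failed
# relabelling pass, this selector stops matching, the device-plugin pod is
# evicted, and nvidia.com/gpu devices on the node are reported unhealthy.
spec:
  template:
    spec:
      nodeSelector:
        nvidia.com/gpu.deploy.device-plugin: "true"
```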
v0.14.0 was released last week, which includes garbage collection of NodeFeature objects for removed nodes - the Topology GC has been renamed and extended.
PR: kubernetes-sigs/node-feature-discovery#1305
Chart values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.14.0/docs/deployment/helm.md#garbage-collector-parameters
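For anyone else hitting this, a hedged values.yaml sketch for enabling the collector; the parameter names are taken from the garbage-collector table linked above, so please double-check them against the v0.14.0 docs before applying:

```yaml
# Assumed gc.* parameters from the v0.14.0 Helm docs linked above.
gc:
  enable: true   # deploy the nfd-gc component
  interval: 1h   # how often NodeFeature objects of removed nodes are cleaned up
```

This is applied with the usual helm upgrade --install of the node-feature-discovery chart.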