
Update Node Feature Discovery to 0.14.x to enable NodeFeature GC #580

Closed
@chiragjn


v0.14.0 was released last week. It includes garbage collection for NodeFeature objects belonging to removed nodes; the former Topology GC has been renamed and extended to cover this.

PR: kubernetes-sigs/node-feature-discovery#1305
Chart values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.14.0/docs/deployment/helm.md#garbage-collector-parameters
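
For illustration only, the idea behind NodeFeature garbage collection can be sketched as a set difference: any NodeFeature object that refers to a node no longer present in the cluster is a candidate for deletion. This is a minimal sketch under stated assumptions, not the NFD implementation; the function name `find_orphaned_node_features` and the plain-dict input are hypothetical stand-ins for what would really come from the Kubernetes API.

```python
def find_orphaned_node_features(node_names, node_features):
    """Return the names of NodeFeature objects whose node no longer exists.

    node_names:    iterable of node names currently present in the cluster.
    node_features: mapping of NodeFeature object name -> name of the node it
                   describes (hypothetical input shape, assumed for this sketch).
    """
    live_nodes = set(node_names)
    # An object is orphaned when the node it describes is not in the live set.
    return sorted(
        name for name, node in node_features.items() if node not in live_nodes
    )


if __name__ == "__main__":
    # Hypothetical cluster state: gpu-node-b was de-provisioned,
    # but its NodeFeature object was left behind.
    nodes = ["gpu-node-a"]
    features = {"gpu-node-a": "gpu-node-a", "gpu-node-b": "gpu-node-b"}
    print(find_orphaned_node_features(nodes, features))
```

In a cluster that churns through many GPU nodes, the leftover set keeps growing until something deletes it, which is the accumulation this issue asks the upgrade to address.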


I am primarily interested in this because, as I reported in #573 (comment), we have seen increased memory usage and instability on the NFD master as the number of provisioned and de-provisioned GPU nodes grows. This ultimately causes workload failures with an "unhealthy nvidia/gpu" error.

My current hypothesis is that the NFD master enters a relabelling iteration on a node, removes the older labels, and marks gpu.present=false, then dies (OOM killed) before it gets to label the node correctly. That in turn kills the device plugin, because its node selector uses the gpu.deploy.device-plugin label, which has now been removed, and so every nvidia.com/gpu device becomes unhealthy.
