Update Node Feature Discovery to 0.14.x to enable NodeFeature GC #580

Closed
chiragjn opened this issue Sep 13, 2023 · 2 comments

@chiragjn

v0.14.0 was released last week. It includes garbage collection for NodeFeature objects belonging to removed nodes: the existing Topology GC has been renamed and extended to cover them.

PR: kubernetes-sigs/node-feature-discovery#1305
Chart values: https://github.com/kubernetes-sigs/node-feature-discovery/blob/v0.14.0/docs/deployment/helm.md#garbage-collector-parameters
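
To make the idea concrete, here is a minimal sketch of the selection step a garbage collector like this would perform: compare the NodeFeature objects against the nodes that still exist and pick out the orphans. This is not NFD's implementation. The function name `find_stale_node_features`, the plain-dict object shape, and the use of the `nfd.node.kubernetes.io/node-name` label to map an object to its node are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Assumption: each NodeFeature object records its node name in this label.
NODE_NAME_LABEL = "nfd.node.kubernetes.io/node-name"


def find_stale_node_features(
    node_features: Iterable[Dict], live_nodes: Iterable[str]
) -> List[str]:
    """Return the names of NodeFeature objects whose node no longer exists."""
    live = set(live_nodes)
    stale = []
    for nf in node_features:
        meta = nf.get("metadata", {})
        # Fall back to the object's own name when the label is missing.
        node = meta.get("labels", {}).get(NODE_NAME_LABEL, meta.get("name"))
        if node not in live:
            stale.append(meta.get("name"))
    return stale


if __name__ == "__main__":
    # Hypothetical objects: only gpu-node-a still exists in the cluster.
    objects = [
        {"metadata": {"name": "gpu-node-a", "labels": {NODE_NAME_LABEL: "gpu-node-a"}}},
        {"metadata": {"name": "gpu-node-b", "labels": {NODE_NAME_LABEL: "gpu-node-b"}}},
    ]
    print(find_stale_node_features(objects, ["gpu-node-a"]))  # ['gpu-node-b']
```

In a cluster that constantly provisions and de-provisions GPU nodes, the stale list is what would otherwise keep growing and be held by the NFD master.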


I am primarily interested in this because, as I reported in #573 (comment), we have seen increased memory usage and instability on the NFD master as the number of provisioned and de-provisioned GPU nodes grows. This ultimately causes workload failures with an "unhealthy nvidia/gpu" error. (My current hypothesis: the master enters a relabelling iteration on a node, removes the older labels, marks gpu.present=false, and then dies (OOM killed) before it gets to label the node correctly. That in turn kills the device plugin, because its node selector requires the gpu.deploy.device-plugin label, which is now removed, and so every nvidia.com/gpu device becomes unhealthy.)
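
The hypothesized failure chain can be illustrated with a small sketch of node-selector matching. It only models the reasoning in the hypothesis and makes no claim about what NFD or the device plugin actually do internally. The label names are copied as written in this issue, while the label values and the helper `selector_matches` are assumptions.

```python
from typing import Dict


def selector_matches(node_labels: Dict[str, str], node_selector: Dict[str, str]) -> bool:
    """True when every key/value pair in the selector is present on the node."""
    return all(node_labels.get(k) == v for k, v in node_selector.items())


# Label names are taken as written in the issue; the values are assumed.
DEVICE_PLUGIN_SELECTOR = {"gpu.deploy.device-plugin": "true"}

healthy = {"gpu.present": "true", "gpu.deploy.device-plugin": "true"}
# Hypothesized state after an interrupted relabel: old labels removed,
# gpu.present flipped, and the correct labels never written back.
interrupted = {"gpu.present": "false"}

print(selector_matches(healthy, DEVICE_PLUGIN_SELECTOR))      # True
print(selector_matches(interrupted, DEVICE_PLUGIN_SELECTOR))  # False
```

Once the selector stops matching, the device plugin pod is no longer scheduled on the node, which is consistent with the GPU devices being reported as unhealthy.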

@chiragjn
Author

I was able to run the GC component independently, and the results seem promising.
These are the memory footprints:

Red line is the NFD master
Blue is my first attempt to run GC - OOM Killed
Green is successful GC

(image: memory footprint graph for the NFD master and the two GC runs)

@shivamerla
Contributor

We are enabling this with the v23.9.0 release later this month. The NFD version has been bumped to v0.14.2.
