Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature] Support for new customMetrics value in DCGM Exporter #934

Closed
chipzoller opened this issue Aug 16, 2024 · 8 comments · Fixed by #949
Closed

[Feature] Support for new customMetrics value in DCGM Exporter #934

chipzoller opened this issue Aug 16, 2024 · 8 comments · Fixed by #949
Labels
feature issue/PR that proposes a new feature or functionality

Comments

@chipzoller
Copy link
Contributor

NVIDIA/dcgm-exporter#351 adds support for user-definable custom metrics directly in the Helm values file which greatly simplifies the user experience needed when deploying DCGM Exporter with emission of custom metrics. Adding support for this field in the operator's values file would be very useful. Without it, users would have to resort to the (values-file undocumented) field in the ClusterPolicy at config, which accepts the name of a ConfigMap, meaning users have to pre-create this prior to deploying the operator (see also template ref here).

Please consider adding support for the new customMetrics field in the ClusterPolicy CRD and, by extension, in the Helm values of the operator to allow users to deploy the operator and DCGM Exporter with a list of custom metrics upfront.

@chipzoller
Copy link
Contributor Author

cc @francisguillier

@chipzoller
Copy link
Contributor Author

Relates to #935

@cdesiniotis cdesiniotis added the feature issue/PR that proposes a new feature or functionality label Aug 21, 2024
@cdesiniotis
Copy link
Contributor

@chipzoller thanks for the feature request. Our helm chart offers a similar experience for the k8s-device-plugin and mig-manager configuration files, where the user can provide the ConfigMap data directly via the values.yaml file, as opposed to creating a ConfigMap beforehand. See https://github.com/NVIDIA/gpu-operator/blob/v24.6.1/deployments/gpu-operator/values.yaml#L263-L291 and https://github.com/NVIDIA/gpu-operator/blob/v24.6.1/deployments/gpu-operator/values.yaml#L361-L399. We may want to provide a similar set of fields in the values file for dcgm-exporter.

PRs are welcome if you are interested in taking this on.

cc @tariq1890

@chipzoller
Copy link
Contributor Author

Hi @cdesiniotis, yep, totally, but because everything behind the scenes is handled by templating the ClusterPolicy custom resource, the changes needed here are more extensive as the API needs to be extended in the CRD to accommodate a new field (and other changes). If it were just a simple matter of extending the Helm values that'd be no problem, so unfortunately someone else may be better to pick this up.

@cdesiniotis
Copy link
Contributor

The required changes should be contained to just the helm values and templates. We shouldn't need to extend the CRD at all. See the PR which added this feature to the mig-manager component: #803

@chipzoller
Copy link
Contributor Author

I see. Looks like it should piggyback off the config object I'm attempting to document in #935. I was thinking the CR would need a new field to "natively" implement this as a block scalar value. I'll take another look at this and see if I can send a PR.

@chipzoller
Copy link
Contributor Author

@cdesiniotis, can you help unblock me here?

@chipzoller
Copy link
Contributor Author

Thanks for your help, @cdesiniotis. PR is up at #949.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature issue/PR that proposes a new feature or functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants