Commit 743bdda
CP-33636: add DCGM scrape job
## What

Add support for collecting NVIDIA GPU metrics from DCGM Exporter.

This implementation uses Kubernetes service discovery to automatically find DCGM Exporter instances by matching services with the label `app.kubernetes.io/name=dcgm-exporter` (the default label set by the DCGM Exporter Helm chart). This label matching may need to be configurable in the future to support non-standard deployments.

The scrape job collects three DCGM metrics and renames them using the `cz_` prefix (a CloudZero-specific namespace chosen to avoid conflicting with cAdvisor's `container_` prefix):

- `DCGM_FI_DEV_GPU_UTIL` → `cz_gpu_usage_percent` (GPU compute utilization, 0-100%)
- `DCGM_FI_DEV_FB_USED` → `cz_gpu_memory_used_bytes` (GPU memory used, reported in MiB)
- `DCGM_FI_DEV_FB_FREE` → `cz_gpu_memory_free_bytes` (GPU memory free, reported in MiB)

All metrics include per-container labels (`container`, `pod`, `namespace`, `gpu`, `modelName`, `Hostname`, `UUID`) from DCGM Exporter. The number of GPUs per container can be determined by counting distinct `gpu` label values in the `cz_gpu_usage_percent` metric. For example, a container using 2 GPUs will have separate time series with `gpu="0"` and `gpu="1"`.

Metrics are forwarded roughly as-is (a `provenance=dcgm` label is currently added) to the CloudZero collector. Because Prometheus runs in agent mode (which doesn't support local query evaluation or recording rules), additional processing such as computing total GPU memory (`used + free`) or aggregating multi-GPU utilization must be done in the aggregator or on the server side.

Note that there is no standard GPU metrics API across vendors. AMD and Intel GPU exporters use different metric names, label schemas, and exporter configurations. If we expand GPU support to other vendors, each will likely require a separate scrape job configuration.

## Why

Enable collection of GPU metrics data to support early research phases of potential future GPU-related features.
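The per-container GPU count derivation described above can be sketched in Python. This is an illustrative model only (the series, pod, and container names are made up; real series would come from the aggregator or server side, where such processing has to happen given agent mode):

```python
# Sketch: count distinct `gpu` label values per (pod, container) pair
# in cz_gpu_usage_percent series to derive GPUs-per-container.
# Series are modeled as plain label dicts; names below are hypothetical.
from collections import defaultdict

series = [
    {"__name__": "cz_gpu_usage_percent", "pod": "job-0", "container": "trainer", "gpu": "0"},
    {"__name__": "cz_gpu_usage_percent", "pod": "job-0", "container": "trainer", "gpu": "1"},
    {"__name__": "cz_gpu_usage_percent", "pod": "svc-0", "container": "infer", "gpu": "0"},
]

gpus_per_container = defaultdict(set)
for s in series:
    if s["__name__"] == "cz_gpu_usage_percent":
        gpus_per_container[(s["pod"], s["container"])].add(s["gpu"])

counts = {key: len(gpus) for key, gpus in gpus_per_container.items()}
# trainer in job-0 has two series (gpu="0", gpu="1"), so it counts as 2 GPUs
```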
## How Tested

### Helm unittest tests

- Created 10 comprehensive unittest tests in `helm/tests/gpu_metrics_test.yaml`
- Tests verify GPU scrape job configuration, metric renaming, label selectors, and remote_write filters
- All tests passing (`make helm-test-unittest`)

### Manual testing on EKS cluster

- Deployed DCGM Exporter on a test EKS cluster (g4dn.xlarge with a Tesla T4 GPU)
- Verified Prometheus discovers the DCGM service via the label selector
- Confirmed all three GPU metrics are collected with the correct names:
  - `cz_gpu_usage_percent` (0-100% GPU utilization)
  - `cz_gpu_memory_used_bytes` (GPU memory used)
  - `cz_gpu_memory_free_bytes` (GPU memory free)
- Verified metrics include all DCGM labels (`gpu`, `container`, `pod`, `namespace`, `Hostname`, `modelName`, `UUID`)
- Confirmed metrics are forwarded to the CloudZero aggregator and classified as cost metrics
- Tested with GPU disabled and confirmed no DCGM scraping occurs

## Questions

- Should this be enabled by default?
1 parent bdd7111 commit 743bdda

File tree

11 files changed: +352 additions, -10 deletions

app/functions/helmless/default-values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -554,6 +554,11 @@ prometheusConfig:
       enabled: true
       # Scrape interval for aggregator job
       scrapeInterval: 120s
+    # -- Enables the GPU metrics scrape job (NVIDIA DCGM Exporter auto-discovery).
+    gpu:
+      enabled: false
+      # Scrape interval for GPU metrics job
+      scrapeInterval: 30s
     # -- Any items added to this list will be added to the Prometheus scrape configuration.
     additionalScrapeJobs: []
```

helm/templates/_cm_helpers.tpl

Lines changed: 106 additions & 0 deletions
```diff
@@ -163,6 +163,112 @@ remote_write:
       send: false
 {{- end -}}
 
+{{/*
+NVIDIA DCGM GPU Metrics Scrape Job Configuration Template
+
+Generates Prometheus scrape job configuration for collecting NVIDIA GPU metrics from
+DCGM Exporter. This enables CloudZero cost allocation for NVIDIA GPU workloads in
+Kubernetes clusters.
+
+DCGM metrics collected and renamed:
+- DCGM_FI_DEV_GPU_UTIL → cz_gpu_usage_percent: GPU compute utilization (0-100%)
+- DCGM_FI_DEV_FB_USED → cz_gpu_memory_used_bytes: GPU memory used per GPU
+- DCGM_FI_DEV_FB_FREE → cz_gpu_memory_free_bytes: GPU memory free per GPU
+
+Scraping features:
+- Auto-discovery: Kubernetes service discovery with label selector for DCGM services
+- Container attribution: Per-container GPU usage via Kubernetes Pod Resources API
+- Label preservation: All DCGM labels (gpu, container, pod, namespace, Hostname, modelName, UUID) forwarded
+- Metric filtering: Collects only the 3 DCGM metrics needed, drops unattributed metrics
+- Provenance tracking: Adds "provenance=dcgm" label to identify metric source
+
+Note: This template is specific to NVIDIA DCGM Exporter. Future GPU vendors (AMD, Intel)
+will have separate scrape job templates added alongside this one.
+
+This configuration enables accurate GPU cost allocation by tracking per-container
+GPU usage across compute and memory dimensions, supporting multi-GPU containers
+and GPU time-slicing scenarios.
+*/}}
+{{- define "cloudzero-agent.prometheus.scrapeGPU" -}}
+# NVIDIA DCGM GPU Metrics Scrape Job
+# cloudzero-dcgm-exporter
+#
+# Automatically discovers and scrapes NVIDIA GPU metrics from DCGM Exporter
+# for GPU cost allocation and utilization tracking.
+#
+# This job is specific to NVIDIA DCGM Exporter. Future GPU vendors (AMD, Intel)
+# will have separate scrape jobs added to this configuration.
+- job_name: cloudzero-dcgm-exporter
+  scrape_interval: {{ .Values.prometheusConfig.scrapeJobs.gpu.scrapeInterval }}
+
+  # Discover DCGM Exporter services in all namespaces
+  # Use label selector to filter at the Kubernetes API level for performance
+  kubernetes_sd_configs:
+    - role: service
+      kubeconfig_file: ""
+      selectors:
+        - role: service
+          label: "app.kubernetes.io/name=dcgm-exporter"
+
+  # Relabel configs for label enrichment
+  relabel_configs:
+
+    # Add provenance label to indicate DCGM as the metric source
+    - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
+      regex: dcgm-exporter
+      replacement: dcgm
+      target_label: provenance
+
+    # Add Kubernetes metadata for cost attribution
+    - source_labels: [__meta_kubernetes_namespace]
+      target_label: kubernetes_namespace
+
+    - source_labels: [__meta_kubernetes_service_name]
+      target_label: kubernetes_service
+
+    # Note: __address__ is automatically set by service discovery to <service>:<port>
+    # No need to override it - Prometheus will use the port from the Service definition
+
+  # Metric relabel configs for filtering and renaming
+  metric_relabel_configs:
+    # Collect only the 3 raw DCGM metrics needed
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_FB_FREE
+      action: keep
+
+    # Drop metrics without container attribution
+    # (These are node-level GPU metrics not assigned to containers)
+    - source_labels: [container]
+      regex: ^$
+      action: drop
+
+    - source_labels: [pod]
+      regex: ^$
+      action: drop
+
+    - source_labels: [namespace]
+      regex: ^$
+      action: drop
+
+    # Rename DCGM_FI_DEV_GPU_UTIL to cz_gpu_usage_percent
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_GPU_UTIL
+      replacement: cz_gpu_usage_percent
+      target_label: __name__
+
+    # Rename DCGM_FI_DEV_FB_USED to cz_gpu_memory_used_bytes
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_FB_USED
+      replacement: cz_gpu_memory_used_bytes
+      target_label: __name__
+
+    # Rename DCGM_FI_DEV_FB_FREE to cz_gpu_memory_free_bytes
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_FB_FREE
+      replacement: cz_gpu_memory_free_bytes
+      target_label: __name__
+{{- end -}}
+
 {{/*
 Prometheus Self-Monitoring Scrape Job Configuration Template
```
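As a rough illustration of the keep/drop/rename pipeline this template defines, here is a minimal Python model. It is a sketch under stated assumptions, not how Prometheus actually works internally: Prometheus performs relabeling itself, and its `regex` fields are fully anchored, which is why `fullmatch` is used below.

```python
# Simplified model of the metric_relabel_configs above:
# keep only the three DCGM metrics, drop samples without container
# attribution, then rename into the cz_ namespace.
import re

KEEP = re.compile(r"DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_FB_FREE")
RENAMES = {
    "DCGM_FI_DEV_GPU_UTIL": "cz_gpu_usage_percent",
    "DCGM_FI_DEV_FB_USED": "cz_gpu_memory_used_bytes",
    "DCGM_FI_DEV_FB_FREE": "cz_gpu_memory_free_bytes",
}

def relabel(sample):
    # action: keep (Prometheus regexes are anchored, hence fullmatch)
    if not KEEP.fullmatch(sample["__name__"]):
        return None
    # action: drop for empty container/pod/namespace (regex: ^$)
    if any(sample.get(label, "") == "" for label in ("container", "pod", "namespace")):
        return None
    # rename onto target_label __name__
    sample["__name__"] = RENAMES[sample["__name__"]]
    return sample
```

For example, a node-level `DCGM_FI_DEV_FB_USED` sample with an empty `container` label is dropped, while an attributed one comes out as `cz_gpu_memory_used_bytes`.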

helm/templates/_defaults.tpl

Lines changed: 6 additions & 0 deletions
```diff
@@ -69,6 +69,9 @@ containerMetrics:
   - container_memory_working_set_bytes
   - container_network_receive_bytes_total
   - container_network_transmit_bytes_total
+  - cz_gpu_usage_percent
+  - cz_gpu_memory_used_bytes
+  - cz_gpu_memory_free_bytes
 # CloudZero Agent Operational Metrics - Essential for Agent Health Monitoring
 # These metrics track CloudZero Agent performance, resource usage, and operational health,
 # enabling monitoring, alerting, and troubleshooting of the cost allocation pipeline.
@@ -227,6 +230,9 @@ metricFilters:
   - container_memory_working_set_bytes
   - container_network_receive_bytes_total
   - container_network_transmit_bytes_total
+  - cz_gpu_usage_percent
+  - cz_gpu_memory_used_bytes
+  - cz_gpu_memory_free_bytes
   - kube_node_info
   - kube_node_status_capacity
   - kube_pod_container_resource_limits
```

helm/templates/agent-cm.yaml

Lines changed: 4 additions & 0 deletions
```diff
@@ -40,6 +40,10 @@ data:
       {{- include "cloudzero-agent.prometheus.scrapePrometheus" . | nindent 6 }}
       {{- end }}
 
+      {{- if .Values.prometheusConfig.scrapeJobs.gpu.enabled }}
+      {{- include "cloudzero-agent.prometheus.scrapeGPU" . | nindent 6 }}
+      {{- end }}{{/* End GPU scrape job */}}
+
       {{- if .Values.prometheusConfig.scrapeJobs.additionalScrapeJobs -}}
       {{ toYaml .Values.prometheusConfig.scrapeJobs.additionalScrapeJobs | toString | nindent 6 }}
       {{- end }}{{/* End additional scrape jobs */}}
```

helm/tests/gpu_metrics_test.yaml

Lines changed: 155 additions & 0 deletions
```diff
@@ -0,0 +1,155 @@
+suite: test GPU metrics scrape job configuration
+templates:
+  - agent-cm.yaml
+tests:
+  # Test that GPU scrape job is included when GPU metrics are enabled
+  - it: should include DCGM GPU scrape job when GPU metrics enabled
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+      prometheusConfig.scrapeJobs.gpu.scrapeInterval: 30s
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "job_name: cloudzero-dcgm-exporter"
+
+  # Test that GPU scrape job is NOT included when GPU metrics are disabled
+  - it: should NOT include GPU scrape job when GPU metrics disabled
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: false
+    asserts:
+      - notMatchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cloudzero-dcgm-exporter"
+
+  # Test that custom scrape interval is applied correctly
+  - it: should use custom GPU scrape interval of 45s when specified
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+      prometheusConfig.scrapeJobs.gpu.scrapeInterval: 45s
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "job_name: cloudzero-dcgm-exporter"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "scrape_interval: 45s"
+
+  # Test that DCGM label selector is present
+  - it: should include DCGM label selector for service discovery
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "app.kubernetes.io/name=dcgm-exporter"
+
+  # Test that provenance label is added
+  - it: should add provenance=dcgm label to GPU metrics
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "replacement: dcgm"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "target_label: provenance"
+
+  # Test that metric renaming is configured
+  - it: should include metric renaming for cz_gpu_usage_percent
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "DCGM_FI_DEV_GPU_UTIL"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_usage_percent"
+
+  - it: should include metric renaming for cz_gpu_memory_used_bytes
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "DCGM_FI_DEV_FB_USED"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_used_bytes"
+
+  - it: should include metric renaming for cz_gpu_memory_free_bytes
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "DCGM_FI_DEV_FB_FREE"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_free_bytes"
+
+  # Test that GPU metrics are in remote_write allow list
+  - it: should include GPU metrics in remote_write filter
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_usage_percent"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_used_bytes"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_free_bytes"
+
+  # Test that unattributed metrics are dropped
+  - it: should drop metrics without container attribution
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "source_labels: \\[container\\]"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "action: drop"
```

helm/values.schema.json

Lines changed: 12 additions & 0 deletions
```diff
@@ -4591,6 +4591,18 @@
         "required": ["enabled"],
         "type": "object"
       },
+      "gpu": {
+        "additionalProperties": false,
+        "properties": {
+          "enabled": {
+            "type": "boolean"
+          },
+          "scrapeInterval": {
+            "$ref": "#/$defs/com.cloudzero.agent.duration"
+          }
+        },
+        "type": "object"
+      },
       "kubeStateMetrics": {
         "additionalProperties": false,
         "properties": {
```

helm/values.schema.yaml

Lines changed: 22 additions & 0 deletions
```diff
@@ -909,6 +909,28 @@ properties:
           description: |
             Scrape interval for aggregator job.
           $ref: "#/$defs/com.cloudzero.agent.duration"
+      gpu:
+        description: |
+          GPU metrics scrape job configuration.
+
+          Automatically discovers and scrapes GPU metrics from NVIDIA DCGM
+          Exporter. Collects GPU compute utilization and memory usage
+          metrics with per-container attribution.
+        type: object
+        additionalProperties: false
+        properties:
+          enabled:
+            description: |
+              Whether to enable the GPU metrics scrape job.
+
+              When enabled, Prometheus will automatically discover GPU
+              exporters (currently only NVIDIA DCGM is supported) and
+              collect metrics.
+            type: boolean
+          scrapeInterval:
+            description: |
+              Scrape interval for GPU metrics job.
+            $ref: "#/$defs/com.cloudzero.agent.duration"
   additionalScrapeJobs:
     description: |
       Additional scrape jobs to add to the configuration.
```

helm/values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -554,6 +554,11 @@ prometheusConfig:
       enabled: true
       # Scrape interval for aggregator job
      scrapeInterval: 120s
+    # -- Enables the GPU metrics scrape job (NVIDIA DCGM Exporter auto-discovery).
+    gpu:
+      enabled: false
+      # Scrape interval for GPU metrics job
+      scrapeInterval: 30s
     # -- Any items added to this list will be added to the Prometheus scrape configuration.
     additionalScrapeJobs: []
```
