Commit 743bdda
CP-33636: add DCGM scrape job
## What

Add support for collecting NVIDIA GPU metrics from DCGM Exporter.

This implementation uses Kubernetes service discovery to automatically find DCGM Exporter instances by matching services with the label `app.kubernetes.io/name=dcgm-exporter` (the default label set by the DCGM Exporter Helm chart). This label matching may need to be configurable in the future to support non-standard deployments.

The scrape job collects three DCGM metrics and renames them using the `cz_` prefix (a CloudZero-specific namespace chosen to avoid conflicting with cAdvisor's `container_` prefix):

- `DCGM_FI_DEV_GPU_UTIL` → `cz_gpu_usage_percent` (GPU compute utilization, 0-100%)
- `DCGM_FI_DEV_FB_USED` → `cz_gpu_memory_used_bytes` (GPU memory used, reported in MiB)
- `DCGM_FI_DEV_FB_FREE` → `cz_gpu_memory_free_bytes` (GPU memory free, reported in MiB)

All metrics include per-container labels (`container`, `pod`, `namespace`, `gpu`, `modelName`, `Hostname`, `UUID`) from DCGM Exporter. The number of GPUs per container can be determined by counting distinct `gpu` label values in the `cz_gpu_usage_percent` metric. For example, a container using 2 GPUs will have separate time series with `gpu="0"` and `gpu="1"`.

Metrics are forwarded roughly as-is (a `provenance=dcgm` label is currently added) to the CloudZero collector. Because Prometheus runs in agent mode (which doesn't support local query evaluation or recording rules), additional processing such as computing total GPU memory (`used + free`) or aggregating multi-GPU utilization must be done in the aggregator or on the server side.

Note that there is no standard GPU metrics API across vendors. AMD and Intel GPU exporters use different metric names, label schemas, and exporter configurations. If we expand GPU support to other vendors, each will likely require a separate scrape job configuration.

## Why

Enable collection of GPU metrics data to support early research phases of potential future GPU-related features.
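The per-container GPU count derivation described above can be sketched in Python. This is an illustrative model only (the series, pod, and container names are made up; real series would come from the aggregator or server side, where such processing has to happen given agent mode):

```python
# Sketch: count distinct `gpu` label values per (pod, container) pair
# in cz_gpu_usage_percent series to derive GPUs-per-container.
# Series are modeled as plain label dicts; names below are hypothetical.
from collections import defaultdict

series = [
    {"__name__": "cz_gpu_usage_percent", "pod": "job-0", "container": "trainer", "gpu": "0"},
    {"__name__": "cz_gpu_usage_percent", "pod": "job-0", "container": "trainer", "gpu": "1"},
    {"__name__": "cz_gpu_usage_percent", "pod": "svc-0", "container": "infer", "gpu": "0"},
]

gpus_per_container = defaultdict(set)
for s in series:
    if s["__name__"] == "cz_gpu_usage_percent":
        gpus_per_container[(s["pod"], s["container"])].add(s["gpu"])

counts = {key: len(gpus) for key, gpus in gpus_per_container.items()}
# trainer in job-0 has two series (gpu="0", gpu="1"), so it counts as 2 GPUs
```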
## How Tested

### Helm unittest tests

- Created 10 comprehensive unittest tests in `helm/tests/gpu_metrics_test.yaml`
- Tests verify GPU scrape job configuration, metric renaming, label selectors, and remote_write filters
- All tests passing (`make helm-test-unittest`)

### Manual testing on EKS cluster

- Deployed DCGM Exporter on a test EKS cluster (g4dn.xlarge with a Tesla T4 GPU)
- Verified Prometheus discovers the DCGM service via the label selector
- Confirmed all three GPU metrics are collected with the correct names:
  - `cz_gpu_usage_percent` (0-100% GPU utilization)
  - `cz_gpu_memory_used_bytes` (GPU memory used)
  - `cz_gpu_memory_free_bytes` (GPU memory free)
- Verified metrics include all DCGM labels (`gpu`, `container`, `pod`, `namespace`, `Hostname`, `modelName`, `UUID`)
- Confirmed metrics are forwarded to the CloudZero aggregator and classified as cost metrics
- Tested with GPU disabled and confirmed no DCGM scraping occurs

## Questions

- Should this be enabled by default?
1 parent bdd7111 commit 743bdda

File tree

11 files changed: +352 additions, -10 deletions

app/functions/helmless/default-values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -554,6 +554,11 @@ prometheusConfig:
       enabled: true
       # Scrape interval for aggregator job
       scrapeInterval: 120s
+    # -- Enables the GPU metrics scrape job (NVIDIA DCGM Exporter auto-discovery).
+    gpu:
+      enabled: false
+      # Scrape interval for GPU metrics job
+      scrapeInterval: 30s
     # -- Any items added to this list will be added to the Prometheus scrape configuration.
     additionalScrapeJobs: []
```

helm/templates/_cm_helpers.tpl

Lines changed: 106 additions & 0 deletions
```diff
@@ -163,6 +163,112 @@ remote_write:
       send: false
 {{- end -}}
 
+{{/*
+NVIDIA DCGM GPU Metrics Scrape Job Configuration Template
+
+Generates Prometheus scrape job configuration for collecting NVIDIA GPU metrics from
+DCGM Exporter. This enables CloudZero cost allocation for NVIDIA GPU workloads in
+Kubernetes clusters.
+
+DCGM metrics collected and renamed:
+- DCGM_FI_DEV_GPU_UTIL → cz_gpu_usage_percent: GPU compute utilization (0-100%)
+- DCGM_FI_DEV_FB_USED → cz_gpu_memory_used_bytes: GPU memory used per GPU
+- DCGM_FI_DEV_FB_FREE → cz_gpu_memory_free_bytes: GPU memory free per GPU
+
+Scraping features:
+- Auto-discovery: Kubernetes service discovery with label selector for DCGM services
+- Container attribution: Per-container GPU usage via Kubernetes Pod Resources API
+- Label preservation: All DCGM labels (gpu, container, pod, namespace, Hostname, modelName, UUID) forwarded
+- Metric filtering: Collects only the 3 DCGM metrics needed, drops unattributed metrics
+- Provenance tracking: Adds "provenance=dcgm" label to identify metric source
+
+Note: This template is specific to NVIDIA DCGM Exporter. Future GPU vendors (AMD, Intel)
+will have separate scrape job templates added alongside this one.
+
+This configuration enables accurate GPU cost allocation by tracking per-container
+GPU usage across compute and memory dimensions, supporting multi-GPU containers
+and GPU time-slicing scenarios.
+*/}}
+{{- define "cloudzero-agent.prometheus.scrapeGPU" -}}
+# NVIDIA DCGM GPU Metrics Scrape Job
+# cloudzero-dcgm-exporter
+#
+# Automatically discovers and scrapes NVIDIA GPU metrics from DCGM Exporter
+# for GPU cost allocation and utilization tracking.
+#
+# This job is specific to NVIDIA DCGM Exporter. Future GPU vendors (AMD, Intel)
+# will have separate scrape jobs added to this configuration.
+- job_name: cloudzero-dcgm-exporter
+  scrape_interval: {{ .Values.prometheusConfig.scrapeJobs.gpu.scrapeInterval }}
+
+  # Discover DCGM Exporter services in all namespaces
+  # Use label selector to filter at the Kubernetes API level for performance
+  kubernetes_sd_configs:
+    - role: service
+      kubeconfig_file: ""
+      selectors:
+        - role: service
+          label: "app.kubernetes.io/name=dcgm-exporter"
+
+  # Relabel configs for label enrichment
+  relabel_configs:
+
+    # Add provenance label to indicate DCGM as the metric source
+    - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
+      regex: dcgm-exporter
+      replacement: dcgm
+      target_label: provenance
+
+    # Add Kubernetes metadata for cost attribution
+    - source_labels: [__meta_kubernetes_namespace]
+      target_label: kubernetes_namespace
+
+    - source_labels: [__meta_kubernetes_service_name]
+      target_label: kubernetes_service
+
+    # Note: __address__ is automatically set by service discovery to <service>:<port>
+    # No need to override it - Prometheus will use the port from the Service definition
+
+  # Metric relabel configs for filtering and renaming
+  metric_relabel_configs:
+    # Collect only the 3 raw DCGM metrics needed
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_FB_FREE
+      action: keep
+
+    # Drop metrics without container attribution
+    # (These are node-level GPU metrics not assigned to containers)
+    - source_labels: [container]
+      regex: ^$
+      action: drop
+
+    - source_labels: [pod]
+      regex: ^$
+      action: drop
+
+    - source_labels: [namespace]
+      regex: ^$
+      action: drop
+
+    # Rename DCGM_FI_DEV_GPU_UTIL to cz_gpu_usage_percent
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_GPU_UTIL
+      replacement: cz_gpu_usage_percent
+      target_label: __name__
+
+    # Rename DCGM_FI_DEV_FB_USED to cz_gpu_memory_used_bytes
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_FB_USED
+      replacement: cz_gpu_memory_used_bytes
+      target_label: __name__
+
+    # Rename DCGM_FI_DEV_FB_FREE to cz_gpu_memory_free_bytes
+    - source_labels: [__name__]
+      regex: DCGM_FI_DEV_FB_FREE
+      replacement: cz_gpu_memory_free_bytes
+      target_label: __name__
+{{- end -}}
+
 {{/*
 Prometheus Self-Monitoring Scrape Job Configuration Template
```
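As a rough illustration of the keep/drop/rename pipeline this template defines, here is a minimal Python model. It is a sketch under stated assumptions, not how Prometheus actually works internally: Prometheus performs relabeling itself, and its `regex` fields are fully anchored, which is why `fullmatch` is used below.

```python
# Simplified model of the metric_relabel_configs above:
# keep only the three DCGM metrics, drop samples without container
# attribution, then rename into the cz_ namespace.
import re

KEEP = re.compile(r"DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_FB_USED|DCGM_FI_DEV_FB_FREE")
RENAMES = {
    "DCGM_FI_DEV_GPU_UTIL": "cz_gpu_usage_percent",
    "DCGM_FI_DEV_FB_USED": "cz_gpu_memory_used_bytes",
    "DCGM_FI_DEV_FB_FREE": "cz_gpu_memory_free_bytes",
}

def relabel(sample):
    # action: keep (Prometheus regexes are anchored, hence fullmatch)
    if not KEEP.fullmatch(sample["__name__"]):
        return None
    # action: drop for empty container/pod/namespace (regex: ^$)
    if any(sample.get(label, "") == "" for label in ("container", "pod", "namespace")):
        return None
    # rename onto target_label __name__
    sample["__name__"] = RENAMES[sample["__name__"]]
    return sample
```

For example, a node-level `DCGM_FI_DEV_FB_USED` sample with an empty `container` label is dropped, while an attributed one comes out as `cz_gpu_memory_used_bytes`.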

helm/templates/_defaults.tpl

Lines changed: 6 additions & 0 deletions
```diff
@@ -69,6 +69,9 @@ containerMetrics:
   - container_memory_working_set_bytes
   - container_network_receive_bytes_total
   - container_network_transmit_bytes_total
+  - cz_gpu_usage_percent
+  - cz_gpu_memory_used_bytes
+  - cz_gpu_memory_free_bytes
 # CloudZero Agent Operational Metrics - Essential for Agent Health Monitoring
 # These metrics track CloudZero Agent performance, resource usage, and operational health,
 # enabling monitoring, alerting, and troubleshooting of the cost allocation pipeline.
@@ -227,6 +230,9 @@ metricFilters:
   - container_memory_working_set_bytes
   - container_network_receive_bytes_total
   - container_network_transmit_bytes_total
+  - cz_gpu_usage_percent
+  - cz_gpu_memory_used_bytes
+  - cz_gpu_memory_free_bytes
   - kube_node_info
   - kube_node_status_capacity
   - kube_pod_container_resource_limits
```

helm/templates/agent-cm.yaml

Lines changed: 4 additions & 0 deletions
```diff
@@ -40,6 +40,10 @@ data:
       {{- include "cloudzero-agent.prometheus.scrapePrometheus" . | nindent 6 }}
       {{- end }}
 
+      {{- if .Values.prometheusConfig.scrapeJobs.gpu.enabled }}
+      {{- include "cloudzero-agent.prometheus.scrapeGPU" . | nindent 6 }}
+      {{- end }}{{/* End GPU scrape job */}}
+
       {{- if .Values.prometheusConfig.scrapeJobs.additionalScrapeJobs -}}
       {{ toYaml .Values.prometheusConfig.scrapeJobs.additionalScrapeJobs | toString | nindent 6 }}
       {{- end }}{{/* End additional scrape jobs */}}
```

helm/tests/gpu_metrics_test.yaml

Lines changed: 155 additions & 0 deletions
```diff
@@ -0,0 +1,155 @@
+suite: test GPU metrics scrape job configuration
+templates:
+  - agent-cm.yaml
+tests:
+  # Test that GPU scrape job is included when GPU metrics are enabled
+  - it: should include DCGM GPU scrape job when GPU metrics enabled
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+      prometheusConfig.scrapeJobs.gpu.scrapeInterval: 30s
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "job_name: cloudzero-dcgm-exporter"
+
+  # Test that GPU scrape job is NOT included when GPU metrics are disabled
+  - it: should NOT include GPU scrape job when GPU metrics disabled
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: false
+    asserts:
+      - notMatchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cloudzero-dcgm-exporter"
+
+  # Test that custom scrape interval is applied correctly
+  - it: should use custom GPU scrape interval of 45s when specified
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+      prometheusConfig.scrapeJobs.gpu.scrapeInterval: 45s
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "job_name: cloudzero-dcgm-exporter"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "scrape_interval: 45s"
+
+  # Test that DCGM label selector is present
+  - it: should include DCGM label selector for service discovery
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "app.kubernetes.io/name=dcgm-exporter"
+
+  # Test that provenance label is added
+  - it: should add provenance=dcgm label to GPU metrics
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "replacement: dcgm"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "target_label: provenance"
+
+  # Test that metric renaming is configured
+  - it: should include metric renaming for cz_gpu_usage_percent
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "DCGM_FI_DEV_GPU_UTIL"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_usage_percent"
+
+  - it: should include metric renaming for cz_gpu_memory_used_bytes
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "DCGM_FI_DEV_FB_USED"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_used_bytes"
+
+  - it: should include metric renaming for cz_gpu_memory_free_bytes
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "DCGM_FI_DEV_FB_FREE"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_free_bytes"
+
+  # Test that GPU metrics are in remote_write allow list
+  - it: should include GPU metrics in remote_write filter
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_usage_percent"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_used_bytes"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "cz_gpu_memory_free_bytes"
+
+  # Test that unattributed metrics are dropped
+  - it: should drop metrics without container attribution
+    template: agent-cm.yaml
+    set:
+      apiKey: "test-key"
+      existingSecretName: null
+      prometheusConfig.scrapeJobs.gpu.enabled: true
+    asserts:
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "source_labels: \\[container\\]"
+      - matchRegex:
+          path: data["prometheus.yml"]
+          pattern: "action: drop"
```

helm/values.schema.json

Lines changed: 12 additions & 0 deletions
```diff
@@ -4591,6 +4591,18 @@
         "required": ["enabled"],
         "type": "object"
       },
+      "gpu": {
+        "additionalProperties": false,
+        "properties": {
+          "enabled": {
+            "type": "boolean"
+          },
+          "scrapeInterval": {
+            "$ref": "#/$defs/com.cloudzero.agent.duration"
+          }
+        },
+        "type": "object"
+      },
       "kubeStateMetrics": {
         "additionalProperties": false,
         "properties": {
```

helm/values.schema.yaml

Lines changed: 22 additions & 0 deletions
```diff
@@ -909,6 +909,28 @@ properties:
           description: |
             Scrape interval for aggregator job.
           $ref: "#/$defs/com.cloudzero.agent.duration"
+      gpu:
+        description: |
+          GPU metrics scrape job configuration.
+
+          Automatically discovers and scrapes GPU metrics from NVIDIA DCGM
+          Exporter. Collects GPU compute utilization and memory usage
+          metrics with per-container attribution.
+        type: object
+        additionalProperties: false
+        properties:
+          enabled:
+            description: |
+              Whether to enable the GPU metrics scrape job.
+
+              When enabled, Prometheus will automatically discover GPU
+              exporters (currently only NVIDIA DCGM is supported) and
+              collect metrics.
+            type: boolean
+          scrapeInterval:
+            description: |
+              Scrape interval for GPU metrics job.
+            $ref: "#/$defs/com.cloudzero.agent.duration"
   additionalScrapeJobs:
     description: |
       Additional scrape jobs to add to the configuration.
```

helm/values.yaml

Lines changed: 5 additions & 0 deletions
```diff
@@ -554,6 +554,11 @@ prometheusConfig:
       enabled: true
       # Scrape interval for aggregator job
      scrapeInterval: 120s
+    # -- Enables the GPU metrics scrape job (NVIDIA DCGM Exporter auto-discovery).
+    gpu:
+      enabled: false
+      # Scrape interval for GPU metrics job
+      scrapeInterval: 30s
     # -- Any items added to this list will be added to the Prometheus scrape configuration.
     additionalScrapeJobs: []
```
