Conversation

@evan-cz
Contributor

@evan-cz evan-cz commented Oct 8, 2025

CP-33636: add DCGM GPU metrics collection and transformation

Add complete NVIDIA DCGM GPU metrics support with Prometheus scraping
and transformation pipeline.

Prometheus Scrape Configuration:

  • Added DCGM Exporter scrape job with Kubernetes service discovery
  • Matches services labeled app.kubernetes.io/name=dcgm-exporter
  • Collects DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE
  • Includes 10 comprehensive Helm unit tests for scrape configuration

Metric Transformation Pipeline:

  • Implemented hexagonal architecture for vendor-agnostic GPU metrics:
    • MetricTransformer interface in app/types (port)
    • Catalog transformer with sequential routing (domain service)
    • DCGM transformer for NVIDIA metrics (adapter)
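The port-and-adapters layout above can be sketched roughly as follows. This is illustrative only: the type names, fields, and method signatures are stand-ins, not the actual identifiers in app/types.

```go
package main

import "fmt"

// Metric is a minimal stand-in for the pipeline's metric type.
type Metric struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// MetricTransformer is the port: vendor adapters implement it.
type MetricTransformer interface {
	Matches(m Metric) bool
	Transform(m Metric) []Metric
}

// catalog is the domain service: it routes each metric to the first
// registered transformer that matches (sequential routing).
type catalog struct {
	transformers []MetricTransformer
}

func (c *catalog) Transform(m Metric) []Metric {
	for _, t := range c.transformers {
		if t.Matches(m) {
			return t.Transform(m)
		}
	}
	return []Metric{m} // pass unmatched metrics through unchanged
}

// dcgm is a toy NVIDIA adapter that renames one metric.
type dcgm struct{}

func (dcgm) Matches(m Metric) bool { return m.Name == "DCGM_FI_DEV_GPU_UTIL" }
func (dcgm) Transform(m Metric) []Metric {
	m.Name = "container_resources_gpu_usage_percent"
	return []Metric{m}
}

func main() {
	c := &catalog{transformers: []MetricTransformer{dcgm{}}}
	out := c.Transform(Metric{Name: "DCGM_FI_DEV_GPU_UTIL", Value: 42})
	fmt.Println(out[0].Name, out[0].Value)
}
```

The point of the port is that AMD or Intel support would mean adding another adapter to the catalog, without touching the routing or the pipeline.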

DCGM Transformations:

  • DCGM_FI_DEV_GPU_UTIL → container_resources_gpu_usage_percent
  • DCGM_FI_DEV_FB_USED + FREE → container_resources_gpu_memory_usage_percent
  • UUID label renamed to gpu_uuid with GPU- prefix stripped
  • Supports both 'node' and 'Hostname' labels for node identification
  • Memory metrics buffered and calculated as percentages in flush phase
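The memory and UUID transformations above reduce to two small operations, sketched here with hypothetical helper names (the real transformer buffers FB_USED/FB_FREE pairs per GPU and emits the percentage at flush time):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeUUID strips the "GPU-" prefix, matching the gpu_uuid label
// transformation described above.
func normalizeUUID(uuid string) string {
	return strings.TrimPrefix(uuid, "GPU-")
}

// memoryUsagePercent computes used / (used + free) * 100, the value
// emitted as container_resources_gpu_memory_usage_percent once both
// halves of a buffered pair have arrived.
func memoryUsagePercent(usedMiB, freeMiB float64) float64 {
	total := usedMiB + freeMiB
	if total == 0 {
		return 0 // guard against division by zero on incomplete pairs
	}
	return usedMiB / total * 100.0
}

func main() {
	fmt.Println(normalizeUUID("GPU-6e5d2233-abcd")) // 6e5d2233-abcd
	fmt.Printf("%.1f\n", memoryUsagePercent(4096, 12288))
}
```

Computing the percentage agent-side means the backend never needs to pair up the two raw framebuffer counters itself.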

Testing:

  • 96.4% test coverage for DCGM transformer with table-driven tests
  • Edge cases: missing labels, invalid values, incomplete pairs
  • 10 Helm unit tests for scrape configuration and metric filtering
  • Verified on EKS cluster (g4dn.xlarge with Tesla T4 GPU)

Documentation:

  • README.md with transformation tables and processing strategy
  • CLAUDE.md with AI development guide and architecture context
  • Documented label transformations, node identification, and edge cases

Configuration:

  • Added 100.0 to golangci-lint allowFloats for percentage conversion
  • Updated Makefile for transform package test target

This provides foundation for multi-vendor GPU support (AMD, Intel)
following established architectural patterns in the codebase.

@evan-cz evan-cz requested a review from a team as a code owner October 8, 2025 22:35
@claude

claude bot commented Oct 8, 2025

Documentation Review - PR #494

Summary

This PR adds NVIDIA GPU metrics collection capability but lacks necessary documentation updates to inform users about this new feature.

Overall Assessment: While the implementation is well-tested and the PR description is thorough, user-facing documentation is incomplete. Users will have no way to discover or understand this new capability without reading the code or schema files.


Missing Documentation Updates

1. helm/README.md - Critical Omission (lines 1-320)

Issue: The Installation Guide and configuration documentation make no mention of GPU metrics collection.

Required: Add a new section documenting the GPU metrics feature:

  • Location: After "Labels and Annotations" section (around line 217)
  • Should include:
    • What GPU metrics are collected (cz_gpu_usage_percent, cz_gpu_memory_used_bytes, cz_gpu_memory_free_bytes)
    • How to enable GPU metrics collection (prometheusConfig.scrapeJobs.gpu.enabled: true)
    • Prerequisites (DCGM Exporter must be deployed with standard labels)
    • Configuration example showing gpu.enabled and gpu.scrapeInterval settings
    • Note about current NVIDIA-only support with future AMD/Intel vendor support planned

Example addition:

### GPU Metrics Collection

The CloudZero Agent can collect NVIDIA GPU metrics from DCGM Exporter to enable GPU cost allocation.

#### Prerequisites
- DCGM Exporter deployed in your cluster (e.g., via the `dcgm-exporter` Helm chart)
- DCGM Exporter services labeled with `app.kubernetes.io/name=dcgm-exporter`

#### Collected Metrics
- `cz_gpu_usage_percent` - GPU compute utilization (0-100%)
- `cz_gpu_memory_used_bytes` - GPU memory used per GPU (in MiB)
- `cz_gpu_memory_free_bytes` - GPU memory free per GPU (in MiB)

#### Configuration
```yaml
prometheusConfig:
  scrapeJobs:
    gpu:
      enabled: true          # Default: false
      scrapeInterval: 30s    # Default: 30s
```

Note: Currently supports NVIDIA GPUs only. AMD and Intel GPU support planned for future releases.


---

2. README.md (Root) - Missing from Feature List

Issue: The root README lists the main components (Webhook Server, Collector, Shipper, Validator) but doesn't mention that the Collector now supports GPU metrics.

Required: Update the "CloudZero Collector" description (around line 14) to mention GPU metrics support:

```markdown
- _CloudZero Collector_ - The collector application, which implements a Prometheus-compliant interface for metrics collection and writes metrics payloads to files in a shared location for consumption by the shipper. The collector classifies incoming metrics data and saves it into either cost telemetry files or observability files. **Supports container metrics, node metrics, and GPU metrics (NVIDIA DCGM Exporter) for comprehensive cost allocation.** These files are compressed on disk to save space.
```

3. CONFIGURATION.md - Incomplete

Issue: The configuration guide documents the validator configuration but does not cover user-configurable Helm values for GPU metrics.

Required: While this file focuses on validator config, consider adding a reference to helm/README.md for Prometheus scrape job configuration, or add a brief note about GPU metrics being configurable via prometheusConfig.scrapeJobs.gpu.* settings.


PR Description Validation

PR description accurately reflects implementation:

  • Correctly describes the three DCGM metrics collected and their renaming
  • Accurately explains the label selector mechanism
  • Properly notes limitations (NVIDIA-only, no recording rules in agent mode)
  • Testing coverage is comprehensive

PR description mentions "Questions" section:

  • "Should this be enabled by default?" - The answer (default: false) is implemented correctly given that DCGM Exporter is not universally deployed

Positive Observations

  1. Excellent inline documentation: The _cm_helpers.tpl template has exceptional comments explaining the scrape job configuration
  2. Comprehensive testing: 10 helm-unittest tests provide thorough coverage
  3. Schema validation: Both JSON and YAML schemas updated correctly
  4. Metric naming: The cz_ prefix is well-chosen to avoid conflicts

Recommendations

Before merge:

  1. Update helm/README.md with GPU metrics documentation section
  2. Update root README.md to mention GPU metrics support in Collector description

Consider for follow-up:

  1. Create a dedicated GPU metrics guide in helm/docs/gpu-metrics.md if this feature expands
  2. Update any customer-facing documentation or migration guides
  3. Add GPU metrics to the troubleshooting guide (helm/docs/troubleshooting-guide.md)

Conclusion: Implementation quality is high, but user-facing documentation must be updated before merge to ensure customers can discover and configure this new capability.

@claude

claude bot commented Oct 8, 2025

Documentation Issues

PR Description - 0/10

This PR needs significant documentation improvements across multiple files.

@claude

claude bot commented Oct 8, 2025

Possibly Missing Documentation Updates

File: README.md
Issue: The root README.md completely lacks any mention of GPU metrics collection despite this being a major new feature. The Message Format section documents four types of objects but makes no mention of the three new GPU metrics.
Required: Add a section documenting GPU metrics in the Message Format section, including metric names, label structure, and usage patterns.


File: helm/README.md
Issue: The Helm chart README has no documentation about GPU metrics configuration. Users need to know about the prometheusConfig.scrapeJobs.gpu.enabled and prometheusConfig.scrapeJobs.gpu.scrapeInterval configuration options.
Required: Add GPU metrics configuration section explaining how to enable GPU collection, what the scrape interval means, and that NVIDIA DCGM Exporter must be deployed separately.

@claude

claude bot commented Oct 8, 2025

File: CONFIGURATION.md
Issue: No mention of GPU-related configuration options.
Required: Document the GPU scrape job configuration parameters.


File: helm/values.yaml:558-559
Issue: Comments claim this enables GPU metrics scrape job but don't explain prerequisites (DCGM Exporter must be installed), don't mention the label selector requirement, and don't explain when this should be enabled vs disabled.
Required: More comprehensive inline documentation explaining deployment dependencies and configuration considerations.

@claude

claude bot commented Oct 8, 2025

Possibly Broken References

File: helm/README.md:312-317
Issue: The Useful References section lists memory sizing guide, deployment validation guide, istio guide, and release notes - but there's no GPU troubleshooting or configuration guide listed despite this being a complex new feature requiring external dependencies.
Required: Consider adding a GPU metrics guide or at minimum update the deployment validation guide to cover GPU metrics validation.

@claude

claude bot commented Oct 8, 2025

Additional Issues

  1. Inconsistent naming convention: The PR introduces cz_ prefix for GPU metrics but the codebase uses cloudzero_ prefix for other custom metrics. This inconsistency is explained in code comments but not in user-facing documentation.

  2. No upgrade path documentation: The PR defaults GPU metrics to disabled (false) but doesn't document in the README what happens when users upgrade with DCGM Exporter already deployed.

  3. Limited operational guidance: While the PR description asks "Should this be enabled by default?", this critical question has no documented answer or decision rationale in the configuration files.

  4. Vendor lock-in not highlighted: The PR mentions AMD and Intel would need separate scrape jobs, but this architectural limitation isn't documented in user-facing docs where customers make purchasing decisions.

Overall: The code changes appear solid with excellent inline documentation and comprehensive testing, but the user-facing documentation is severely lacking for a feature of this scope and complexity. This will create support burden.

Contributor

@dmepham dmepham left a comment


cool!

```yaml
- source_labels: [__name__]
  regex: DCGM_FI_DEV_FB_FREE
  replacement: cz_gpu_memory_free_bytes
  target_label: __name__
```
Contributor


you mentioned different metrics for different vendors - would it make sense to add a label to differentiate? just so it might be easier to know how to process different metrics. just thinking out loud, I don't know if that would actually be helpful

Contributor Author


I did! Sorry, I should have mentioned, but I added a provenance=dcgm label on lines 216-220:

```yaml
# Add provenance label to indicate DCGM as the metric source
- source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
  regex: dcgm-exporter
  replacement: dcgm
  target_label: provenance
```

I'm kind of wondering if we want to do some normalization in the collector, I don't love the thought of dealing with this complexity in the backend... That has some other benefits, too; I'll start a Slack thread in the morning.

Contributor


oh whoops, my bad, thanks!

@evan-cz evan-cz changed the title CP-33636: add DCGM scrape job CP-33636: add DCGM GPU metric collection and transformation Oct 16, 2025
@evan-cz evan-cz force-pushed the CP-33636 branch 2 times, most recently from 758d53e to dfedd89 Compare October 17, 2025 13:31
@evan-cz evan-cz enabled auto-merge October 17, 2025 13:33
@evan-cz evan-cz added this pull request to the merge queue Oct 24, 2025
Merged via the queue into develop with commit c4e1e5f Oct 24, 2025
52 of 53 checks passed
@evan-cz evan-cz deleted the CP-33636 branch October 24, 2025 16:40