Conversation

@evan-cz
Contributor

@evan-cz evan-cz commented Oct 8, 2025

CP-33636: add DCGM GPU metrics collection and transformation

Add complete NVIDIA DCGM GPU metrics support with Prometheus scraping
and transformation pipeline.

Prometheus Scrape Configuration:

  • Added DCGM Exporter scrape job with Kubernetes service discovery
  • Matches services labeled app.kubernetes.io/name=dcgm-exporter
  • Collects DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE
  • Includes 10 comprehensive Helm unit tests for scrape configuration

Metric Transformation Pipeline:

  • Implemented hexagonal architecture for vendor-agnostic GPU metrics:
    • MetricTransformer interface in app/types (port)
    • Catalog transformer with sequential routing (domain service)
    • DCGM transformer for NVIDIA metrics (adapter)
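The port-and-adapters layout above can be sketched roughly as follows. This is illustrative only: the type names, fields, and method signatures are stand-ins, not the actual identifiers in app/types.

```go
package main

import "fmt"

// Metric is a minimal stand-in for the pipeline's metric type.
type Metric struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// MetricTransformer is the port: vendor adapters implement it.
type MetricTransformer interface {
	Matches(m Metric) bool
	Transform(m Metric) []Metric
}

// catalog is the domain service: it routes each metric to the first
// registered transformer that matches (sequential routing).
type catalog struct {
	transformers []MetricTransformer
}

func (c *catalog) Transform(m Metric) []Metric {
	for _, t := range c.transformers {
		if t.Matches(m) {
			return t.Transform(m)
		}
	}
	return []Metric{m} // pass unmatched metrics through unchanged
}

// dcgm is a toy NVIDIA adapter that renames one metric.
type dcgm struct{}

func (dcgm) Matches(m Metric) bool { return m.Name == "DCGM_FI_DEV_GPU_UTIL" }
func (dcgm) Transform(m Metric) []Metric {
	m.Name = "container_resources_gpu_usage_percent"
	return []Metric{m}
}

func main() {
	c := &catalog{transformers: []MetricTransformer{dcgm{}}}
	out := c.Transform(Metric{Name: "DCGM_FI_DEV_GPU_UTIL", Value: 42})
	fmt.Println(out[0].Name, out[0].Value)
}
```

The point of the port is that AMD or Intel support would mean adding another adapter to the catalog, without touching the routing or the pipeline.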

DCGM Transformations:

  • DCGM_FI_DEV_GPU_UTIL → container_resources_gpu_usage_percent
  • DCGM_FI_DEV_FB_USED + FREE → container_resources_gpu_memory_usage_percent
  • UUID label renamed to gpu_uuid with GPU- prefix stripped
  • Supports both 'node' and 'Hostname' labels for node identification
  • Memory metrics buffered and calculated as percentages in flush phase
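The memory and UUID transformations above reduce to two small operations, sketched here with hypothetical helper names (the real transformer buffers FB_USED/FB_FREE pairs per GPU and emits the percentage at flush time):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeUUID strips the "GPU-" prefix, matching the gpu_uuid label
// transformation described above.
func normalizeUUID(uuid string) string {
	return strings.TrimPrefix(uuid, "GPU-")
}

// memoryUsagePercent computes used / (used + free) * 100, the value
// emitted as container_resources_gpu_memory_usage_percent once both
// halves of a buffered pair have arrived.
func memoryUsagePercent(usedMiB, freeMiB float64) float64 {
	total := usedMiB + freeMiB
	if total == 0 {
		return 0 // guard against division by zero on incomplete pairs
	}
	return usedMiB / total * 100.0
}

func main() {
	fmt.Println(normalizeUUID("GPU-6e5d2233-abcd")) // 6e5d2233-abcd
	fmt.Printf("%.1f\n", memoryUsagePercent(4096, 12288))
}
```

Computing the percentage agent-side means the backend never needs to pair up the two raw framebuffer counters itself.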

Testing:

  • 96.4% test coverage for DCGM transformer with table-driven tests
  • Edge cases: missing labels, invalid values, incomplete pairs
  • 10 Helm unit tests for scrape configuration and metric filtering
  • Verified on EKS cluster (g4dn.xlarge with Tesla T4 GPU)

Documentation:

  • README.md with transformation tables and processing strategy
  • CLAUDE.md with AI development guide and architecture context
  • Documented label transformations, node identification, and edge cases

Configuration:

  • Added 100.0 to golangci-lint allowFloats for percentage conversion
  • Updated Makefile for transform package test target

This provides foundation for multi-vendor GPU support (AMD, Intel)
following established architectural patterns in the codebase.

@evan-cz evan-cz requested a review from a team as a code owner October 8, 2025 22:35
@claude

claude bot commented Oct 8, 2025

Documentation Review - PR #494

Summary

This PR adds NVIDIA GPU metrics collection capability but lacks necessary documentation updates to inform users about this new feature.

Overall Assessment: While the implementation is well-tested and the PR description is thorough, user-facing documentation is incomplete. Users will have no way to discover or understand this new capability without reading the code or schema files.


Missing Documentation Updates

1. helm/README.md - Critical Omission (lines 1-320)

Issue: The Installation Guide and configuration documentation make no mention of GPU metrics collection.

Required: Add a new section documenting the GPU metrics feature:

  • Location: After "Labels and Annotations" section (around line 217)
  • Should include:
    • What GPU metrics are collected (cz_gpu_usage_percent, cz_gpu_memory_used_bytes, cz_gpu_memory_free_bytes)
    • How to enable GPU metrics collection (prometheusConfig.scrapeJobs.gpu.enabled: true)
    • Prerequisites (DCGM Exporter must be deployed with standard labels)
    • Configuration example showing gpu.enabled and gpu.scrapeInterval settings
    • Note about current NVIDIA-only support with future AMD/Intel vendor support planned

Example addition:

### GPU Metrics Collection

The CloudZero Agent can collect NVIDIA GPU metrics from DCGM Exporter to enable GPU cost allocation.

#### Prerequisites
- DCGM Exporter deployed in your cluster (e.g., via the `dcgm-exporter` Helm chart)
- DCGM Exporter services labeled with `app.kubernetes.io/name=dcgm-exporter`

#### Collected Metrics
- `cz_gpu_usage_percent` - GPU compute utilization (0-100%)
- `cz_gpu_memory_used_bytes` - GPU memory used per GPU (in MiB)
- `cz_gpu_memory_free_bytes` - GPU memory free per GPU (in MiB)

#### Configuration
```yaml
prometheusConfig:
  scrapeJobs:
    gpu:
      enabled: true          # Default: false
      scrapeInterval: 30s    # Default: 30s
```

Note: Currently supports NVIDIA GPUs only. AMD and Intel GPU support planned for future releases.


---

2. README.md (Root) - Missing from Feature List

Issue: The root README lists the main components (Webhook Server, Collector, Shipper, Validator) but doesn't mention that the Collector now supports GPU metrics.

Required: Update the "CloudZero Collector" description (around line 14) to mention GPU metrics support:

```markdown
- _CloudZero Collector_ - The collector application, which implements a Prometheus-compliant interface for metrics collection and writes metrics payloads to files in a shared location for consumption by the shipper. The collector classifies incoming metrics data and saves it into either cost telemetry files or observability files. **Supports container metrics, node metrics, and GPU metrics (NVIDIA DCGM Exporter) for comprehensive cost allocation.** These files are compressed on disk to save space.
```

3. CONFIGURATION.md - Incomplete

Issue: The configuration guide documents the validator configuration but does not cover user-configurable Helm values for GPU metrics.

Required: While this file focuses on validator config, consider adding a reference to helm/README.md for Prometheus scrape job configuration, or add a brief note about GPU metrics being configurable via prometheusConfig.scrapeJobs.gpu.* settings.


PR Description Validation

PR description accurately reflects implementation:

  • Correctly describes the three DCGM metrics collected and their renaming
  • Accurately explains the label selector mechanism
  • Properly notes limitations (NVIDIA-only, no recording rules in agent mode)
  • Testing coverage is comprehensive

PR description mentions "Questions" section:

  • "Should this be enabled by default?" - The answer (default: false) is implemented correctly given that DCGM Exporter is not universally deployed

Positive Observations

  1. Excellent inline documentation: The _cm_helpers.tpl template has exceptional comments explaining the scrape job configuration
  2. Comprehensive testing: 10 helm-unittest tests provide thorough coverage
  3. Schema validation: Both JSON and YAML schemas updated correctly
  4. Metric naming: The cz_ prefix is well-chosen to avoid conflicts

Recommendations

Before merge:

  1. Update helm/README.md with GPU metrics documentation section
  2. Update root README.md to mention GPU metrics support in Collector description

Consider for follow-up:

  1. Create a dedicated GPU metrics guide in helm/docs/gpu-metrics.md if this feature expands
  2. Update any customer-facing documentation or migration guides
  3. Add GPU metrics to the troubleshooting guide (helm/docs/troubleshooting-guide.md)

Conclusion: Implementation quality is high, but user-facing documentation must be updated before merge to ensure customers can discover and configure this new capability.

@claude

claude bot commented Oct 8, 2025

Documentation Issues

PR Description - 0/10

This PR needs significant documentation improvements across multiple files.

@claude

claude bot commented Oct 8, 2025

Possibly Missing Documentation Updates

File: README.md
Issue: The root README.md completely lacks any mention of GPU metrics collection despite this being a major new feature. The Message Format section documents four types of objects but makes no mention of the three new GPU metrics.
Required: Add a section documenting GPU metrics in the Message Format section, including metric names, label structure, and usage patterns.


File: helm/README.md
Issue: The Helm chart README has no documentation about GPU metrics configuration. Users need to know about the prometheusConfig.scrapeJobs.gpu.enabled and prometheusConfig.scrapeJobs.gpu.scrapeInterval configuration options.
Required: Add GPU metrics configuration section explaining how to enable GPU collection, what the scrape interval means, and that NVIDIA DCGM Exporter must be deployed separately.

@claude

claude bot commented Oct 8, 2025

File: CONFIGURATION.md
Issue: No mention of GPU-related configuration options.
Required: Document the GPU scrape job configuration parameters.


File: helm/values.yaml:558-559
Issue: Comments claim this enables GPU metrics scrape job but don't explain prerequisites (DCGM Exporter must be installed), don't mention the label selector requirement, and don't explain when this should be enabled vs disabled.
Required: More comprehensive inline documentation explaining deployment dependencies and configuration considerations.

@claude

claude bot commented Oct 8, 2025

Possibly Broken References

File: helm/README.md:312-317
Issue: The Useful References section lists memory sizing guide, deployment validation guide, istio guide, and release notes - but there's no GPU troubleshooting or configuration guide listed despite this being a complex new feature requiring external dependencies.
Required: Consider adding a GPU metrics guide or at minimum update the deployment validation guide to cover GPU metrics validation.

@claude

claude bot commented Oct 8, 2025

Additional Issues

  1. Inconsistent naming convention: The PR introduces cz_ prefix for GPU metrics but the codebase uses cloudzero_ prefix for other custom metrics. This inconsistency is explained in code comments but not in user-facing documentation.

  2. No upgrade path documentation: The PR defaults GPU metrics to disabled (false) but doesn't document in the README what happens when users upgrade with DCGM Exporter already deployed.

  3. Limited operational guidance: While the PR description asks "Should this be enabled by default?", this critical question has no documented answer or decision rationale in the configuration files.

  4. Vendor lock-in not highlighted: The PR mentions AMD and Intel would need separate scrape jobs, but this architectural limitation isn't documented in user-facing docs where customers make purchasing decisions.

Overall: The code changes appear solid with excellent inline documentation and comprehensive testing, but the user-facing documentation is severely lacking for a feature of this scope and complexity. This will create support burden.

Contributor

@dmepham dmepham left a comment


cool!

```yaml
- source_labels: [__name__]
  regex: DCGM_FI_DEV_FB_FREE
  replacement: cz_gpu_memory_free_bytes
  target_label: __name__
```
Contributor


you mentioned different metrics for different vendors - would it make sense to add a label to differentiate? just so it might be easier to know how to process different metrics. just thinking out loud, I don't know if that would actually be helpful

Contributor Author


I did! Sorry, I should have mentioned, but I added a provenance=dcgm label on lines 216-220:

```yaml
# Add provenance label to indicate DCGM as the metric source
- source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
  regex: dcgm-exporter
  replacement: dcgm
  target_label: provenance
```

I'm kind of wondering if we want to do some normalization in the collector, I don't love the thought of dealing with this complexity in the backend... That has some other benefits, too; I'll start a Slack thread in the morning.

Contributor


oh whoops, my bad, thanks!

@evan-cz evan-cz changed the title CP-33636: add DCGM scrape job CP-33636: add DCGM GPU metric collection and transformation Oct 16, 2025
@evan-cz evan-cz force-pushed the CP-33636 branch 2 times, most recently from 758d53e to dfedd89 Compare October 17, 2025 13:31
@evan-cz evan-cz enabled auto-merge October 17, 2025 13:33
@evan-cz evan-cz added this pull request to the merge queue Oct 24, 2025
Merged via the queue into develop with commit c4e1e5f Oct 24, 2025
52 of 53 checks passed
@evan-cz evan-cz deleted the CP-33636 branch October 24, 2025 16:40