-
Notifications
You must be signed in to change notification settings - Fork 9
CP-33636: add DCGM GPU metric collection and transformation #494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Documentation Review - PR #494SummaryThis PR adds NVIDIA GPU metrics collection capability but lacks necessary documentation updates to inform users about this new feature. Overall Assessment: While the implementation is well-tested and the PR description is thorough, user-facing documentation is incomplete. Users will have no way to discover or understand this new capability without reading the code or schema files. Missing Documentation Updates1. helm/README.md - Critical Omission (lines 1-320)Issue: The Installation Guide and configuration documentation make no mention of GPU metrics collection. Required: Add a new section documenting the GPU metrics feature:
Example addition: ### GPU Metrics Collection
The CloudZero Agent can collect NVIDIA GPU metrics from DCGM Exporter to enable GPU cost allocation.
#### Prerequisites
- DCGM Exporter deployed in your cluster (e.g., via the `dcgm-exporter` Helm chart)
- DCGM Exporter services labeled with `app.kubernetes.io/name=dcgm-exporter`
#### Collected Metrics
- `cz_gpu_usage_percent` - GPU compute utilization (0-100%)
- `cz_gpu_memory_used_bytes` - GPU memory used per GPU (in MiB)
- `cz_gpu_memory_free_bytes` - GPU memory free per GPU (in MiB)
#### Configuration
```yaml
prometheusConfig:
scrapeJobs:
gpu:
enabled: true # Default: false
scrapeInterval: 30s # Default: 30sNote: Currently supports NVIDIA GPUs only. AMD and Intel GPU support planned for future releases. 3. CONFIGURATION.md - IncompleteIssue: The configuration guide documents the validator configuration but does not cover user-configurable Helm values for GPU metrics. Required: While this file focuses on validator config, consider adding a reference to PR Description Validation✅ PR description accurately reflects implementation:
❌ PR description mentions "Questions" section:
Positive Observations
RecommendationsBefore merge:
Consider for follow-up:
Conclusion: Implementation quality is high, but user-facing documentation must be updated before merge to ensure customers can discover and configure this new capability. |
Documentation IssuesPR Description - 0/10This PR needs significant documentation improvements across multiple files. |
Possibly Missing Documentation UpdatesFile: README.md File: helm/README.md |
|
File: CONFIGURATION.md File: helm/values.yaml:558-559 |
Possibly Broken ReferencesFile: helm/README.md:312-317 |
Additional Issues
Overall: The code changes appear solid with excellent inline documentation and comprehensive testing, but the user-facing documentation is severely lacking for a feature of this scope and complexity. This will create support burden. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool!
helm/templates/_cm_helpers.tpl
Outdated
| - source_labels: [__name__] | ||
| regex: DCGM_FI_DEV_FB_FREE | ||
| replacement: cz_gpu_memory_free_bytes | ||
| target_label: __name__ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mentioned different metrics for different vendors - would it make sense to add a label to the to differentiate? just so it might be easier to know how to process different metrics. just thinking out loud, I don't know if that would actually be helpful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did! Sorry, I should have mentioned, but I add a provenance=dcgm label on lines 216-220:
# Add provenance label to indicate DCGM as the metric source
- source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_name]
regex: dcgm-exporter
replacement: dcgm
target_label: provenanceI'm kind of wondering if we want to do some normalization in the collector, I don't love the thought of dealing with this complexity in the backend... That has some other benefits, too; I'll start a Slack thread in the morning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh whoops, my bad, thanks!
758d53e to
dfedd89
Compare
Add complete NVIDIA DCGM GPU metrics support with Prometheus scraping and transformation pipeline. Prometheus Scrape Configuration: - Added DCGM Exporter scrape job with Kubernetes service discovery - Matches services labeled `app.kubernetes.io/name=dcgm-exporter` - Collects DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE - Includes 10 comprehensive Helm unit tests for scrape configuration Metric Transformation Pipeline: - Implemented hexagonal architecture for vendor-agnostic GPU metrics: - MetricTransformer interface in app/types (port) - Catalog transformer with sequential routing (domain service) - DCGM transformer for NVIDIA metrics (adapter) DCGM Transformations: - DCGM_FI_DEV_GPU_UTIL → container_resources_gpu_usage_percent - DCGM_FI_DEV_FB_USED + FREE → container_resources_gpu_memory_usage_percent - UUID label renamed to gpu_uuid with GPU- prefix stripped - Supports both 'node' and 'Hostname' labels for node identification - Memory metrics buffered and calculated as percentages in flush phase Testing: - 96.4% test coverage for DCGM transformer with table-driven tests - Edge cases: missing labels, invalid values, incomplete pairs - 10 Helm unit tests for scrape configuration and metric filtering - Verified on EKS cluster (g4dn.xlarge with Tesla T4 GPU) Documentation: - README.md with transformation tables and processing strategy - CLAUDE.md with AI development guide and architecture context - Documented label transformations, node identification, and edge cases Configuration: - Added 100.0 to golangci-lint allowFloats for percentage conversion - Updated Makefile for transform package test target This provides foundation for multi-vendor GPU support (AMD, Intel) following established architectural patterns in the codebase.
CP-33636: add DCGM GPU metrics collection and transformation
Add complete NVIDIA DCGM GPU metrics support with Prometheus scraping
and transformation pipeline.
Prometheus Scrape Configuration:
app.kubernetes.io/name=dcgm-exporterMetric Transformation Pipeline:
DCGM Transformations:
Testing:
Documentation:
Configuration:
This provides foundation for multi-vendor GPU support (AMD, Intel)
following established architectural patterns in the codebase.