
Conversation

evan-cz (Contributor) commented Jul 16, 2025

Why?

To automatically scale aggregator replicas up/down.

What

This change adds Kubernetes Horizontal Pod Autoscaler (HPA) support with a custom metrics API implementation for the CloudZero agent. The existing agent relied on manual scaling, which wasn't responsive to actual workload demands and could lead to resource inefficiencies.

The implementation exposes a Kubernetes custom metrics API v1beta1 endpoint that serves the `czo_cost_metrics_shipping_progress` metric from the collector. This metric represents the percentage of pending metrics relative to the maximum record limit, allowing HPA to scale the aggregator deployment based on actual workload pressure.
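
As a rough sketch of what serving that endpoint involves (this is not the PR's handler; the route, namespace, pod name, and sample value below are assumptions, and the JSON layout follows the v1beta1 MetricValueList wire format), it might look something like:

```go
// Hedged sketch of a custom.metrics.k8s.io/v1beta1 metric endpoint; the path,
// namespace, and values below are assumptions, not this PR's implementation.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type metricValue struct {
	DescribedObject map[string]string `json:"describedObject"`
	MetricName      string            `json:"metricName"`
	Timestamp       time.Time         `json:"timestamp"`
	Value           string            `json:"value"` // Kubernetes quantity, e.g. "950m" == 0.95
}

type metricValueList struct {
	Kind       string        `json:"kind"`
	APIVersion string        `json:"apiVersion"`
	Metadata   struct{}      `json:"metadata"`
	Items      []metricValue `json:"items"`
}

func shippingProgressHandler(w http.ResponseWriter, _ *http.Request) {
	resp := metricValueList{
		Kind:       "MetricValueList",
		APIVersion: "custom.metrics.k8s.io/v1beta1",
		Items: []metricValue{{
			DescribedObject: map[string]string{
				"kind": "Pod", "namespace": "cloudzero", "name": "cloudzero-aggregator-0", "apiVersion": "v1",
			},
			MetricName: "czo_cost_metrics_shipping_progress",
			Timestamp:  time.Now().UTC(),
			Value:      "950m", // 0.95 expressed as a Kubernetes quantity
		}},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}

func main() {
	// Hypothetical route; the HPA controller queries paths of this shape
	// through the API aggregation layer and compares the returned quantity
	// against its configured target.
	http.HandleFunc("/apis/custom.metrics.k8s.io/v1beta1/namespaces/cloudzero/pods/", shippingProgressHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```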

Key technical changes include:

  • Custom metrics API handlers implementing the v1beta1 specification
  • Discovery endpoints for API resource enumeration
  • Integration with existing metric collector to expose shipping progress
  • HPA configuration templates with proper RBAC permissions
  • Comprehensive test coverage and documentation

The approach eliminates external dependencies like Prometheus Adapter by implementing the custom metrics API directly in the collector, creating a self-contained autoscaling solution that scales based on the agent's own internal metrics rather than external observability infrastructure.
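
To make the "discovery endpoints" bullet and the self-contained approach a bit more concrete, here is a hedged continuation of the sketch above (same package and imports; the resource naming is an assumption, not the PR's code): the collector also answers resource enumeration at the group-version root so the API server and the HPA controller can find the metric.

```go
// Continues the sketch above (reuses its net/http and encoding/json imports).
// Served at /apis/custom.metrics.k8s.io/v1beta1; names and verbs are assumptions.
func discoveryHandler(w http.ResponseWriter, _ *http.Request) {
	resp := map[string]any{
		"kind":         "APIResourceList",
		"apiVersion":   "v1",
		"groupVersion": "custom.metrics.k8s.io/v1beta1",
		"resources": []map[string]any{{
			"name":       "pods/czo_cost_metrics_shipping_progress",
			"namespaced": true,
			"kind":       "MetricValueList",
			"verbs":      []string{"get"},
		}},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}
```

Presumably the chart pairs this with an APIService registration pointing custom.metrics.k8s.io/v1beta1 at the collector's Service, plus the RBAC noted above, which is what makes Prometheus Adapter unnecessary.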

How Tested

Lots of deploying, and lots and lots of waiting.

We want to add these eventually, but they're still under development right now.
@evan-cz force-pushed the CP-30604 branch 2 times, most recently from 79ba918 to 899ca17 on July 16, 2025 at 23:39
```go
costMetricsShippingProgress = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: types.ObservabilityMetric("cost_metrics_shipping_progress"),
        Help: "Progress towards cost metrics shipping goal (ratio of currentPending/targetProgress), where targetProgress = (elapsedTime/costMaxInterval) * maxRecords, 1.0 = 100% of expected rate",
    },
    // ... (label names elided in the review context)
)
```
Contributor

I'm wondering if we should add a "buffer" or scale factor, e.g. `(currentPending/targetProgress)*1.05`

Never mind, this is handled properly by `targetValue: "900m"`.

Why am I still posting this comment? Not sure. Just to say: this all looks great to me. Nice work!
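
For anyone reading later, a worked example with made-up numbers: if costMaxInterval is 10 minutes, maxRecords is 1,000,000, and 3 minutes into the interval 400,000 records are still pending, then targetProgress = (3/10) * 1,000,000 = 300,000 and the metric reads 400,000 / 300,000 ≈ 1.33. HPA's rule is desiredReplicas = ceil(currentReplicas * currentMetric / target), so with a target of 900m (0.9) it starts adding replicas once the backlog passes ~90% of the expected shipping rate, which is the buffer discussed in this thread.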

Contributor Author

I'm also working on a simplification where we just base it on metrics/minute over a 2-minute sliding window, which also lets us do some other fun stuff...
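
That simplification isn't in this PR; purely to make the idea concrete, a 2-minute sliding-window rate could be as small as the following sketch (all names hypothetical):

```go
// Hedged sketch only; none of this is in the PR. Tracks metrics/minute over a
// sliding window using timestamped ingest counts.
package main

import (
	"fmt"
	"sync"
	"time"
)

type sample struct {
	at    time.Time
	count int
}

type slidingRate struct {
	mu      sync.Mutex
	window  time.Duration
	samples []sample
}

// Observe records that n metrics were ingested at the current time.
func (s *slidingRate) Observe(n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.samples = append(s.samples, sample{at: time.Now(), count: n})
}

// PerMinute drops samples older than the window and returns the ingest rate.
func (s *slidingRate) PerMinute() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	cutoff := time.Now().Add(-s.window)
	kept, total := s.samples[:0], 0
	for _, smp := range s.samples {
		if smp.at.After(cutoff) {
			kept = append(kept, smp)
			total += smp.count
		}
	}
	s.samples = kept
	return float64(total) / s.window.Minutes()
}

func main() {
	r := &slidingRate{window: 2 * time.Minute}
	r.Observe(1200)
	r.Observe(800)
	fmt.Printf("%.1f metrics/minute\n", r.PerMinute()) // 1000.0 with these two samples
}
```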
