
Conversation

evan-cz (Contributor) commented Jul 16, 2025

Why?

To automatically scale aggregator replicas up/down.

What

This change adds Kubernetes Horizontal Pod Autoscaler (HPA) support with a custom metrics API implementation for the CloudZero agent. The existing agent relied on manual scaling, which wasn't responsive to actual workload demands and could lead to resource inefficiencies.

The implementation exposes a Kubernetes custom metrics API v1beta1 endpoint that serves the `czo_cost_metrics_shipping_progress` metric from the collector. This metric represents the percentage of pending metrics relative to the maximum record limit, allowing HPA to scale the aggregator deployment based on actual workload pressure.
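
As a rough sketch of what serving that endpoint involves (this is not the PR's handler; the route, namespace, pod name, and sample value below are assumptions, and the JSON layout follows the v1beta1 MetricValueList wire format), it might look something like:

```go
// Hedged sketch of a custom.metrics.k8s.io/v1beta1 metric endpoint; the path,
// namespace, and values below are assumptions, not this PR's implementation.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

type metricValue struct {
	DescribedObject map[string]string `json:"describedObject"`
	MetricName      string            `json:"metricName"`
	Timestamp       time.Time         `json:"timestamp"`
	Value           string            `json:"value"` // Kubernetes quantity, e.g. "950m" == 0.95
}

type metricValueList struct {
	Kind       string        `json:"kind"`
	APIVersion string        `json:"apiVersion"`
	Metadata   struct{}      `json:"metadata"`
	Items      []metricValue `json:"items"`
}

func shippingProgressHandler(w http.ResponseWriter, _ *http.Request) {
	resp := metricValueList{
		Kind:       "MetricValueList",
		APIVersion: "custom.metrics.k8s.io/v1beta1",
		Items: []metricValue{{
			DescribedObject: map[string]string{
				"kind": "Pod", "namespace": "cloudzero", "name": "cloudzero-aggregator-0", "apiVersion": "v1",
			},
			MetricName: "czo_cost_metrics_shipping_progress",
			Timestamp:  time.Now().UTC(),
			Value:      "950m", // 0.95 expressed as a Kubernetes quantity
		}},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}

func main() {
	// Hypothetical route; the HPA controller queries paths of this shape
	// through the API aggregation layer and compares the returned quantity
	// against its configured target.
	http.HandleFunc("/apis/custom.metrics.k8s.io/v1beta1/namespaces/cloudzero/pods/", shippingProgressHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```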

Key technical changes include:

  • Custom metrics API handlers implementing the v1beta1 specification
  • Discovery endpoints for API resource enumeration
  • Integration with existing metric collector to expose shipping progress
  • HPA configuration templates with proper RBAC permissions
  • Comprehensive test coverage and documentation

The approach eliminates external dependencies like Prometheus Adapter by implementing the custom metrics API directly in the collector, creating a self-contained autoscaling solution that scales based on the agent's own internal metrics rather than external observability infrastructure.
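
To make the "discovery endpoints" bullet and the self-contained approach a bit more concrete, here is a hedged continuation of the sketch above (same package and imports; the resource naming is an assumption, not the PR's code): the collector also answers resource enumeration at the group-version root so the API server and the HPA controller can find the metric.

```go
// Continues the sketch above (reuses its net/http and encoding/json imports).
// Served at /apis/custom.metrics.k8s.io/v1beta1; names and verbs are assumptions.
func discoveryHandler(w http.ResponseWriter, _ *http.Request) {
	resp := map[string]any{
		"kind":         "APIResourceList",
		"apiVersion":   "v1",
		"groupVersion": "custom.metrics.k8s.io/v1beta1",
		"resources": []map[string]any{{
			"name":       "pods/czo_cost_metrics_shipping_progress",
			"namespaced": true,
			"kind":       "MetricValueList",
			"verbs":      []string{"get"},
		}},
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(resp)
}
```

Presumably the chart pairs this with an APIService registration pointing custom.metrics.k8s.io/v1beta1 at the collector's Service, plus the RBAC noted above, which is what makes Prometheus Adapter unnecessary.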

How Tested

Lots of deploying, and lots and lots of waiting.

We want to add these eventually, but they're still under development right now.
@evan-cz force-pushed the CP-30604 branch 2 times, most recently from 79ba918 to 899ca17 on July 16, 2025 at 23:39
```go
costMetricsShippingProgress = promauto.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: types.ObservabilityMetric("cost_metrics_shipping_progress"),
        Help: "Progress towards cost metrics shipping goal (ratio of currentPending/targetProgress), where targetProgress = (elapsedTime/costMaxInterval) * maxRecords, 1.0 = 100% of expected rate",
    },
    // ... (label names elided in the review context)
)
```
Contributor

I'm wondering if we should add a "buffer" or scale factor, e.g. `(currentPending/targetProgress)*1.05`

Never mind, this is handled properly by `targetValue: "900m"`.

Why am I still posting this comment? Not sure. Just to say: this all looks great to me. Nice work!
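
For anyone reading later, a worked example with made-up numbers: if costMaxInterval is 10 minutes, maxRecords is 1,000,000, and 3 minutes into the interval 400,000 records are still pending, then targetProgress = (3/10) * 1,000,000 = 300,000 and the metric reads 400,000 / 300,000 ≈ 1.33. HPA's rule is desiredReplicas = ceil(currentReplicas * currentMetric / target), so with a target of 900m (0.9) it starts adding replicas once the backlog passes ~90% of the expected shipping rate, which is the buffer discussed in this thread.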

Contributor Author

I'm also working on a simplification where we just base it on metrics/minute over a 2-minute sliding window, which also lets us do some other fun stuff...
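
That simplification isn't in this PR; purely to make the idea concrete, a 2-minute sliding-window rate could be as small as the following sketch (all names hypothetical):

```go
// Hedged sketch only; none of this is in the PR. Tracks metrics/minute over a
// sliding window using timestamped ingest counts.
package main

import (
	"fmt"
	"sync"
	"time"
)

type sample struct {
	at    time.Time
	count int
}

type slidingRate struct {
	mu      sync.Mutex
	window  time.Duration
	samples []sample
}

// Observe records that n metrics were ingested at the current time.
func (s *slidingRate) Observe(n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.samples = append(s.samples, sample{at: time.Now(), count: n})
}

// PerMinute drops samples older than the window and returns the ingest rate.
func (s *slidingRate) PerMinute() float64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	cutoff := time.Now().Add(-s.window)
	kept, total := s.samples[:0], 0
	for _, smp := range s.samples {
		if smp.at.After(cutoff) {
			kept = append(kept, smp)
			total += smp.count
		}
	}
	s.samples = kept
	return float64(total) / s.window.Minutes()
}

func main() {
	r := &slidingRate{window: 2 * time.Minute}
	r.Observe(1200)
	r.Observe(800)
	fmt.Printf("%.1f metrics/minute\n", r.PerMinute()) // 1000.0 with these two samples
}
```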
