@evan-cz evan-cz commented Jan 28, 2026

Customer reported pods entering CrashLoopBackOff with FailedPostStartHook events when deploying the CloudZero agent. Investigation revealed the validator's postStart hook was attempting to reach a webhook service that didn't exist, causing DNS lookup failures and ~70 second delays due to retry logic.

Functional Change:

Before: The validator ConfigMap referenced `cloudzero-agent-cz-webhook-svc` for the webhook service, but the actual service was named `cloudzero-agent-cz-webhook` (no `-svc` suffix). This caused the `webhook_server_reachable` check to fail on every deployment, delaying startup by ~70 seconds until the retries were exhausted.

After: The validator ConfigMap correctly references the webhook service using the same helper function as the service definition, ensuring names always match.

Root Cause:

The validator-cm.yaml template (line 45) was introduced in commit 90e1bce (April 2025) with a hardcoded -svc suffix that never matched the actual service name:

insights_service: {{ include "cloudzero-agent.insightsController.server.webhookFullname" . }}-svc

The webhook service in webhook-service.yaml uses:

name: {{ include "cloudzero-agent.serviceName" . }}

Both helpers resolve to the same base name (release-cz-webhook), but the validator template erroneously appended -svc, causing DNS lookup failures. The bug went unnoticed because:

  1. The enforce flag for post-start stage is false, so failures don't crash pods
  2. The check eventually times out after ~70 seconds and returns nil
  3. Federated mode deployments skip the webhook check entirely
  4. Warning-level logs were easily missed
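
Because none of these paths fails loudly, a static cross-check on the rendered chart is one way to catch this kind of mismatch. The following is a minimal sketch, not part of this PR: the function names are hypothetical, and it assumes the manifest was rendered to a file (for example with `helm template`) and that `kind:` and `metadata:` appear as unquoted top-level keys.

```shell
#!/bin/sh
# Sketch: verify that every insights_service value names a Service that the
# chart actually renders. All function names here are hypothetical.

# Print metadata.name of every "kind: Service" document in a manifest file.
list_service_names() {
    awk '
        function emit() {
            if (kind == "Service" && name != "") print name
            kind = ""; name = ""; in_meta = 0
        }
        /^---/                { emit(); next }    # document separator
        /^kind:/              { kind = $2; in_meta = 0; next }
        /^metadata:/          { in_meta = 1; next }
        /^[^ #]/              { in_meta = 0 }     # any other top-level key
        in_meta && /^  name:/ { name = $2 }       # metadata.name only
        END                   { emit() }
    ' "$1"
}

# Print the value of every insights_service key, at any indentation.
list_insights_services() {
    sed -n 's/^[[:space:]]*insights_service:[[:space:]]*//p' "$1"
}

# Succeed only if each referenced name matches a rendered Service exactly.
check_webhook_reference() {
    manifest=$1
    refs=$(list_insights_services "$manifest")
    [ -n "$refs" ] || { echo "no insights_service found" >&2; return 2; }
    for ref in $refs; do
        if ! list_service_names "$manifest" | grep -qxF -- "$ref"; then
            echo "insights_service '$ref' has no matching Service" >&2
            return 1
        fi
    done
}
```

With the pre-fix template, `check_webhook_reference` would return non-zero because no rendered Service carries the `-svc` name. The text matching is deliberately simple; a real check would more likely use a YAML-aware tool.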

Solution:

  1. Changed `validator-cm.yaml` line 45 to use the correct helper without the suffix: `insights_service: {{ include "cloudzero-agent.serviceName" . }}`

  2. Added regression test (helm/tests/validator_insights_service_test.yaml) with 5 test cases verifying:

    • insights_service matches expected pattern with default release name
    • webhook service name matches the same pattern
    • insights_service matches with custom release names
    • insights_service does NOT contain -svc suffix (regression guard)

Validation:

  • All tests pass, including new ones.
  • Manual verification: helm template test-release ./helm --set apiKey=test-key shows insights_service: test-release-cz-webhook (no -svc suffix)
  • No new test failures introduced (pre-existing failures unrelated to this change)
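
The manual verification step can be wrapped in a small guard so it is easy to repeat. This is a sketch under assumptions: `assert_no_svc_suffix` is a hypothetical helper, not something shipped with the chart, and it reads rendered manifests on stdin.

```shell
#!/bin/sh
# Hypothetical guard: reads rendered manifests on stdin, prints each
# insights_service value, and fails if any value still ends in "-svc".
assert_no_svc_suffix() {
    # Strip the key and leading whitespace, keeping only the values.
    values=$(sed -n 's/^[[:space:]]*insights_service:[[:space:]]*//p')
    if [ -z "$values" ]; then
        echo "no insights_service key in input" >&2
        return 2
    fi
    for v in $values; do
        case $v in
            *-svc)
                echo "regression: insights_service '$v' ends in -svc" >&2
                return 1
                ;;
        esac
        echo "$v"
    done
}
```

Piping the command from the manual verification into it, as in `helm template test-release ./helm --set apiKey=test-key | assert_no_svc_suffix`, should print `test-release-cz-webhook` and exit zero if the fix is in place.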

@evan-cz evan-cz requested a review from a team as a code owner January 28, 2026 16:20