
Conversation


@thandleman-r7 commented Sep 24, 2025

Why?

We are deploying the CloudZero Agent into multiple clusters that have an Istio service mesh deployed. We noticed that the backfill job was continuously throwing an error that the webhook API was not available:

[screenshot: backfill job logs showing repeated "webhook API not available" errors]

Upon further investigation, we noticed that Istio was throwing errors about traffic being redirected to the BlackHoleCluster, i.e., traffic was not being allowed to continue on to the Service. Below is the output we saw in the istio-proxy container on the backfill job:

{"x_b3_sampled":null,"user_agent":null,"upstream_cluster":"BlackHoleCluster;","method":null,"downstream_remote_address":"10.0.180.113:40216","bytes_sent":0,"request_duration":null,"response_flags":"UH","protocol":null,"authority":null,"path":null,"upstream_host":null,"requested_server_name":"cloudzero-agent-webhook-server-svc.platform-delivery.svc.cluster.local","bytes_received":0,"response_duration":null,"x_b3_parentspanid":null,"x_forwarded_for":null,"x_b3_traceid":null,"downstream_local_address":"172.20.70.217:443","response_tx_duration":null,"response_code":0,"request_id":null,"start_time":"2025-09-24T18:35:18.852Z","duration":0,"upstream_local_address":null,"connection_termination_details":null,"x_b3_spanid":null}

However, this was odd, as this is in-cluster traffic. We have seen similar errors with other third-party deployments. Looking closer, we noticed that the port name for the Agent Webhook Server Service was hardcoded to http:
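For illustration, the rendered Service looked roughly like the sketch below; the targetPort and other surrounding fields are assumptions, and only the hardcoded `name: http` is the point:

```yaml
# Illustrative sketch of the rendered webhook server Service (not the actual chart output).
apiVersion: v1
kind: Service
metadata:
  name: cloudzero-agent-webhook-server-svc
spec:
  ports:
    - name: http        # hardcoded in the chart; Istio infers plain HTTP from this name
      port: 443         # matches the :443 seen in the istio-proxy log above
      targetPort: 8443  # assumed; the actual container port may differ
      protocol: TCP
```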

Istio uses the name of a Service port to determine which protocol to use when handling the traffic: https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/#explicit-protocol-selection

However, since the backfill CronJob is attempting to establish a TLS connection with the Service, this hardcoded http value was causing failures. Simply editing the port name to https and triggering a new backfill job worked immediately.
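Concretely, the manual fix was just renaming the port, following Istio's `name: <protocol>[-<suffix>]` convention (a sketch, with the same assumed fields as above):

```yaml
spec:
  ports:
    - name: https       # Istio now treats traffic on this port as TLS and passes it through
      port: 443
      targetPort: 8443  # assumed, as above
      protocol: TCP
```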

We chose this route instead of disabling Istio injection on the job pod, as we would prefer to keep injection enabled.

What

This adds a simple change that allows the Service port to be renamed via .Values.insightsController.service.portName, while keeping the original hardcoded value as the default.
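A minimal sketch of what that template change could look like (the template file path and surrounding fields are assumptions, not the chart's actual layout):

```yaml
# templates/webhook-server-svc.yaml (hypothetical path; sketch only)
spec:
  ports:
    - name: {{ .Values.insightsController.service.portName | default "http" }}
      port: 443
      targetPort: 8443
      protocol: TCP
```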

How Tested

First, we confirmed the issue manually in our cluster: we edited the port name on the Service and restarted the job, and it immediately began working.

I further confirmed that the chart change works by creating a simple YAML overrides file that overrode the port name, running helm template -f overrides.yaml ..., and inspecting the Service in the resulting manifest.
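For reference, an overrides file along these lines would exercise the new value when rendered with helm template -f overrides.yaml ... (the value path comes from the description above; the https name is just one choice):

```yaml
# overrides.yaml (illustrative)
insightsController:
  service:
    portName: https
```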

@thandleman-r7 requested a review from a team as a code owner on September 24, 2025 20:39
@thandleman-r7 (Author)

I only just noticed this section in the docs: https://github.com/Cloudzero/cloudzero-agent/blob/develop/helm/docs/istio.md#additional-configuration-options

Seems related, but from what we can see, the backfill doesn't get tripped up once the name of the port is https.

@jake-cloudzero (Contributor) commented Sep 25, 2025

@thandleman-r7 Thank you so much for not only the in-depth explanation of the problem, but also for providing a solution. We all really appreciate this approach.

This is really helpful for us as we dive deeper into making our chart more compatible with Istio, as quite a few of our customers use it. I was unaware of Istio's semantics when selecting a protocol for a particular port, and in the documentation you provided, this section looks particularly interesting:

> This can be configured in two ways:
>
> 1. By the name of the port: `name: <protocol>[-<suffix>]`.
> 2. In Kubernetes 1.18+, by the `appProtocol` field: `appProtocol: <protocol>`.
>
> If both are defined, `appProtocol` takes precedence over the port name.

While changing the port name in this instance will most likely be fine, there are other places in which we rely on various port names being consistent, and we would want to avoid changing these unless completely necessary. Have you tried adding appProtocol: "https" to the service?
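For concreteness, a minimal sketch of that alternative on the Service port (surrounding fields are illustrative, matching the earlier sketches):

```yaml
spec:
  ports:
    - name: http          # existing name stays untouched
      appProtocol: https  # explicit protocol hint; takes precedence over the name (K8s 1.18+)
      port: 443
      targetPort: 8443
      protocol: TCP
```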

If that works, we would love to get this change in sooner rather than later, and we would probably explicitly define the protocol across all of our services.

@thandleman-r7 (Author)

I have not; I will test this out shortly.

@thandleman-r7 force-pushed the configure-webhhook-server-svc-port-name branch from 73b95f0 to d161d9a on September 25, 2025 22:04
@thandleman-r7 (Author)

Confirmed that setting appProtocol: https also seemed to work. I have pushed the change up.

@thandleman-r7 changed the title from "Allow configuration of the port name for the webhook server Service." to "Explicitly set appProtocol for the webhook server service" on Sep 25, 2025
@thandleman-r7 (Author)

@jake-cloudzero Are you waiting for anything on my side?

@jake-cloudzero (Contributor)

@thandleman-r7 Looks good from our side. We have some CI issues we need to work out before our checks will run for forks; we are working to get those done. In the meantime, we are going to get this through today: #485

@jake-cloudzero (Contributor)

Update: this was merged into 1.2.8.
