xDS: v1.66.2 and above break most xDS client gRPC requests #7691
Comments
Would you be able to provide some logs for us, with the following env vars set? Thanks.
The error mentioned in the issue description happens when the connection to the management server fails. https://github.com/grpc/grpc-go/blob/ca4865d6dd6f3d8b77f1943ccfd6c9e78223912d/xds/internal/xdsclient/authority.go#L462C1-L463C1
I enabled grpc-go version 1.67.2 and here are the logs leading up to the error:
|
vs 1.65.0:
|
Thanks for the logs.
The above line seems to indicate to me that the server is closing the stream for whatever reason. It is logged when the xDS client runs into an error while attempting to read from the ADS stream.
Do your server logs contain anything useful?
Could this be related (I blanked out some of the IP parts)?
|
Interesting. Is this from your server? From the client logs, though, it seems like the client is not even sending a single ADS request, right?
Seems like it, since the working version shows an ADS request being sent. I'll ask our infrastructure team whether there are certain rate limits in play.
There were a few changes that could be of interest:
|
Thanks for the reference. I'll also see if I can locate the metric here that shows the ADS request rate, to confirm whether there's a link to this issue.
Adjusted the title, as there might be some pods that are still able to make requests. I will gather more information on the gRPC service and client pod counts.
Background: a QA GKE cluster with a limited number of pods (to keep cost down). There are 3 pod/service types that were upgraded, each with different service dependencies:
E.g. the grpc-go change to create separate xDS clients for each service dependency brings with it additional ADS request load on the istiod side (see the sketch after this summary).
Client-side observations:
istiod observations:
So in summary, it doesn't seem like the ADS rate-limit errors logged on the istiod side account for a near-100% failure rate of gRPC calls for xDS clients. Also, if the 100/s rate limit for ADS requests is exceeded temporarily (potentially due to the increased number of xDS clients introduced by the > 1.65.0 changes), I would expect the services to eventually recover (and get ADS responses). In this QA scenario they don't seem to be able to.
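To make the load argument concrete, here is a minimal Go sketch of a binary with two xDS-managed dependencies. The target names and port are hypothetical, not taken from the affected services; per the reports above and #7347, grpc-go >= 1.66 reportedly backs each such target with its own xDS client, and therefore its own ADS stream to istiod, whereas earlier versions shared a single client.

```go
// Hypothetical sketch only: two xDS-managed dependencies in one binary.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and xDS balancers
)

func main() {
	// Hypothetical service dependencies; real names come from the mesh.
	targets := []string{
		"xds:///orders.example.svc.cluster.local:8080",
		"xds:///payments.example.svc.cluster.local:8080",
	}
	conns := make([]*grpc.ClientConn, 0, len(targets))
	for _, t := range targets {
		cc, err := grpc.NewClient(t, grpc.WithTransportCredentials(insecure.NewCredentials()))
		if err != nil {
			log.Fatalf("creating channel for %s: %v", t, err)
		}
		conns = append(conns, cc)
	}
	// Per the reports in this thread, each of these channels is driven by its
	// own xDS client (and ADS stream) on grpc-go >= 1.66.
	for _, cc := range conns {
		defer cc.Close()
	}
}
```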
Thanks for the detailed report. So, if the stream errors seen by gRPC are not because of rate-limit errors, do the istiod logs have anything useful about why it is closing streams? gRPC is seeing EOF on stream reads, so there is not anything else useful in the gRPC logs about why the stream failed. Is there any way you can provide us with a repro that we can use on our side? I personally don't have much experience with configuring and running istiod.
The istiod logs are not showing much besides rate-limit warnings (no errors). What is interesting, though, is what's not logged: I see "ADS: new delta connection for node:..." for services that run a plain istio-proxy sidecar, but not for the service that has problems (it uses istio-proxy in agent mode). I will attempt to reproduce the problem with some dummy services/clients and minikube.
That is interesting. gRPC does not support the delta variant of the xDS protocol; it currently only supports the SotW (state-of-the-world) variant. So maybe the delta connection is originating from something else?
I compared the istio-proxy logs for agent vs "normal" mode and found this difference:
I'm going to try running the service with a plain proxy and 1.67.1 to see what happens; at least it would narrow the problem down.
Switching to non-agent mode opened another can of worms, so I'm sticking with agent mode for now (this is also what we prefer, as these are very high-volume services that get bogged down by the overhead of the proxy, which uses up more CPU than the service container itself processing every byte going in and out). The agent only logs these lines:
vs the v1.65.0 scenario shows more work being done:
istio/istio#37152 has a comment on this:
This sounds connected to the change that creates different xDS clients per service name connected to?
One issue that might be happening is that istiod is not expecting multiple xDS connections with the same node ID; in our case it's set as:
|
In short, the istio-agent seems to be closing the xDS connections beyond the first one due to the single node_id used. #7347 requires quite some understanding of both grpc-go and istio xDS to know how multiple named clients would work when the agent doesn't seem to be aware of the "name" part of the xDS clients. This is the typical startup behavior of the istio-agent, where the first xDS connection works but all subsequent attempts get closed (https://github.com/istio/istio/blob/270710c2ec6495770a6f30e6616011719e580162/pkg/istio-agent/xds_proxy.go#L255), hence the gRPC client-call xds EOF failures:
|
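To illustrate the suspected behavior, here is a deliberately simplified, hypothetical Go sketch (this is not istio's actual code, and the node ID string is made up) of a proxy that admits only one xDS stream per node ID and closes any later one; on the gRPC side the rejected client would see exactly the EOF reported above.

```go
// Hypothetical illustration only -- not istio's actual implementation.
package main

import (
	"fmt"
	"sync"
)

type stream struct{ id int }

func (s *stream) Close() { fmt.Printf("closing stream %d\n", s.id) }

type xdsProxy struct {
	mu     sync.Mutex
	active map[string]*stream // at most one active downstream stream per node ID
}

// register admits the first stream for a node ID and closes any subsequent one.
func (p *xdsProxy) register(nodeID string, s *stream) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if _, ok := p.active[nodeID]; ok {
		s.Close() // the rejected client observes this as EOF on its ADS stream
		return false
	}
	p.active[nodeID] = s
	return true
}

func main() {
	p := &xdsProxy{active: make(map[string]*stream)}
	node := "sidecar~10.0.0.1~example-pod.example-ns~example-ns.svc.cluster.local" // made-up node ID
	fmt.Println(p.register(node, &stream{id: 1})) // true: first connection is kept
	fmt.Println(p.register(node, &stream{id: 2})) // false: second connection is closed
}
```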
The "name" is completely local to grpc and is not part of the xDS protocol. So, the name will not be communicated to the xDS peer. Do you know why the istio-agent closes xDS clients beyond the first one? Can it be configured to allow multiple xDS clients with the same node ID? All xDS clients from the same gRPC binary should be using the same node ID. |
Not sure; it might just be an assumption that a pod should only open one xDS connection during its lifetime. Opening a ticket on the istio project might clarify this or provide some recommendations, linking back to this ticket. edit: istio/istio#53532
Thanks for filing the issue with istio. We do feel that it is a bug on the istio side to close xDS connections from gRPC other than the first one. Let's see what they say.
This issue is a show-stopper for the grpc-go upgrade. Is there a workaround?
What version of gRPC are you using?
v1.65.0 works; 1.66.2/1.67.0 cause most gRPC xDS-based requests to fail with an error.
What version of Go are you using (go version)?
1.22.6
What operating system (Linux, Windows, …) and version?
Linux (Google GKE)
What did you do?
We have Go service pods that call out to other services using the istio agent (inject.istio.io/templates: grpc-agent), prefix the service URLs with "xds:///", and import _ "google.golang.org/grpc/xds" (a minimal sketch of this setup is shown below).
istiod-1-22-4
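For context, here is a minimal sketch of the client setup described above. The target name and port are hypothetical; the real deployment relies on the istio agent to supply the xDS bootstrap, and production code would use a generated stub rather than the health client.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	xdscreds "google.golang.org/grpc/credentials/xds"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	_ "google.golang.org/grpc/xds" // registers the xds:/// resolver and xDS balancers
)

func main() {
	// xDS-provided transport security, falling back to plaintext if the
	// management server does not configure any security.
	creds, err := xdscreds.NewClientCredentials(xdscreds.ClientOptions{
		FallbackCreds: insecure.NewCredentials(),
	})
	if err != nil {
		log.Fatalf("creating xDS credentials: %v", err)
	}

	// Hypothetical xDS-managed target; real names come from the mesh.
	cc, err := grpc.NewClient("xds:///orders.example.svc.cluster.local:8080",
		grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatalf("creating channel: %v", err)
	}
	defer cc.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Any RPC would do; a health check keeps the example self-contained.
	// With the regression described in this issue, calls fail with:
	// rpc error: code = Unavailable desc = xds: error received from xDS stream: EOF
	if _, err := healthpb.NewHealthClient(cc).Check(ctx, &healthpb.HealthCheckRequest{}); err != nil {
		log.Printf("RPC failed: %v", err)
	}
}
```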
What did you expect to see?
Successful gRPC requests load balanced using xDS
What did you see instead?
90% gRPC failure rate with error: rpc error: code = Unavailable desc = xds: error received from xDS stream: EOF