Cryptic SDS Error in Spire Agent #5709
Comments
Could you check if there are any attestation failures? This issue might be related to #5638.
Hey @MarcosDY, no attestation failures as far as I could tell. The additional logs were not errors or warnings, unfortunately, and I'm not sure how much help they would be, but let me see about getting them.

To switch gears: this is in fact related to the issue you posted. The liveness probe exists because of #5638. I also happen to have a permanent fix for that issue. I've tested it on a k8s cluster, but let me summarize the problem and the fix.

The original problem in #5638 is that the spire-agent fails to retrieve federated bundles if, for example, it loses contact with the kubelet. When contact with the kubelet is restored, the agent gives the workload the bundle, but not the federated bundle. The culprit is in this function here, where the agent returns the bundle but not the federated bundle, because the workload triggering the SDS endpoint does not have its identity assigned while the agent has lost contact with the kubelet. You can verify this by doing the following in a k8s cluster in an EKS environment (or similar):
The fix would involve:
I've tested something like this on a k8s cluster and it works. If this approach sounds acceptable, I can open a PR here too. A solution for that problem would really help, as the only guard we could think of is a liveness probe that checks whether the federated bundle is missing. But a liveness probe, or any similar mechanism that simply forces re-attestation by restarting, is not a good long-term solution and leads to problems of its own (likely including the one originally reported in this issue).
spire-helm-charts-hardened version: 0.21.0 and 0.15.1
spire-agent version: v1.9.6 and v1.8.4
spire-server version: v1.9.6 and v1.8.4
subsystem: spire-agent
We're running Spire in tandem with Istio in an EKS environment. Our Istio sidecar proxies are connected to spire-agents via a workload socket. We use Spire to retrieve x509 certs/trust bundles as well as federated trust bundles.
We've also configured a liveness probe on our istio-proxies with the following command:
This command fetches the Envoy config dump and checks that the federated trust domain is present in it. We have this check to guard against problems we had seen where the federated trust bundle was not appended for certain workloads.
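For readers who want to see the shape of such a check, here is a minimal Python sketch of the logic described above. It is not the probe command we actually run: the function names, the sample JSON layout, and the trust domain names are all hypothetical, and it assumes the config dump has already been fetched as a JSON string.

```python
import json
from typing import Any


def contains_value(node: Any, needle: str) -> bool:
    """Return True if any string key or value in the parsed JSON tree contains `needle`."""
    if isinstance(node, str):
        return needle in node
    if isinstance(node, dict):
        return any(
            contains_value(key, needle) or contains_value(value, needle)
            for key, value in node.items()
        )
    if isinstance(node, list):
        return any(contains_value(item, needle) for item in node)
    # Numbers, booleans and null cannot hold a trust domain name.
    return False


def has_federated_trust_domain(config_dump: str, trust_domain: str) -> bool:
    """Hypothetical helper: True if `trust_domain` appears anywhere in the dump.

    An unparseable dump is treated as a failed check.
    """
    try:
        parsed = json.loads(config_dump)
    except json.JSONDecodeError:
        return False
    return contains_value(parsed, trust_domain)


if __name__ == "__main__":
    # Illustrative dump only; the real Envoy config dump layout is not assumed here.
    sample = (
        '{"configs": [{"secrets": [{"trust_domains": '
        '[{"name": "example.org"}, {"name": "partner.example"}]}]}]}'
    )
    print(has_federated_trust_domain(sample, "partner.example"))
```

Because this is a plain substring match, a trust domain whose name is contained in another name would also match; a stricter check would compare against the specific field that carries the trust domain.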
We've noticed this liveness probe failing for workloads whenever an error such as the one below occurs in the corresponding spire-agent pods:
We've noticed that very few spire-agent pods (1-2 in a 15-node cluster, for example) report this error regularly; however, it does seem to be correlated with the failures. The nodes whose spire-agent reports this error do not show any abnormal CPU/memory usage, nor do we see anything abnormal in CPU/memory usage from the workloads themselves. Would anyone know what this error means and why it would appear regularly for certain spire-agents/nodes?