[BUG] MetricsRetriever does not handle partial metric retrieval failures gracefully #28497

Open
adammw opened this issue Aug 15, 2024 · 0 comments

Contributor

adammw commented Aug 15, 2024

Agent Environment
Datadog Cluster Agent v0.55.3

Describe what happened:
We often see bursts of "Unexpected error, query data not found in result" errors in our logs, covering various metric queries that all share the same timestamp. These generate FailedGetExternalMetric events in Kubernetes, which in turn fire alerts to the engineering teams responsible for the relevant metrics. The metrics_retriever code says "this should never happen": https://github.com/DataDog/datadog-agent/blob/7.55.x/pkg/clusteragent/autoscaling/externalmetrics/metrics_retriever.go#L190-L192. However, I suspect it occurs in a partial failure scenario, where some metric queries succeed and others fail, because the code assumes that only global (i.e. total) errors can occur.

In order to investigate the issue further, I deployed a custom build of Datadog Cluster Agent with the following patch to our staging environment, which revealed the partial failures were due to rate-limiting:

API error 429 Too Many Requests: {"status":"error","code":429,"errors":["Too many requests"],"statuspage":"http://status.datadoghq.com","twitter":"http://twitter.com/datadogops","email":"[email protected]"}

Describe what you expected:
The Datadog Cluster Agent should log the error it receives from the API when a partial failure occurs, and handle the condition gracefully by retrying later without raising a FailedGetExternalMetric event (or, if it does raise one, with rate limiting given as the reason so that we can route it differently).

Steps to reproduce the issue:
Unknown, as it requires a combination of a successful metric query and a server error. Outside of Datadog, it can likely only be reproduced in a test.

Additional environment details (Operating System, Cloud provider, etc):
Support Ticket: https://help.datadoghq.com/hc/en-us/requests/1808765
Operating System: Ubuntu 22.04
Kubernetes Version: v1.29.6
Cloud Provider: AWS
