[ArgoCD] Openmetrics integration in Datadog times out after 10 seconds #17599

Open
ricardojdsilva87 opened this issue May 20, 2024 · 0 comments

ricardojdsilva87 commented May 20, 2024

Hello,

We get the following error when the Datadog Agent tries to scrape the ArgoCD application controller's OpenMetrics endpoint:
[image: screenshot of the error]

Additional environment details (Operating System, Cloud provider, etc):

Steps to reproduce the issue:
Pod configuration, set through the official ArgoCD Helm chart:

podAnnotations:
  ad.datadoghq.com/application-controller.checks: |
    {
      "argocd": {
        "instances": [
          {
            "app_controller_endpoint": "http://%%host%%:8082/metrics"
          }
        ]
      }
    }

This is the same configuration described in the Datadog documentation.

Describe the results you expected:
With ArgoCD versions above 2.9.6 and below 2.11.0, we get the error shown above.
[image: screenshot of the error]

I've tried adding the prometheus_timeout setting to the OpenMetrics configuration, as described in the documentation:
https://docs.datadoghq.com/integrations/guide/prometheus-host-collection/

With the same configuration, all the expected metrics are sent to Datadog and are visible in the default ArgoCD dashboard. After changing ArgoCD to a version between v2.9.7 and v2.11.0, the error starts to appear and no metrics reach Datadog.

The Datadog Agent version is v7.53.0.
After setting prometheus_timeout to 30, the same error still appears, and the message still says the request timed out after 10s, so the setting seems to have no effect.
Is there something I'm missing? Also, I would not expect a change in ArgoCD version to stop metrics from being sent.
I'll run some more tests to check whether any intermediate version works correctly.
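
One thing that may be worth checking is whether the check reads a different option name for its HTTP timeout. This is an assumption, not something confirmed in this issue: checks built on the newer OpenMetrics base class are generally configured with a generic per-instance `timeout` option (in seconds), while `prometheus_timeout` belongs to the legacy implementation. The Python sketch below only renders the annotation value with such a `timeout` key added. The helper name `build_check_annotation` is hypothetical, and whether the argocd check honors `timeout` should be verified against the integration's example configuration.

```python
import json


def build_check_annotation(endpoint: str, timeout_seconds: int) -> str:
    """Render the JSON value for the ad.datadoghq.com/<container>.checks annotation."""
    checks = {
        "argocd": {
            "instances": [
                {
                    # Same endpoint key as in the configuration above.
                    "app_controller_endpoint": endpoint,
                    # Assumption: generic HTTP timeout option, in seconds.
                    "timeout": timeout_seconds,
                }
            ]
        }
    }
    return json.dumps(checks, indent=2)
```

Calling `build_check_annotation("http://%%host%%:8082/metrics", 30)` returns a JSON document that can be pasted under the `ad.datadoghq.com/application-controller.checks` annotation.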

FYI, I found this issue while investigating another one in ArgoCD itself; more information can be found here.

Thanks!

UPDATE
Hello,
Just to add some more information: the issue seems to happen when the parameter controller.sharding.algorithm: "round-robin" (documented here) is added to the ArgoCD configuration.
I suppose this mode might generate a lot more metrics than the legacy configuration, which could be what causes the timeout after 10 seconds.
If there is a setting to increase this timeout, I can try it out in the configuration.
Thanks
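
To test the guess that the endpoint returns many more metrics with round-robin sharding, the response time and size can be measured directly. This is a minimal sketch, not part of the Datadog Agent: the function names are hypothetical, and it assumes the controller's metrics port is reachable (for example through a port-forward to port 8082).

```python
import time
import urllib.request


def summarize_exposition(text: str) -> dict:
    """Count sample lines and distinct metric names in Prometheus text output."""
    samples = 0
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        samples += 1
        # The metric name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return {"samples": samples, "metric_names": len(names)}


def timed_scrape(url: str, timeout: float = 60.0) -> dict:
    """Fetch a metrics endpoint and report duration, size, and sample counts."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        raw = resp.read()
    elapsed = time.monotonic() - start
    summary = summarize_exposition(raw.decode("utf-8", errors="replace"))
    summary["seconds"] = round(elapsed, 2)
    summary["bytes"] = len(raw)
    return summary
```

Running `timed_scrape("http://localhost:8082/metrics")` once with the legacy algorithm and once with round-robin would show whether the scrape really takes longer than 10 seconds and how much the sample count grows.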
