
[BUG] Datadog Agent causing massive iowait. Near-daily system failures over the past 6 months #33804

Open
alexandergunnarson opened this issue Feb 6, 2025 · 2 comments



alexandergunnarson commented Feb 6, 2025

Agent Environment

public.ecr.aws/datadog/agent:latest-jmx (the version updates to the latest every time the agent boots, which happens daily or more often)

Describe what happened:

Yesterday we found that the Datadog Agent has been responsible for our near-daily system failures over the past 6 months, costing us untold amounts of engineering time and certainly losing us a large number of customers, as well as their trust.

We’ve repeatedly observed that seemingly random iowait spikes would spell certain death for our user-facing containers: first the ECS containers would lock up, then the underlying EC2 instance, often requiring manual termination.

Before yesterday, we could not isolate the cause. We naively assumed it was our code, because we had no reason to suspect Datadog Agent, and furthermore, had no process-level visibility. We incorrectly assumed that our comprehensive host and JVM dashboards, along with logs and traces, would tell us all we needed to know.

Over the past few months we’ve worked to eliminate all possible causes of iowait within our user-facing containers, including nearly all disk usage. We migrated from gp2 to gp3 EBS volumes and provisioned them with 500 MiB/s throughput and 5000 IOPS (far exceeding the previous configuration). The iowait problem persisted and our site continued to go down.
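For context, the volume change amounted to something like the following (a sketch using the AWS CLI; the volume ID is a placeholder, and the values match the figures above):

```sh
# Migrate an existing EBS volume from gp2 to gp3 and raise its limits.
# vol-0123456789abcdef0 is a placeholder volume ID.
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type gp3 \
  --throughput 500 \
  --iops 5000
```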

Once we turned on process-level visibility via the Datadog Agent configuration yesterday, we realized that the iowait was caused by sudden, massive (>6000 IOPS) disk reads by the Datadog Agent. We raised IOPS to 10,000 this morning, and even that ceiling is not high enough.
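For anyone retracing this, the process-level visibility mentioned above can be enabled on the Agent container, for example via an environment variable in its ECS task definition (a sketch; the container name is a placeholder, and the exact setting may vary by Agent version):

```json
{
  "name": "datadog-agent",
  "image": "public.ecr.aws/datadog/agent:latest-jmx",
  "environment": [
    { "name": "DD_PROCESS_AGENT_ENABLED", "value": "true" }
  ]
}
```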

While Datadog Agent has been tremendously helpful to us, we consider this iowait issue a serious defect.

How can we resolve this issue?

Thanks for your time.

Describe what you expected:

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

ECS-optimized AMI on EC2

@clamoriniere
Contributor

Hi @alexandergunnarson,

Thank you for reaching out and for providing such a detailed report. We understand how critical this issue is for your operations, and we sincerely regret the challenges it has caused. We truly appreciate the time and effort you’ve invested in troubleshooting and sharing your findings with us.

To help us investigate why the Agent is generating high I/O in your environment, we’ll need some additional information. Could you please contact Datadog support and provide:

  • An Agent flare (a sample command is sketched below this list)
  • The ECS Task definition used to deploy the Agent
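For reference, a flare can be generated from inside the running Agent container along these lines (a sketch; the container name and support case ID are placeholders):

```sh
# Generate an Agent flare from the running container and attach it to a support case.
# "datadog-agent" is a placeholder container name; 1234567 is a placeholder case ID.
docker exec -it datadog-agent agent flare 1234567
```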

This will allow us to better analyze the issue and attempt to reproduce it. Once you’ve reached out to support, please add a comment here so we can track the investigation.

Thanks again for your patience—we’re committed to helping you resolve this.

Best,
Cedric

@alexandergunnarson
Author

Thanks @clamoriniere. Just provided that info in the email thread.
