
[BUG] Datadog Agent causing massive iowait. Near-daily system failures over the past 6 months #33804

Open
alexandergunnarson opened this issue Feb 6, 2025 · 2 comments



alexandergunnarson commented Feb 6, 2025

Agent Environment

public.ecr.aws/datadog/agent:latest-jmx (the version updates to the latest every time the agent boots, which happens daily or more often)

Describe what happened:

Yesterday we found that the Datadog Agent has been responsible for our near-daily system failures over the past 6 months, costing us untold amounts of engineering time and certainly losing us a large number of customers, as well as their trust.

We’ve repeatedly observed that seemingly random iowait spikes would spell certain death for our user-facing containers: first the ECS containers would lock up, then the underlying EC2 instance, often requiring manual termination.

Before yesterday, we could not isolate the cause. We naively assumed it was our code, because we had no reason to suspect Datadog Agent, and furthermore, had no process-level visibility. We incorrectly assumed that our comprehensive host and JVM dashboards, along with logs and traces, would tell us all we needed to know.

Over the past few months we’ve worked to eliminate all possible causes of iowait within our user-facing containers, including nearly all disk usage. We migrated from gp2 to gp3 EBS volumes and provisioned them with 500 MiB/s throughput and 5000 IOPS (far exceeding the previous configuration). The iowait problem persisted and our site continued to go down.
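For context, the volume change amounted to something like the following (a sketch using the AWS CLI; the volume ID is a placeholder, and the values match the figures above):

```sh
# Migrate an existing EBS volume from gp2 to gp3 and raise its limits.
# vol-0123456789abcdef0 is a placeholder volume ID.
aws ec2 modify-volume \
  --volume-id vol-0123456789abcdef0 \
  --volume-type gp3 \
  --throughput 500 \
  --iops 5000
```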

Once we turned on process-level visibility via the Datadog Agent configuration yesterday, we realized that the iowait was caused by sudden, massive (>6000 IOPS) disk reads by the Datadog Agent. We raised IOPS to 10,000 this morning, and even that ceiling is not high enough.
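For anyone retracing this, the process-level visibility mentioned above can be enabled on the Agent container, for example via an environment variable in its ECS task definition (a sketch; the container name is a placeholder, and the exact setting may vary by Agent version):

```json
{
  "name": "datadog-agent",
  "image": "public.ecr.aws/datadog/agent:latest-jmx",
  "environment": [
    { "name": "DD_PROCESS_AGENT_ENABLED", "value": "true" }
  ]
}
```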

While Datadog Agent has been tremendously helpful to us, we consider this iowait issue a serious defect.

How can we resolve this issue?

Thanks for your time.

Describe what you expected:

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc):

ECS-optimized AMI on EC2

@clamoriniere
Contributor

Hi @alexandergunnarson,

Thank you for reaching out and for providing such a detailed report. We understand how critical this issue is for your operations, and we sincerely regret the challenges it has caused. We truly appreciate the time and effort you’ve invested in troubleshooting and sharing your findings with us.

To help us investigate why the Agent is generating high I/O in your environment, we’ll need some additional information. Could you please contact Datadog support and provide:

  • An Agent flare (a sample command is sketched below this list)
  • The ECS Task definition used to deploy the Agent
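For reference, a flare can be generated from inside the running Agent container along these lines (a sketch; the container name and support case ID are placeholders):

```sh
# Generate an Agent flare from the running container and attach it to a support case.
# "datadog-agent" is a placeholder container name; 1234567 is a placeholder case ID.
docker exec -it datadog-agent agent flare 1234567
```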

This will allow us to better analyze the issue and attempt to reproduce it. Once you’ve reached out to support, please add a comment here so we can track the investigation.

Thanks again for your patience—we’re committed to helping you resolve this.

Best,
Cedric

@alexandergunnarson
Author

Thanks @clamoriniere. Just provided that info in the email thread.
