Agent Environment
public.ecr.aws/datadog/agent:latest-jmx — the version updates to latest every time the agent boots, which happens daily or more often.
Describe what happened:
Yesterday we discovered that the Datadog Agent has been responsible for our near-daily system failures over the past six months, costing us untold amounts of engineering time and almost certainly losing us a large number of customers, along with their trust.
We have repeatedly observed that seemingly random iowait spikes would spell certain death for our user-facing containers: first the ECS containers would lock up, then the underlying EC2 instance, often requiring manual termination.
Before yesterday we could not isolate the cause. We naively assumed it was our own code, because we had no reason to suspect the Datadog Agent and, furthermore, had no process-level visibility. We incorrectly assumed that our comprehensive host and JVM dashboards, along with logs and traces, would tell us everything we needed to know.
Over the past few months we worked to eliminate all possible causes of iowait within our user-facing containers, including nearly all disk usage. We migrated from gp2 to gp3 volumes and provisioned them with 500 MiB/s throughput and 5000 IOPS (far exceeding the previous configuration). The iowait spikes continued, and our site continued to go down.
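For reference, the gp2-to-gp3 change was a live volume modification; below is a minimal sketch of the equivalent call using boto3 (the volume ID and region are placeholders, and the throughput/IOPS values match the configuration described above).

```python
# Minimal sketch (not our production tooling): modify an existing EBS volume
# in place to gp3 with higher throughput and IOPS via the EC2 ModifyVolume API.
# The volume ID and region below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # placeholder volume ID
    VolumeType="gp3",
    Throughput=500,  # MiB/s
    Iops=5000,
)
print(resp["VolumeModification"]["ModificationState"])
```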
Once we enabled process-level visibility via the Datadog Agent configuration yesterday, we saw that the iowait was caused by sudden, massive (>6000 IOPS) disk reads from the Datadog Agent itself. We raised the volumes to 10000 IOPS this morning, and even that ceiling is not high enough.
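For anyone trying to reproduce or confirm this kind of per-process attribution without Datadog, a rough cross-check is to sample per-process read counters directly on the host. The sketch below uses psutil, assumes root access on the EC2 instance, and only approximates IOPS (read_count counts read syscalls rather than physical disk operations).

```python
# Rough cross-check, not the Datadog feature: approximate per-process read
# ops/sec by sampling psutil io_counters() twice. Linux only; run as root so
# other processes' /proc/<pid>/io counters are readable.
import time
import psutil

INTERVAL = 5  # seconds between samples (arbitrary choice)

def snapshot():
    counts = {}
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            counts[proc.info["pid"]] = (proc.info["name"], proc.io_counters().read_count)
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue
    return counts

before = snapshot()
time.sleep(INTERVAL)
after = snapshot()

rates = []
for pid, (name, reads_after) in after.items():
    if pid in before:
        rates.append((name, pid, (reads_after - before[pid][1]) / INTERVAL))

# Show the ten heaviest readers over the sampling interval.
for name, pid, rate in sorted(rates, key=lambda r: r[2], reverse=True)[:10]:
    print(f"{name} (pid {pid}): ~{rate:.0f} read ops/s")
```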
While the Datadog Agent has been tremendously helpful to us, we consider this iowait issue a serious defect.
How can we resolve this issue?
Thanks for your time.
Describe what you expected:
Steps to reproduce the issue:
Additional environment details (Operating System, Cloud provider, etc):
ECS-optimized AMI on EC2
Thank you for reaching out and for providing such a detailed report. We understand how critical this issue is for your operations, and we sincerely regret the challenges it has caused. We truly appreciate the time and effort you’ve invested in troubleshooting and sharing your findings with us.
To help us investigate why the Agent is generating high I/O in your environment, we’ll need some additional information. Could you please contact Datadog support and provide further details about your setup?
This will allow us to better analyze the issue and attempt to reproduce it. Once you’ve reached out to support, please add a comment here so we can track the investigation.
Thanks again for your patience—we’re committed to helping you resolve this.