The kafka_consumer Datadog module causes a large increase in network traffic in and out of the instances. We identified it as the cause by disabling the integration and restarting the Datadog agent, after which the traffic dropped. With the network diagnostic tool ‘nethogs’ we can see the Datadog agent process receiving a large amount of data (up to +500MiB/s). We have been unable to reproduce the issue in our lab cluster, although that could be due to a lack of test data, consumer groups, or general load.
We’ve encountered the issue on these versions:
We are running Confluent Kafka 7.7.1 (Apache Kafka® 3.7)
Datadog agent_version: 7.57.2
kafka_consumer (4.6.0)
We’ve tried updating to the latest versions and are still seeing the same issue:
Datadog agent_version: 7.58.2
kafka_consumer (5.0.0)
I’ve included some examples of the kafka_consumer output from datadog-agent status for instances where the traffic is abnormally high:
Cluster1:
kafka_consumer (4.6.0)
----------------------
Instance ID: kafka_consumer:5835d1972801c612 [OK]
Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
Total Runs: 123
Metric Samples: Last Run: 40,000, Total: 4,873,019
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 7.449s
Last Execution Date : 2024-11-03 15:24:46 UTC (1730647486000)
Last Successful Execution Date : 2024-11-03 15:24:46 UTC (1730647486000)
Cluster2:
kafka_consumer (5.0.0)
----------------------
Instance ID: kafka_consumer:98057d281cbc9d57 [WARNING]
Configuration Source: file:/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml
Total Runs: 1
Metric Samples: Last Run: 50,000, Total: 50,000
Events: Last Run: 0, Total: 0
Service Checks: Last Run: 0, Total: 0
Average Execution Time : 32.77s
Last Execution Date : 2024-11-05 13:16:57 UTC (1730812617000)
Last Successful Execution Date : 2024-11-05 13:16:57 UTC (1730812617000)
Warning: Context limit reached. Skipping highwater offset collection.
Warning: Discovered 75257 metric contexts - this exceeds the maximum number of 50000 contexts permitted by the
check. Please narrow your target by specifying in your kafka_consumer.yaml the consumer groups, topics
and partitions you wish to monitor.
The issue here is related to the kafka_consumer data volume. You are collecting every consumer group exposed by the cluster using just one Datadog agent, and the check discovers over 75,000 metric contexts, so max_partition_contexts would have to be increased further. But that would mean higher resource consumption for the agent and higher network traffic.
One way to reduce this is to specify the consumer groups to monitor by name (regex or exact match).
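As a rough sketch of that approach, the instance config could list consumer groups explicitly or by regex instead of discovering everything. The broker address, group names, and topic names below are placeholders. The option names (`consumer_groups`, `consumer_groups_regex`, `monitor_unlisted_consumer_groups`) are the ones that appear in the integration's example config as far as I know, so verify them against the conf.yaml.example shipped with your kafka_consumer version.

```yaml
instances:
  - kafka_connect_str: localhost:9092        # placeholder broker address
    # Do not auto-discover every consumer group on the cluster.
    monitor_unlisted_consumer_groups: false
    # Exact-match groups (hypothetical names).
    consumer_groups:
      orders-service:
        orders: []                           # assumption: empty list means all partitions
    # Regex-matched groups and topics (hypothetical patterns).
    consumer_groups_regex:
      'payments-.*':
        'payments\..*': []
```

Narrowing the groups this way should lower the number of metric contexts per run, which is what the "Context limit reached" warning asks for. Whether it also brings the network traffic back down is something to confirm on the affected cluster.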
We didn't see this issue in the affected cluster prior to upgrading from Confluent Kafka 7.3 to Confluent Kafka 7.7.1.
The kafka_consumer.yaml remains unchanged, as does the number of metrics.
There could be a change within Kafka that affected the consumer group metrics and the network usage of collecting them. The integration's network traffic increased by more than 10x after the upgrade to 7.7.1, but I have not been able to identify what kind of change would be relevant.
Hi,
We recently encountered an issue with one of our customers' Kafka clusters relating to the Datadog kafka_consumer integration (https://github.com/DataDog/integrations-core/tree/master/kafka_consumer).
/etc/datadog-agent/conf.d/kafka_consumer.d/conf.yaml: