32 changes: 32 additions & 0 deletions clickhouse/assets/monitors/clickhouse_can_connect.json
@@ -0,0 +1,32 @@
{
"version": 2,
"created_at": "2026-03-23",
"last_updated_at": "2026-03-23",
"title": "ClickHouse cannot connect",
"tags": [
"integration:clickhouse"
],
"description": "The Datadog Agent is unable to connect to the monitored ClickHouse instance. This may indicate that ClickHouse is down, unreachable, or that the Agent's credentials are misconfigured.",
"definition": {
"message": "The Datadog Agent cannot connect to ClickHouse on {{host.name}}. Verify that the ClickHouse server is running and that the Agent configuration is correct.",
"name": "[ClickHouse] Cannot connect to {{host.name}}",
"options": {
"new_host_delay": 300,
"no_data_timeframe": 2,
"notify_audit": false,
"notify_no_data": false,
"renotify_interval": 0,
"thresholds": {
"critical": 1,
"ok": 1,
"warning": 1
},
"timeout_h": 0
},
"query": "\"clickhouse.can_connect\".over(\"*\").by(\"*\").last(2).count_by_status()",
"tags": [
"integration:clickhouse"
],
"type": "service check"
}
}
36 changes: 36 additions & 0 deletions clickhouse/assets/monitors/clickhouse_high_active_queries.json
@@ -0,0 +1,36 @@
{
"version": 2,
"created_at": "2026-03-23",
"last_updated_at": "2026-03-23",
"title": "ClickHouse active query count is high",
"tags": [
"integration:clickhouse"
],
"description": "A high number of simultaneously executing queries can saturate ClickHouse thread pools and degrade performance for all users. This monitor tracks the number of concurrently active queries.",
"definition": {
"message": "{{#is_alert}}\n\n## What's happening?\nClickHouse on {{host.name}} has a high number of concurrently active queries over the last 5 minutes. This may indicate query pile-up, long-running queries, or insufficient resources.\n\n## How to investigate\nCheck `system.processes` in ClickHouse for currently executing queries and identify any long-running or stuck queries.\n\n{{/is_alert}}",
"name": "[ClickHouse] High number of active queries on {{host.name}}",
"options": {
"escalation_message": "",
"include_tags": true,
"locked": false,
"new_host_delay": 300,
"no_data_timeframe": null,
"notify_audit": false,
"notify_no_data": false,
"renotify_interval": "0",
"require_full_window": true,
"thresholds": {
"critical": 200,
"warning": 100
},
"timeout_h": 0
},
"priority": null,
"query": "avg(last_5m):avg:clickhouse.query.active{*} > 200",

[P1] Scope these ClickHouse alerts per host

This query creates a single aggregated alert across all hosts (`avg:...{*}` without `by {host}`), but the monitor name and message use `{{host.name}}` and describe host-level symptoms. In a multi-host deployment, healthy nodes can average out one overloaded node, so the alert may never fire even when a specific host is unhealthy, and the `{{host.name}}` template variable may be empty in notifications. The same pattern appears in the other newly added query monitors (`clickhouse_high_query_failure_rate.json`, `clickhouse_high_thread_cpu_wait.json`, `clickhouse_merge_pool_saturation.json`, and `clickhouse_replica_delay.json`).
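
To make the averaging concern concrete, here is a minimal sketch of how a fleet-wide average can hide one overloaded node. The host names and metric values are invented for illustration and are not taken from this PR; only the critical threshold of 200 comes from `clickhouse_high_active_queries.json`. The two helper functions are hypothetical stand-ins that only approximate how an aggregated query and a grouped query would be evaluated.

```python
# Hypothetical per-host values for clickhouse.query.active.
# Host names and numbers are made up for illustration only.
active_queries = {
    "ch-node-1": 350,  # the overloaded node
    "ch-node-2": 20,
    "ch-node-3": 30,
    "ch-node-4": 40,
}

CRITICAL = 200  # critical threshold from clickhouse_high_active_queries.json


def aggregated_alert(values, threshold):
    """Approximate avg:metric{*}: a single series averaged over all hosts."""
    return sum(values.values()) / len(values) > threshold


def per_host_alerts(values, threshold):
    """Approximate avg:metric{*} by {host}: one evaluation per host."""
    return sorted(host for host, value in values.items() if value > threshold)


fleet_average = sum(active_queries.values()) / len(active_queries)
print(fleet_average)                               # 110.0
print(aggregated_alert(active_queries, CRITICAL))  # False
print(per_host_alerts(active_queries, CRITICAL))   # ['ch-node-1']
```

With these assumed numbers the fleet average is 110, well under the threshold, even though one node sits at 350. One possible fix is to group each query by host, for example `avg(last_5m):avg:clickhouse.query.active{*} by {host} > 200`, so that each host is evaluated separately and `{{host.name}}` is populated in notifications.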


"tags": [
"integration:clickhouse"
],
"type": "query alert"
}
}
36 changes: 36 additions & 0 deletions clickhouse/assets/monitors/clickhouse_high_query_failure_rate.json
@@ -0,0 +1,36 @@
{
"version": 2,
"created_at": "2026-03-23",
"last_updated_at": "2026-03-23",
"title": "ClickHouse query failure rate is high",
"tags": [
"integration:clickhouse"
],
"description": "A high rate of failed queries in ClickHouse can indicate problematic queries, resource exhaustion, or misconfigured query limits. This monitor tracks the per-second rate of failed queries to catch degradation early.",
"definition": {
"message": "{{#is_alert}}\n\n## What's happening?\nClickHouse on {{host.name}} has a high query failure rate over the last 5 minutes.\n\n## How to investigate\nCheck the ClickHouse system log (`system.query_log`) for error details and identify the failing queries.\n\n{{/is_alert}}",
"name": "[ClickHouse] High query failure rate on {{host.name}}",
"options": {
"escalation_message": "",
"include_tags": true,
"locked": false,
"new_host_delay": 300,
"no_data_timeframe": null,
"notify_audit": false,
"notify_no_data": false,
"renotify_interval": "0",
"require_full_window": true,
"thresholds": {
"critical": 5,
"warning": 1
},
"timeout_h": 0
},
"priority": null,
"query": "avg(last_5m):avg:clickhouse.query.failed.count{*}.as_rate() > 5",
"tags": [
"integration:clickhouse"
],
"type": "query alert"
}
}
36 changes: 36 additions & 0 deletions clickhouse/assets/monitors/clickhouse_high_thread_cpu_wait.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
{
"version": 2,
"created_at": "2026-03-23",
"last_updated_at": "2026-03-23",
"title": "ClickHouse thread CPU scheduling wait is high",
"tags": [
"integration:clickhouse"
],
"description": "CPU scheduling wait measures the percentage of time a ClickHouse thread was ready to execute but waiting to be scheduled by the OS. High values indicate CPU contention — the server has more runnable threads than available CPU cores — which causes query latency to increase even when queries are not I/O bound.",
"definition": {
"message": "{{#is_alert}}\n\n## What's happening?\nClickHouse threads on {{host.name}} are spending a high percentage of time waiting for CPU scheduling. This indicates CPU saturation and will cause query latency to degrade.\n\n## How to investigate\nCheck overall host CPU utilization. Review concurrent query load via `system.processes`. Consider scaling up CPU resources or reducing query concurrency.\n\n{{/is_alert}}\n\n{{#is_warning}}\n\nClickHouse thread CPU wait on {{host.name}} is elevated. Monitor for further increase.\n\n{{/is_warning}}",
"name": "[ClickHouse] High thread CPU scheduling wait on {{host.name}}",
"options": {
"escalation_message": "",
"include_tags": true,
"locked": false,
"new_host_delay": 300,
"no_data_timeframe": null,
"notify_audit": false,
"notify_no_data": false,
"renotify_interval": "0",
"require_full_window": true,
"thresholds": {
"critical": 80,
"warning": 50
},
"timeout_h": 0
},
"priority": null,
"query": "avg(last_5m):avg:clickhouse.thread.cpu.wait{*} > 80",
"tags": [
"integration:clickhouse"
],
"type": "query alert"
}
}
36 changes: 36 additions & 0 deletions clickhouse/assets/monitors/clickhouse_merge_pool_saturation.json
@@ -0,0 +1,36 @@
{
"version": 2,
"created_at": "2026-03-23",
"last_updated_at": "2026-03-23",
"title": "ClickHouse background merge pool is saturated",
"tags": [
"integration:clickhouse"
],
"description": "ClickHouse uses background merge operations to combine data parts in MergeTree tables. When the merge pool is saturated, new merges cannot be scheduled, leading to an accumulation of small parts that degrades query performance and increases storage overhead. This monitor tracks the number of active background merge tasks.",
"definition": {
"message": "{{#is_alert}}\n\n## What's happening?\nThe ClickHouse background merge pool on {{host.name}} has a high number of active merge tasks. This may indicate write pressure exceeding the merge throughput, or merges being blocked by long-running operations.\n\n## How to investigate\nCheck `system.merges` for currently running merges. Consider reducing insert frequency, increasing `background_pool_size`, or investigating mutations blocking merges.\n\n{{/is_alert}}\n\n{{#is_warning}}\n\nThe ClickHouse background merge pool on {{host.name}} is becoming saturated. Monitor for further increase.\n\n{{/is_warning}}",
"name": "[ClickHouse] Background merge pool is saturated on {{host.name}}",
"options": {
"escalation_message": "",
"include_tags": true,
"locked": false,
"new_host_delay": 300,
"no_data_timeframe": null,
"notify_audit": false,
"notify_no_data": false,
"renotify_interval": "0",
"require_full_window": true,
"thresholds": {
"critical": 14,
"warning": 10
},
"timeout_h": 0
},
"priority": null,
"query": "avg(last_5m):avg:clickhouse.background_pool.merges.task.active{*} > 14",
"tags": [
"integration:clickhouse"
],
"type": "query alert"
}
}
36 changes: 36 additions & 0 deletions clickhouse/assets/monitors/clickhouse_replica_delay.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
{
"version": 2,
"created_at": "2026-03-23",
"last_updated_at": "2026-03-23",
"title": "ClickHouse replica delay is high",
"tags": [
"integration:clickhouse"
],
"description": "Replica delay is the lag between when data is written to the primary shard and when it is replicated to replica nodes. High replica delay can lead to stale reads and indicate replication health issues. This monitor tracks the maximum absolute replica queue delay across replicated tables.",
"definition": {
"message": "{{#is_alert}}\n\n## What's happening?\nClickHouse replica delay on {{host.name}} has exceeded the critical threshold over the last 15 minutes. Replicas may be serving stale data.\n\n## How to investigate\nCheck `system.replicas` for tables with high `absolute_delay`. Look for network issues between replicas, or high write load on the primary shard.\n\n{{/is_alert}}\n\n{{#is_warning}}\n\nClickHouse replica delay on {{host.name}} is elevated. Monitor for further increase.\n\n{{/is_warning}}",
"name": "[ClickHouse] Replica delay is high on {{host.name}}",
"options": {
"escalation_message": "",
"include_tags": true,
"locked": false,
"new_host_delay": 300,
"no_data_timeframe": null,
"notify_audit": false,
"notify_no_data": false,
"renotify_interval": "0",
"require_full_window": true,
"thresholds": {
"critical": 300000,
"warning": 60000
},
"timeout_h": 0
},
"priority": null,
"query": "avg(last_15m):avg:clickhouse.replica.delay.absolute{*} > 300000",
"tags": [
"integration:clickhouse"
],
"type": "query alert"
}
}
8 changes: 8 additions & 0 deletions clickhouse/manifest.json
@@ -50,6 +50,14 @@
},
"dashboards": {
"ClickHouse Overview": "assets/dashboards/overview.json"
},
"monitors": {
"ClickHouse cannot connect": "assets/monitors/clickhouse_can_connect.json",
"ClickHouse query failure rate is high": "assets/monitors/clickhouse_high_query_failure_rate.json",
"ClickHouse active query count is high": "assets/monitors/clickhouse_high_active_queries.json",
"ClickHouse replica delay is high": "assets/monitors/clickhouse_replica_delay.json",
"ClickHouse background merge pool is saturated": "assets/monitors/clickhouse_merge_pool_saturation.json",
"ClickHouse thread CPU scheduling wait is high": "assets/monitors/clickhouse_high_thread_cpu_wait.json"
}
}
}