Adding new monitor template for ClickHouse #23027
Draft: sangeetashivaji wants to merge 1 commit into master from sangeeta.shivajirao/clickhouse-monitor-templates
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| { | ||
| "version": 2, | ||
| "created_at": "2026-03-23", | ||
| "last_updated_at": "2026-03-23", | ||
| "title": "ClickHouse cannot connect", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "description": "The Datadog Agent is unable to connect to the monitored ClickHouse instance. This may indicate that ClickHouse is down, unreachable, or that the Agent's credentials are misconfigured.", | ||
| "definition": { | ||
| "message": "The Datadog Agent cannot connect to ClickHouse on {{host.name}}. Verify that the ClickHouse server is running and that the Agent configuration is correct.", | ||
| "name": "[ClickHouse] Cannot connect to {{host.name}}", | ||
| "options": { | ||
| "new_host_delay": 300, | ||
| "no_data_timeframe": 2, | ||
| "notify_audit": false, | ||
| "notify_no_data": false, | ||
| "renotify_interval": 0, | ||
| "thresholds": { | ||
| "critical": 1, | ||
| "ok": 1, | ||
| "warning": 1 | ||
| }, | ||
| "timeout_h": 0 | ||
| }, | ||
| "query": "\"clickhouse.can_connect\".over(\"*\").by(\"*\").last(2).count_by_status()", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "type": "service check" | ||
| } | ||
| } |
36 changes: 36 additions & 0 deletions
clickhouse/assets/monitors/clickhouse_high_active_queries.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| { | ||
| "version": 2, | ||
| "created_at": "2026-03-23", | ||
| "last_updated_at": "2026-03-23", | ||
| "title": "ClickHouse active query count is high", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "description": "A high number of simultaneously executing queries can saturate ClickHouse thread pools and degrade performance for all users. This monitor tracks the number of concurrently active queries.", | ||
| "definition": { | ||
| "message": "{{#is_alert}}\n\n## What's happening?\nClickHouse on {{host.name}} has a high number of concurrently active queries over the last 5 minutes. This may indicate query pile-up, long-running queries, or insufficient resources.\n\n## How to investigate\nCheck `system.processes` in ClickHouse for currently executing queries and identify any long-running or stuck queries.\n\n{{/is_alert}}", | ||
| "name": "[ClickHouse] High number of active queries on {{host.name}}", | ||
| "options": { | ||
| "escalation_message": "", | ||
| "include_tags": true, | ||
| "locked": false, | ||
| "new_host_delay": 300, | ||
| "no_data_timeframe": null, | ||
| "notify_audit": false, | ||
| "notify_no_data": false, | ||
| "renotify_interval": "0", | ||
| "require_full_window": true, | ||
| "thresholds": { | ||
| "critical": 200, | ||
| "warning": 100 | ||
| }, | ||
| "timeout_h": 0 | ||
| }, | ||
| "priority": null, | ||
| "query": "avg(last_5m):avg:clickhouse.query.active{*} > 200", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "type": "query alert" | ||
| } | ||
| } | ||
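In the definition above, the critical threshold appears twice: once as `options.thresholds.critical` and once as the comparison value at the end of `query`. A minimal sketch of a consistency check for that pair follows. The helper names `query_threshold` and `thresholds_consistent` and the trimmed inline sample are illustrative assumptions; they are not part of this PR or of any Datadog tooling.

```python
import json
import re

# Trimmed copy of the monitor definition above (only the fields the check needs).
SAMPLE = json.loads("""
{
  "definition": {
    "options": {"thresholds": {"critical": 200, "warning": 100}},
    "query": "avg(last_5m):avg:clickhouse.query.active{*} > 200",
    "type": "query alert"
  }
}
""")


def query_threshold(query: str) -> float:
    """Return the numeric value after the trailing comparison operator in a query."""
    match = re.search(r"(?:>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*$", query)
    if match is None:
        raise ValueError(f"no trailing comparison found in query: {query!r}")
    return float(match.group(1))


def thresholds_consistent(monitor: dict) -> bool:
    """True when the query's comparison value equals options.thresholds.critical."""
    definition = monitor["definition"]
    critical = definition["options"]["thresholds"]["critical"]
    return query_threshold(definition["query"]) == float(critical)


print(thresholds_consistent(SAMPLE))
```

The same check could be run over each query-alert template in this PR; it does not apply to the service check monitor, whose query has no trailing comparison.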
36 changes: 36 additions & 0 deletions
clickhouse/assets/monitors/clickhouse_high_query_failure_rate.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| { | ||
| "version": 2, | ||
| "created_at": "2026-03-23", | ||
| "last_updated_at": "2026-03-23", | ||
| "title": "ClickHouse query failure rate is high", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "description": "A high rate of failed queries in ClickHouse can indicate problematic queries, resource exhaustion, or misconfigured query limits. This monitor tracks the per-second rate of failed queries to catch degradation early.", | ||
| "definition": { | ||
| "message": "{{#is_alert}}\n\n## What's happening?\nClickHouse on {{host.name}} has a high query failure rate over the last 5 minutes.\n\n## How to investigate\nCheck the ClickHouse system log (`system.query_log`) for error details and identify the failing queries.\n\n{{/is_alert}}", | ||
| "name": "[ClickHouse] High query failure rate on {{host.name}}", | ||
| "options": { | ||
| "escalation_message": "", | ||
| "include_tags": true, | ||
| "locked": false, | ||
| "new_host_delay": 300, | ||
| "no_data_timeframe": null, | ||
| "notify_audit": false, | ||
| "notify_no_data": false, | ||
| "renotify_interval": "0", | ||
| "require_full_window": true, | ||
| "thresholds": { | ||
| "critical": 5, | ||
| "warning": 1 | ||
| }, | ||
| "timeout_h": 0 | ||
| }, | ||
| "priority": null, | ||
| "query": "avg(last_5m):avg:clickhouse.query.failed.count{*}.as_rate() > 5", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "type": "query alert" | ||
| } | ||
| } |
36 changes: 36 additions & 0 deletions
clickhouse/assets/monitors/clickhouse_high_thread_cpu_wait.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| { | ||
| "version": 2, | ||
| "created_at": "2026-03-23", | ||
| "last_updated_at": "2026-03-23", | ||
| "title": "ClickHouse thread CPU scheduling wait is high", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "description": "CPU scheduling wait measures the percentage of time a ClickHouse thread was ready to execute but waiting to be scheduled by the OS. High values indicate CPU contention — the server has more runnable threads than available CPU cores — which causes query latency to increase even when queries are not I/O bound.", | ||
| "definition": { | ||
| "message": "{{#is_alert}}\n\n## What's happening?\nClickHouse threads on {{host.name}} are spending a high percentage of time waiting for CPU scheduling. This indicates CPU saturation and will cause query latency to degrade.\n\n## How to investigate\nCheck overall host CPU utilization. Review concurrent query load via `system.processes`. Consider scaling up CPU resources or reducing query concurrency.\n\n{{/is_alert}}\n\n{{#is_warning}}\n\nClickHouse thread CPU wait on {{host.name}} is elevated. Monitor for further increase.\n\n{{/is_warning}}", | ||
| "name": "[ClickHouse] High thread CPU scheduling wait on {{host.name}}", | ||
| "options": { | ||
| "escalation_message": "", | ||
| "include_tags": true, | ||
| "locked": false, | ||
| "new_host_delay": 300, | ||
| "no_data_timeframe": null, | ||
| "notify_audit": false, | ||
| "notify_no_data": false, | ||
| "renotify_interval": "0", | ||
| "require_full_window": true, | ||
| "thresholds": { | ||
| "critical": 80, | ||
| "warning": 50 | ||
| }, | ||
| "timeout_h": 0 | ||
| }, | ||
| "priority": null, | ||
| "query": "avg(last_5m):avg:clickhouse.thread.cpu.wait{*} > 80", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "type": "query alert" | ||
| } | ||
| } |
36 changes: 36 additions & 0 deletions
clickhouse/assets/monitors/clickhouse_merge_pool_saturation.json
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| { | ||
| "version": 2, | ||
| "created_at": "2026-03-23", | ||
| "last_updated_at": "2026-03-23", | ||
| "title": "ClickHouse background merge pool is saturated", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "description": "ClickHouse uses background merge operations to combine data parts in MergeTree tables. When the merge pool is saturated, new merges cannot be scheduled, leading to an accumulation of small parts that degrades query performance and increases storage overhead. This monitor tracks the number of active background merge tasks.", | ||
| "definition": { | ||
| "message": "{{#is_alert}}\n\n## What's happening?\nThe ClickHouse background merge pool on {{host.name}} has a high number of active merge tasks. This may indicate write pressure exceeding the merge throughput, or merges being blocked by long-running operations.\n\n## How to investigate\nCheck `system.merges` for currently running merges. Consider reducing insert frequency, increasing `background_pool_size`, or investigating mutations blocking merges.\n\n{{/is_alert}}\n\n{{#is_warning}}\n\nThe ClickHouse background merge pool on {{host.name}} is becoming saturated. Monitor for further increase.\n\n{{/is_warning}}", | ||
| "name": "[ClickHouse] Background merge pool is saturated on {{host.name}}", | ||
| "options": { | ||
| "escalation_message": "", | ||
| "include_tags": true, | ||
| "locked": false, | ||
| "new_host_delay": 300, | ||
| "no_data_timeframe": null, | ||
| "notify_audit": false, | ||
| "notify_no_data": false, | ||
| "renotify_interval": "0", | ||
| "require_full_window": true, | ||
| "thresholds": { | ||
| "critical": 14, | ||
| "warning": 10 | ||
| }, | ||
| "timeout_h": 0 | ||
| }, | ||
| "priority": null, | ||
| "query": "avg(last_5m):avg:clickhouse.background_pool.merges.task.active{*} > 14", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "type": "query alert" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,36 @@ | ||
| { | ||
| "version": 2, | ||
| "created_at": "2026-03-23", | ||
| "last_updated_at": "2026-03-23", | ||
| "title": "ClickHouse replica delay is high", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "description": "Replica delay is the lag between when data is written to the primary shard and when it is replicated to replica nodes. High replica delay can lead to stale reads and indicate replication health issues. This monitor tracks the maximum absolute replica queue delay across replicated tables.", | ||
| "definition": { | ||
| "message": "{{#is_alert}}\n\n## What's happening?\nClickHouse replica delay on {{host.name}} has exceeded the critical threshold over the last 15 minutes. Replicas may be serving stale data.\n\n## How to investigate\nCheck `system.replicas` for tables with high `absolute_delay`. Look for network issues between replicas, or high write load on the primary shard.\n\n{{/is_alert}}\n\n{{#is_warning}}\n\nClickHouse replica delay on {{host.name}} is elevated. Monitor for further increase.\n\n{{/is_warning}}", | ||
| "name": "[ClickHouse] Replica delay is high on {{host.name}}", | ||
| "options": { | ||
| "escalation_message": "", | ||
| "include_tags": true, | ||
| "locked": false, | ||
| "new_host_delay": 300, | ||
| "no_data_timeframe": null, | ||
| "notify_audit": false, | ||
| "notify_no_data": false, | ||
| "renotify_interval": "0", | ||
| "require_full_window": true, | ||
| "thresholds": { | ||
| "critical": 300000, | ||
| "warning": 60000 | ||
| }, | ||
| "timeout_h": 0 | ||
| }, | ||
| "priority": null, | ||
| "query": "avg(last_15m):avg:clickhouse.replica.delay.absolute{*} > 300000", | ||
| "tags": [ | ||
| "integration:clickhouse" | ||
| ], | ||
| "type": "query alert" | ||
| } | ||
| } |
This query creates a single aggregated alert across all hosts (`avg:...{*}` without `by {host}`), but the monitor name and message use `{{host.name}}` and describe host-level symptoms. In multi-host deployments, one overloaded node can be averaged out by healthy nodes, so the alert may never fire even when a specific host is unhealthy, and the host template variable may be empty in notifications. The same pattern appears in the other newly added query monitors (`clickhouse_high_query_failure_rate.json`, `clickhouse_high_thread_cpu_wait.json`, `clickhouse_merge_pool_saturation.json`, and `clickhouse_replica_delay.json`).
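The averaging effect described in this comment can be illustrated with a small Python sketch. The host names and values are hypothetical, the threshold of 200 is taken from the active-queries monitor, and `GROUPED_QUERY` shows one possible per-host form of that query as an assumption about the fix, not the author's final change.

```python
from statistics import mean

CRITICAL = 200  # critical threshold from the active-queries monitor

# One possible grouped form of the query, so each host is evaluated separately.
GROUPED_QUERY = "avg(last_5m):avg:clickhouse.query.active{*} by {host} > 200"

# Hypothetical per-host averages of clickhouse.query.active over the window.
active_queries = {"ch-1": 650.0, "ch-2": 40.0, "ch-3": 35.0, "ch-4": 30.0}


def aggregated_alert(values, threshold):
    """Single alert: average across all hosts, then compare once."""
    return mean(values.values()) > threshold


def per_host_alerts(values, threshold):
    """Grouped alert: compare each host's value on its own."""
    return sorted(host for host, value in values.items() if value > threshold)


print(aggregated_alert(active_queries, CRITICAL))  # fleet average is 188.75
print(per_host_alerts(active_queries, CRITICAL))
```

In this sample, the fleet-wide average stays below the threshold even though `ch-1` is at more than three times the critical value, while the per-host evaluation flags `ch-1` alone.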