Implemented Prometheus Rule for automated alerts (#193)
feat(cluster): Prometheus Rule for automated alerts + runbooks for a basic set of alerts

* Renamed: `cluster.monitoring.enablePodMonitor` to `cluster.monitoring.podMonitor.enabled`
* New configuration option: `cluster.monitoring.prometheusRule.enabled` defaults to `true`
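
A minimal `values.yaml` sketch of the renamed and new options (the structure follows the chart's `cluster.monitoring` section; comments mark what is a default versus illustrative):

```yaml
cluster:
  monitoring:
    enabled: true     # chart default is false; enabled here for illustration
    podMonitor:
      enabled: true   # renamed from cluster.monitoring.enablePodMonitor
    prometheusRule:
      enabled: true   # new option; defaults to true
```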

Signed-off-by: Itay Grudev <[email protected]>
Signed-off-by: Gabriele Bartolini <[email protected]>
Co-authored-by: Gabriele Bartolini <[email protected]>
itay-grudev and gbartolini committed Mar 1, 2024
1 parent 001d787 commit b2088c4
Showing 19 changed files with 908 additions and 33 deletions.
17 changes: 7 additions & 10 deletions Makefile
@@ -12,15 +12,12 @@ docs: ## Generate charts' docs using helm-docs
(echo "Please, install https://github.com/norwoodj/helm-docs first" && exit 1)

.PHONY: schema
schema: ## Generate charts' schema using helm schema-gen plugin
@helm schema-gen charts/cloudnative-pg/values.yaml > charts/cloudnative-pg/values.schema.json || \
(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)
schema: cloudnative-pg-schema cluster-schema ## Generate charts' schema using helm-schema-gen

.PHONY: pgbench-deploy
pgbench-deploy: ## Installs pgbench chart
helm dependency update charts/pgbench
helm upgrade --install pgbench --atomic charts/pgbench
cloudnative-pg-schema:
@helm schema-gen charts/cloudnative-pg/values.yaml | cat > charts/cloudnative-pg/values.schema.json || \
(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)

.PHONY: pgbench-uninstall
pgbench-uninstall: ## Uninstalls cnpg-pgbench chart if present
@helm uninstall pgbench
cluster-schema:
@helm schema-gen charts/cluster/values.yaml | cat > charts/cluster/values.schema.json || \
(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)
12 changes: 7 additions & 5 deletions charts/cluster/README.md
@@ -88,9 +88,9 @@ Additionally you can specify the following parameters:
```yaml
backups:
scheduledBackups:
- name: daily-backup
schedule: "0 0 0 * * *" # Daily at midnight
backupOwnerReference: self
- name: daily-backup
schedule: "0 0 0 * * *" # Daily at midnight
backupOwnerReference: self
```
Each backup adapter takes its own set of parameters, listed in the [Configuration options](#Configuration-options) section.
@@ -149,8 +149,10 @@ refer to the [CloudNativePG Documentation](https://cloudnative-pg.io/documentat
| cluster.instances | int | `3` | Number of instances |
| cluster.logLevel | string | `"info"` | The instances' log level, one of the following values: error, warning, info (default), debug, trace |
| cluster.monitoring.customQueries | list | `[]` | |
| cluster.monitoring.enablePodMonitor | bool | `false` | |
| cluster.postgresql | string | `nil` | Configuration of the PostgreSQL server See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration |
| cluster.monitoring.enabled | bool | `false` | |
| cluster.monitoring.podMonitor.enabled | bool | `true` | |
| cluster.monitoring.prometheusRule.enabled | bool | `true` | |
| cluster.postgresql | object | `{}` | Configuration of the PostgreSQL server See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration |
| cluster.primaryUpdateMethod | string | `"switchover"` | Method to follow to upgrade the primary server during a rolling update procedure, after all replicas have been successfully updated. It can be switchover (default) or in-place (restart). |
| cluster.primaryUpdateStrategy | string | `"unsupervised"` | Strategy to follow to upgrade the primary server during a rolling update procedure, after all replicas have been successfully updated: it can be automated (unsupervised - default) or manual (supervised) |
| cluster.priorityClassName | string | `""` | |
49 changes: 49 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHACritical.md
@@ -0,0 +1,49 @@
CNPGClusterHACritical
=====================

Meaning
-------

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

This can happen during either a normal failover or automated minor version upgrades in a cluster with 2 or fewer
instances. The replaced instance may need some time to catch up with the cluster's primary instance.

This alert will always be triggered if your cluster is configured to run with only 1 instance. In that case, you
may want to silence it.

Impact
------

Having no available replicas puts your cluster at severe risk if the primary instance fails. The primary instance is
still online and able to serve queries, although connections to the `-ro` endpoint will fail.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
51 changes: 51 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
@@ -0,0 +1,51 @@
CNPGClusterHAWarning
====================

Meaning
-------

The `CNPGClusterHAWarning` alert is triggered when the CloudNativePG cluster has fewer than `2` ready standby replicas.

This alert will always be triggered if your cluster is configured to run with fewer than `3` instances. In that case,
you may want to silence it.

Impact
------

Having fewer than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

This can happen during a normal failover or automated minor version upgrades. The replaced instance may need some time
to catch up with the cluster's primary instance, which will trigger the alert if the operation takes more than 5 minutes.

At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsCritical.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsCritical
==================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 95% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections, resulting in a
service disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter, as in the sketch below.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
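
A hedged sketch of raising `max_connections` through this chart's `cluster.postgresql` values (the pass-through to PostgreSQL parameters is an assumption based on the `PostgresConfiguration` API referenced in the README; the value is illustrative):

```yaml
cluster:
  postgresql:
    parameters:
      # Illustrative value; size it against available memory, since every
      # connection consumes server resources. Changing it requires a restart,
      # which the operator handles as a rolling update.
      max_connections: "200"
```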
24 changes: 24 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighConnectionsWarning.md
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsWarning
=================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 85% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections, resulting in a
service disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by raising the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database, as in the sketch below.
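
As a hedged illustration of the pooling option, a minimal CloudNativePG `Pooler` resource (the `Pooler` CRD ships with CloudNativePG; names and sizes here are placeholders to adapt):

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: <cluster-name>-pooler-rw   # placeholder name
spec:
  cluster:
    name: <cluster-name>           # must match your Cluster resource
  instances: 3
  type: rw
  pgbouncer:
    poolMode: session
    parameters:
      max_client_conn: "1000"      # illustrative sizing
      default_pool_size: "10"
```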
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
@@ -0,0 +1,31 @@
CNPGClusterHighReplicationLag
=============================

Meaning
-------

This alert is triggered when the replication lag of the CloudNativePG cluster exceeds `1s`.

Impact
------

High replication lag can cause the cluster replicas to fall out of sync, so queries to the `-r` and `-ro` endpoints may return stale data.
In the event of a failover, data committed during the lag window may be lost.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

High replication lag can be caused by a number of factors, including:

* Network issues
* High load on the primary or replicas
* Long-running queries
* Suboptimal PostgreSQL configuration, in particular a low `max_wal_senders` setting

Check the replication status on the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * FROM pg_stat_replication;"
```

Mitigation
----------
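
As one hedged example, if the diagnosis points at a low `max_wal_senders` setting, it can be raised through this chart's `cluster.postgresql` values (the key pass-through is an assumption based on the `PostgresConfiguration` API referenced in the README; the value is illustrative):

```yaml
cluster:
  postgresql:
    parameters:
      # Illustrative value; must comfortably exceed the number of replicas
      # plus any backup or logical-streaming clients.
      max_wal_senders: "15"
```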
28 changes: 28 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
@@ -0,0 +1,28 @@
CNPGClusterInstancesOnSameNode
==============================

Meaning
-------

The `CNPGClusterInstancesOnSameNode` alert is raised when two or more database pods are scheduled on the same node.

Impact
------

A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Mitigation
----------

1. Verify that you have more than one node without taints that would prevent pods from being scheduled there (see the
   node-listing sketch after this list).
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration.
3. For more information, refer to the ["Scheduling"](https://cloudnative-pg.io/documentation/current/scheduling/) section in the documentation.
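
A quick way to list nodes alongside their taints for step 1 (standard kubectl; no chart-specific assumptions):

```bash
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```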
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
@@ -0,0 +1,31 @@
CNPGClusterLowDiskSpaceCritical
===============================

Meaning
-------

This alert is triggered when disk usage on the CloudNativePG cluster exceeds 90%. It can be triggered for any of the following volumes:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)

Impact
------

Excessive disk usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
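
A hedged check of volume usage from inside an instance pod (the path assumes CloudNativePG's default `PGDATA` mount):

```bash
kubectl exec --namespace <namespace> <instance-pod-name> -- df -h /var/lib/postgresql/data
```
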
Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
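
A hedged way to inspect archiver health from the primary, mirroring the `psql` pattern used in the other runbooks (`pg_stat_archiver` is a standard PostgreSQL statistics view):

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT archived_count, failed_count, last_failed_wal FROM pg_stat_archiver;"
```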
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
@@ -0,0 +1,31 @@
CNPGClusterLowDiskSpaceWarning
==============================

Meaning
-------

This alert is triggered when disk usage on the CloudNativePG cluster exceeds 70%. It can be triggered for any of the following volumes:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)

Impact
------

Excessive disk usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
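
If the volume is simply undersized, a hedged sketch of expanding it through this chart's values (requires a
StorageClass that supports volume expansion; the size is illustrative):

```yaml
cluster:
  storage:
    size: 40Gi   # illustrative; must be larger than the current size
```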
43 changes: 43 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterOffline.md
@@ -0,0 +1,43 @@
CNPGClusterOffline
==================

Meaning
-------

The `CNPGClusterOffline` alert is triggered when there are no ready CloudNativePG instances.

Impact
------

Having an offline cluster means your applications will not be able to access the database, leading to potential service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
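
If you have the [cnpg kubectl plugin](https://cloudnative-pg.io/documentation/current/kubectl-plugin/) installed, its status command gives a quick overview of the cluster (cluster name and namespace are placeholders):

```bash
kubectl cnpg status <cluster-name> --namespace <namespace>
```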