From ba3b258bd03f695bd9936c517766de47c331167d Mon Sep 17 00:00:00 2001
From: Matthieu Huin <mhuin@redhat.com>
Date: Fri, 29 Sep 2023 15:55:01 +0200
Subject: [PATCH] Monitoring: document feature

Add a monitoring page in the deployment documentation.

Change-Id: Iadf2d03957c24e29c6c8adb921af97cc924c4e64
---
 doc/deployment/index.md          |  1 +
 doc/deployment/monitoring.md     | 93 ++++++++++++++++++++++++++++++++
 doc/developer/getting_started.md | 12 +++--
 doc/operator/getting_started.md  |  5 ++
 4 files changed, 108 insertions(+), 3 deletions(-)
 create mode 100644 doc/deployment/monitoring.md

diff --git a/doc/deployment/index.md b/doc/deployment/index.md
index 83ce75a6..01a4322c 100644
--- a/doc/deployment/index.md
+++ b/doc/deployment/index.md
@@ -12,5 +12,6 @@ and managing a Software Factory Custom Resource through SF-Operator.
     1. [Zuul](./zuul.md)
     1. Logserver
 1. [Setting up certificates](./certificates.md)
+1. [Monitoring](./monitoring.md)
 1. [Deleting a deployment](./delete.md)
 1. [Custom Resource Definitions reference](./crds.md)
\ No newline at end of file
diff --git a/doc/deployment/monitoring.md b/doc/deployment/monitoring.md
new file mode 100644
index 00000000..be885182
--- /dev/null
+++ b/doc/deployment/monitoring.md
@@ -0,0 +1,93 @@
+# Monitoring
+
+This page describes the monitoring available for services deployed with SF-Operator.
+
+## Table of Contents
+
+1. [Concepts](#concepts)
+1. [Accessing the metrics](#accessing-the-metrics)
+1. [Statsd](#statsd)
+1. [Predefined alerts](#predefined-alerts)
+
+## Concepts
+
+SF-Operator uses the [prometheus-operator](https://prometheus-operator.dev/) to expose and collect service metrics.
+SF-Operator will automatically create [PodMonitors](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#podmonitor) for the following services:
+
+* Log Server
+* [Nodepool](./nodepool.md)
+* [Zuul](./zuul.md)
+
+| Service | Statsd metrics | Prometheus metrics |
+|---------|--------|-------|
+| Log Server | ❌ | ✅ |
+| Nodepool | ✅ | ✅ |
+| Zuul | ✅ | ✅ |
+
+The `PodMonitors` are set with the label key `sf-monitoring` (and a value equal to the monitored service name); that key can be used for filtering metrics.
+
+You can list the PodMonitors this way:
+
+```sh
+kubectl get podmonitors
+```
+
+The `Log server` service runs the [Node Exporter](https://prometheus.io/docs/guides/node-exporter/) process as a sidecar container as well, in order to expose disk space-related metrics.
+
+For services that expose statsd metrics, a sidecar container running [Statsd Exporter](https://github.com/prometheus/statsd_exporter)
+is added to the service pod, so that these metrics can be consumed by a Prometheus instance.
+
+## Accessing the metrics
+
+If monitoring for user-defined projects is [enabled in your cluster](https://docs.openshift.com/container-platform/4.13/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects), metrics will automatically
+be collected by the cluster-wide Prometheus instance. Check with your cluster admin about getting access to your metrics.
+
+If this feature isn't enabled in your cluster, you will need to deploy your own Prometheus instance to collect the metrics.
+To do so, you can either:
+
+* Follow the [CLI documentation](./../cli/index.md#prometheus) to deploy a standalone Prometheus instance
+* Follow the [prometheus-operator's documentation](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md#deploying-prometheus) to deploy it on your own
+
+In the latter case, you will need to set the proper `podMonitorSelector` in the Prometheus instance's manifest:
+
+```yaml
+  # assuming Prometheus is deployed in the same namespace as SF
+  podMonitorNamespaceSelector: {}
+  podMonitorSelector:
+    matchExpressions:
+    - key: sf-monitoring
+      operator: Exists
+```
+
+## Statsd
+
+### Statsd Exporter mappings
+
+Statsd Exporter sidecars are preconfigured to map every statsd metric issued by Zuul and Nodepool into Prometheus-compatible metrics.
+You can find the mapping definitions [here (Nodepool)](./../../controllers/static/nodepool/statsd_mapping.yaml) and [here (Zuul)](./../../controllers/static/zuul/statsd_mapping.yaml).
+
+### Forwarding
+
+You can use the `relayAddress` property in a SoftwareFactory custom resource to define a different statsd collector for Zuul and Nodepool, for example an external Graphite instance.
+
+## Predefined alerts
+
+SF-Operator defines some metrics-related alert rules for the deployed services.
+
+> The alert rules are defined for Prometheus. Handling these alerts (typically sending out notifications) requires another service called [AlertManager](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md). Managing AlertManager is out of scope for this documentation.
+> You may need to [configure](https://docs.openshift.com/container-platform/4.13/monitoring/managing-alerts.html#sending-notifications-to-external-systems_managing-alerts) or
+> [install](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md) an
+> AlertManager instance on your cluster,
+> or configure Prometheus to forward alerts to an external AlertManager instance.
+
+The following alerting rules are created automatically at deployment time:
+
+| Alert name | Severity | Service | Prometheus rule group | Description |
+|---------|------|------|--------|------------------|
+| `OutOfDiskNow` | critical | Log server | disk_default.rules | The Log server has less than 10% free storage space left |
+| `OutOfDiskInThreeDays` | warning | Log server | disk_default.rules | Assuming a linear trend, the Log server's storage space will fill up in less than three days |
+| `ConfigUpdateFailureInPostPipeline` | critical | Zuul | config-repository_default.rules | A `config-update` job failed in the `post` pipeline, meaning a configuration change was not applied properly to the Software Factory deployment's services |
+| `DIBImageBuildFailure` | warning | nodepool-builder | builder_default.rules | The disk-image-builder service (DIB) failed to build an image |
+| `HighOpenStackAPIError5xxRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of API calls on an OpenStack provider return a status code of 5xx (server-side error) over a period of 15 minutes |
+| `HighFailedStateRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of nodes on a provider are in failed state over a period of one hour |
+| `HighNodeLaunchErrorRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of node launch events end in an error state over a period of one hour |
\ No newline at end of file
diff --git a/doc/developer/getting_started.md b/doc/developer/getting_started.md
index d8c99e4a..e562c702 100644
--- a/doc/developer/getting_started.md
+++ b/doc/developer/getting_started.md
@@ -46,6 +46,14 @@ You can read about [how to deploy a MicroShift instance here](./microshift.md).
 
 ## Deploy test resources
 
+With `sfconfig`, you can quickly set up a demo deployment consisting of the following:
+
+* a SoftwareFactory resource (Zuul, Nodepool, Log server and backend services)
+* a companion Gerrit service hosting:
+    * the deployment's config repository
+    * a demo repository
+* a companion Prometheus instance for monitoring
+
 The operator will automatically use the current context in your kubeconfig file
 (i.e. whatever cluster `kubectl cluster-info` shows).
 Make sure that your current context is the right one for development. In this example, we are using
@@ -60,9 +68,7 @@ kubectl config set-context microshift --namespace=sf
 
 Edit the [sfconfig.yaml](./../../sfconfig.yaml) configuration file to your liking, for example by setting up a custom FQDN.
 
-Then run the `sfconfig` command to deploy a SoftwareFactory resource, a companion Gerrit service 
-preconfigured to host the deployment's config repository and a demo repository, and a companion
-Prometheus:
+Then run the `sfconfig` command:
 
 ```sh
 go run ./cli/sfconfig
diff --git a/doc/operator/getting_started.md b/doc/operator/getting_started.md
index 1187a7f3..a768dbb6 100644
--- a/doc/operator/getting_started.md
+++ b/doc/operator/getting_started.md
@@ -85,6 +85,11 @@ logservers.sf.softwarefactory-project.io
 softwarefactories.sf.softwarefactory-project.io
 ```
 
+Note that the SF-Operator OLM package depends on the following operators:
+
+* [cert-manager](https://cert-manager.io)
+* [prometheus-operator](https://prometheus-operator.dev)
+
 Congratulations, the SF Operator is now running on your cluster!
 
 ## Troubleshooting