From ba3b258bd03f695bd9936c517766de47c331167d Mon Sep 17 00:00:00 2001
From: Matthieu Huin
Date: Fri, 29 Sep 2023 15:55:01 +0200
Subject: [PATCH] Monitoring: document feature

Add a monitoring page in the deployment documentation.

Change-Id: Iadf2d03957c24e29c6c8adb921af97cc924c4e64
---
 doc/deployment/index.md          |  1 +
 doc/deployment/monitoring.md     | 93 ++++++++++++++++++++++++++++++++
 doc/developer/getting_started.md | 12 +++--
 doc/operator/getting_started.md  |  5 ++
 4 files changed, 108 insertions(+), 3 deletions(-)
 create mode 100644 doc/deployment/monitoring.md

diff --git a/doc/deployment/index.md b/doc/deployment/index.md
index 83ce75a6..01a4322c 100644
--- a/doc/deployment/index.md
+++ b/doc/deployment/index.md
@@ -12,5 +12,6 @@ and managing a Software Factory Custom Resource through SF-Operator.
 1. [Zuul](./zuul.md)
 1. Logserver
 1. [Setting up certificates](./certificates.md)
+1. [Monitoring](./monitoring.md)
 1. [Deleting a deployment](./delete.md)
 1. [Custom Resource Definitions reference](./crds.md)
\ No newline at end of file
diff --git a/doc/deployment/monitoring.md b/doc/deployment/monitoring.md
new file mode 100644
index 00000000..be885182
--- /dev/null
+++ b/doc/deployment/monitoring.md
@@ -0,0 +1,93 @@
+# Monitoring
+
+This page describes the monitoring available for services deployed with SF-Operator.
+
+## Table of Contents
+
+1. [Concepts](#concepts)
+1. [Accessing the metrics](#accessing-the-metrics)
+1. [Statsd](#statsd)
+1. [Predefined alerts](#predefined-alerts)
+
+## Concepts
+
+SF-Operator uses the [prometheus-operator](https://prometheus-operator.dev/) to expose and collect service metrics.
+SF-Operator will automatically create [PodMonitors](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#podmonitor) for the following services:
+
+* Log Server
+* [Nodepool](./nodepool.md)
+* [Zuul](./zuul.md)
+
+| Service | Statsd metrics | Prometheus metrics |
+|---------|--------|-------|
+| Log Server | ❌ | ✅ |
+| Nodepool | ✅ | ✅ |
+| Zuul | ✅ | ✅ |
+
+The `PodMonitors` are set with the label key `sf-monitoring` (and a value equal to the monitored service name); that key can be used for filtering metrics.
+
+You can list the PodMonitors this way:
+
+```sh
+kubectl get podmonitors
+```
+
+The `Log server` service also runs the [Node Exporter](https://prometheus.io/docs/guides/node-exporter/) process as a sidecar container, in order to expose disk space-related metrics.
+
+For services that expose statsd metrics, a sidecar container running [Statsd Exporter](https://github.com/prometheus/statsd_exporter)
+is added to the service pod, so that these metrics can be consumed by a Prometheus instance.
+
+## Accessing the metrics
+
+If [enabled in your cluster](https://docs.openshift.com/container-platform/4.13/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects), metrics will automatically
+be collected by the cluster-wide Prometheus instance. Check with your cluster admin about getting access to your metrics.
+
+If this feature isn't enabled in your cluster, you will need to deploy your own Prometheus instance to collect the metrics.
+To do so, you can either:
+
+* Follow the [CLI documentation](./../cli/index.md#prometheus) to deploy a standalone Prometheus instance
+* Follow the [prometheus-operator's documentation](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md#deploying-prometheus) to deploy it on your own
+
+In the latter case, you will need to set the proper `podMonitorSelector` in the Prometheus instance's manifest:
+
+```yaml
+  # assuming Prometheus is deployed in the same namespace as SF
+  podMonitorNamespaceSelector: {}
+  podMonitorSelector:
+    matchExpressions:
+    - key: sf-monitoring
+      operator: Exists
+```
+
+## Statsd
+
+### Statsd Exporter mappings
+
+Statsd Exporter sidecars are preconfigured to map every statsd metric issued by Zuul and Nodepool into Prometheus-compatible metrics.
+You can find the mapping definitions [here (Nodepool)](./../../controllers/static/nodepool/statsd_mapping.yaml) and [here (Zuul)](./../../controllers/static/zuul/statsd_mapping.yaml).
+
+### Forwarding
+
+It is possible to use the `relayAddress` property in a SoftwareFactory CRD to define a different statsd collector for Zuul and Nodepool, for example an external Graphite instance.
+
+## Predefined alerts
+
+SF-Operator defines some metrics-related alert rules for the deployed services.
+
+> The alert rules are defined for Prometheus. Handling these alerts (typically sending out notifications) requires another service called [AlertManager](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md). How to manage AlertManager is out of scope for this documentation.
+You may need to [configure](https://docs.openshift.com/container-platform/4.13/monitoring/managing-alerts.html#sending-notifications-to-external-systems_managing-alerts) or
+[install](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md) an
+AlertManager instance on your cluster,
+or configure Prometheus to forward alerts to an external AlertManager instance.
+
+The following alerting rules are created automatically at deployment time:
+
+| Alert name | Severity | Service | Prometheus Group Rule | Description |
+|---------|------|------|--------|------------------|
+| `OutOfDiskNow` | critical | Log server | disk_default.rules | The Log server has less than 10% free storage space left |
+| `OutOfDiskInThreeDays` | warning | Log server | disk_default.rules | Assuming a linear trend, the Log server's storage space will fill up in less than three days |
+| `ConfigUpdateFailureInPostPipeline` | critical | Zuul | config-repository_default.rules | A `config-update` job failed in the `post` pipeline, meaning a configuration change was not applied properly to the Software Factory deployment's services |
+| `DIBImageBuildFailure` | warning | nodepool-builder | builder_default.rules | The disk-image-builder service (DIB) failed to build an image |
+| `HighOpenStackAPIError5xxRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of API calls on an OpenStack provider return a status code of 5xx (server-side error) over a period of 15 minutes |
+| `HighFailedStateRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of nodes on a provider are in failed state over a period of one hour |
+| `HighNodeLaunchErrorRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of node launch events end in an error state over a period of one hour |
\ No newline at end of file
diff --git a/doc/developer/getting_started.md b/doc/developer/getting_started.md
index d8c99e4a..e562c702 100644
--- a/doc/developer/getting_started.md
+++ b/doc/developer/getting_started.md
@@ -46,6 +46,14 @@ You can read about [how to deploy a MicroShift instance here](./microshift.md).
 
 ## Deploy test resources
 
+With `sfconfig`, you can quickly set up a demo deployment consisting of the following:
+
+* a SoftwareFactory resource (Zuul, Nodepool, Log server and backend services)
+* a companion Gerrit service hosting:
+  * the deployment's config repository
+  * a demo repository
+* a companion Prometheus instance for monitoring
+
 The operator will automatically use the current context in your kubeconfig file
 (i.e. whatever cluster `kubectl cluster-info` shows).
 Make sure that your current context is the right one for development. In this example, we are using
@@ -60,9 +68,7 @@ kubectl config set-context microshift --namespace=sf
 
 Edit the [sfconfig.yaml](./../../sfconfig.yaml) configuration file to your liking, for example by setting up a custom FQDN.
 
-Then run the `sfconfig` command to deploy a SoftwareFactory resource, a companion Gerrit service
-preconfigured to host the deployment's config repository and a demo repository, and a companion
-Prometheus:
+Then run the `sfconfig` command:
 
 ```sh
 go run ./cli/sfconfig
diff --git a/doc/operator/getting_started.md b/doc/operator/getting_started.md
index 1187a7f3..a768dbb6 100644
--- a/doc/operator/getting_started.md
+++ b/doc/operator/getting_started.md
@@ -85,6 +85,11 @@ logservers.sf.softwarefactory-project.io
 softwarefactories.sf.softwarefactory-project.io
 ```
 
+Note that the SF-Operator OLM package depends on the following operators:
+
+* [cert-manager](https://cert-manager.io)
+* [prometheus-operator](https://prometheus-operator.dev)
+
 Congratulations, the SF Operator is now running on your cluster!
 
 ## Troubleshooting