Monitoring: document feature
Add a monitoring page in the deployment documentation.

Change-Id: Iadf2d03957c24e29c6c8adb921af97cc924c4e64
mhuin committed Oct 10, 2023
1 parent 7b55a7d commit ba3b258
Showing 4 changed files with 108 additions and 3 deletions.
1 change: 1 addition & 0 deletions doc/deployment/index.md
@@ -12,5 +12,6 @@ and managing a Software Factory Custom Resource through SF-Operator.
1. [Zuul](./zuul.md)
1. Logserver
1. [Setting up certificates](./certificates.md)
1. [Monitoring](./monitoring.md)
1. [Deleting a deployment](./delete.md)
1. [Custom Resource Definitions reference](./crds.md)
93 changes: 93 additions & 0 deletions doc/deployment/monitoring.md
@@ -0,0 +1,93 @@
# Monitoring

Here you will find information about what monitoring is available on services deployed with SF-Operator.

## Table of Contents

1. [Concepts](#concepts)
1. [Accessing the metrics](#accessing-the-metrics)
1. [Statsd](#statsd)
1. [Predefined alerts](#predefined-alerts)

## Concepts

SF-Operator uses the [prometheus-operator](https://prometheus-operator.dev/) to expose and collect service metrics.
SF-Operator will automatically create [PodMonitors](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/api.md#podmonitor) for the following services:

* Log Server
* [Nodepool](./nodepool.md)
* [Zuul](./zuul.md)

| Service | Statsd metrics | Prometheus metrics |
|------------|----------------|--------------------|
| Log Server | ✗ | ✓ |
| Nodepool | ✓ | ✓ |
| Zuul | ✓ | ✓ |

The `PodMonitors` are set with the label key `sf-monitoring` (and a value equal to the monitored service name); that key can be used for filtering metrics.

You can list the PodMonitors this way:

```sh
kubectl get podmonitors
```

The `Log server` service also runs the [Node Exporter](https://prometheus.io/docs/guides/node-exporter/) process as a sidecar container, in order to expose disk usage metrics.

For services that expose statsd metrics, a sidecar container running [Statsd Exporter](https://github.com/prometheus/statsd_exporter)
is added to the service pod, so that these metrics can be consumed by a Prometheus instance.
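
As an illustration, a generated `PodMonitor` has roughly the following shape. This is a hand-written sketch of the prometheus-operator API, not the exact manifest SF-Operator produces: the resource name, pod selector, and port name below are assumptions.

```yaml
# Hypothetical PodMonitor sketch; actual names, selectors and ports
# are generated by SF-Operator.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: zuul-monitor          # assumed name
  labels:
    sf-monitoring: zuul       # label key set by SF-Operator
spec:
  selector:
    matchLabels:
      app: sf                 # assumed pod label
  podMetricsEndpoints:
    - port: statsd-exporter   # assumed port name on the sidecar
```

You can inspect the real manifests with `kubectl get podmonitors -o yaml`.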

## Accessing the metrics

If [enabled in your cluster](https://docs.openshift.com/container-platform/4.13/monitoring/enabling-monitoring-for-user-defined-projects.html#enabling-monitoring-for-user-defined-projects), metrics will automatically
be collected by the cluster-wide Prometheus instance. Check with your cluster admin about getting access to your metrics.

If this feature isn't enabled in your cluster, you will need to deploy your own Prometheus instance to collect the metrics.
To do so, you can either:

* Follow the [CLI documentation](./../cli/index.md#prometheus) to deploy a standalone Prometheus instance
* Follow the [prometheus-operator's documentation](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/getting-started.md#deploying-prometheus) to deploy it on your own

In the latter case, you will need to set the proper `PodMonitorSelector` in the Prometheus instance's manifest:

```yaml
# assuming Prometheus is deployed in the same namespace as SF
podMonitorNamespaceSelector: {}
podMonitorSelector:
matchExpressions:
- key: sf-monitoring
operator: Exists
```
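
For reference, the selector snippet above fits into a minimal Prometheus custom resource along these lines. The resource name, replica count, and service account are placeholder assumptions; the service account needs RBAC permissions to discover pods.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: sf-prometheus              # placeholder name
spec:
  replicas: 1
  serviceAccountName: prometheus   # assumes an account with pod discovery RBAC
  podMonitorNamespaceSelector: {}  # same namespace as SF in this sketch
  podMonitorSelector:
    matchExpressions:
      - key: sf-monitoring
        operator: Exists
```
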
## Statsd

### Statsd Exporter mappings

Statsd Exporter sidecars are preconfigured to map every statsd metric issued by Zuul and Nodepool into Prometheus-compatible metrics.
You can find the mapping definitions [here (Nodepool)](./../../controllers/static/nodepool/statsd_mapping.yaml) and [here (Zuul)](./../../controllers/static/zuul/statsd_mapping.yaml).
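
For context, these mappings follow the upstream [Statsd Exporter](https://github.com/prometheus/statsd_exporter) mapping configuration format. A hypothetical entry, not taken from the linked files, looks like:

```yaml
mappings:
  - match: "zuul.executor.*.builds"   # hypothetical statsd metric pattern
    name: "zuul_executor_builds"      # resulting Prometheus metric name
    labels:
      executor: "$1"                  # captured from the wildcard above
```
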
### Forwarding

It is possible to use the `relayAddress` property in the SoftwareFactory CRD to define a different statsd collector for Zuul and Nodepool, for example an external Graphite instance.
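
A sketch of how this could look in a SoftwareFactory manifest. The exact placement of `relayAddress` in the spec should be checked against the [CRD reference](./crds.md); the placement, names, and FQDN below are assumptions for the example.

```yaml
apiVersion: sf.softwarefactory-project.io/v1
kind: SoftwareFactory
metadata:
  name: my-sf                # placeholder
spec:
  fqdn: sf.example.com       # placeholder
  # Assumed placement; consult the CRD reference for the actual path.
  relayAddress: graphite.example.com:8125
```
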

## Predefined alerts

SF-Operator defines some metrics-related alert rules for the deployed services.

> The alert rules are defined for Prometheus. Handling these alerts (typically sending out notifications) requires another service called [AlertManager](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md). How to manage AlertManager is out of scope for this documentation.
> You may need to [configure](https://docs.openshift.com/container-platform/4.13/monitoring/managing-alerts.html#sending-notifications-to-external-systems_managing-alerts) or
> [install](https://github.com/prometheus-operator/prometheus-operator/blob/main/Documentation/user-guides/alerting.md) an AlertManager instance on your cluster,
> or configure Prometheus to forward alerts to an external AlertManager instance.
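
When using the prometheus-operator, a minimal AlertManager setup consists of an `Alertmanager` resource plus an `alerting` stanza in the Prometheus resource pointing at the service the operator creates. The names and namespace below are placeholder assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: sf-alertmanager            # placeholder name
spec:
  replicas: 1
---
# Added to the Prometheus resource's spec:
# alerting:
#   alertmanagers:
#     - namespace: sf                       # assumed namespace
#       name: alertmanager-operated         # service created by the operator
#       port: web
```
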

The following alerting rules are created automatically at deployment time:

| Alert name | Severity | Service | Prometheus Group Rule | Description |
|---------|------|------|--------|------------------|
| `OutOfDiskNow` | critical | Log server | disk_default.rules | The Log server has less than 10% free storage space left |
| `OutOfDiskInThreeDays` | warning | Log server | disk_default.rules | Assuming a linear trend, the Log server's storage space will fill up in less than three days |
| `ConfigUpdateFailureInPostPipeline` | critical | Zuul | config-repository_default.rules | A `config-update` job failed in the `post` pipeline, meaning a configuration change was not applied properly to the Software Factory deployment's services |
| `DIBImageBuildFailure` | warning | nodepool-builder | builder_default.rules | The disk-image-builder service (DIB) failed to build an image |
| `HighOpenStackAPIError5xxRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of API calls on an OpenStack provider return a status code of 5xx (server-side error) over a period of 15 minutes |
| `HighFailedStateRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of nodes on a provider are in failed state over a period of one hour |
| `HighNodeLaunchErrorRate` | critical | nodepool-launcher | launcher_default.rules | Triggers when more than 5% of node launch events end in an error state over a period of one hour |
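
These rules are materialized as `PrometheusRule` objects, which you can list with `kubectl get prometheusrules`. As an illustration of their shape only, a rule like `OutOfDiskNow` could be expressed as follows; the actual expression, threshold, and resource name are defined by SF-Operator and may differ.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: logserver-disk-rules       # placeholder name
spec:
  groups:
    - name: disk_default.rules
      rules:
        - alert: OutOfDiskNow
          # Hypothetical expression based on Node Exporter metrics:
          expr: node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 10
          for: 5m                  # assumed
          labels:
            severity: critical
```
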
12 changes: 9 additions & 3 deletions doc/developer/getting_started.md
@@ -46,6 +46,14 @@ You can read about [how to deploy a MicroShift instance here](./microshift.md).

## Deploy test resources

With `sfconfig`, you can quickly deploy a demo environment consisting of the following:

* a SoftwareFactory resource (Zuul, Nodepool, Log server and backend services)
* a companion Gerrit service hosting:
* the deployment's config repository
* a demo repository
* a companion Prometheus instance for monitoring

The operator will automatically use the current context in your kubeconfig file
(i.e. whatever cluster `kubectl cluster-info` shows).
Make sure that your current context is the right one for development. In this example, we are using
@@ -60,9 +68,7 @@ kubectl config set-context microshift --namespace=sf

Edit the [sfconfig.yaml](./../../sfconfig.yaml) configuration file to your liking, for example by setting up a custom FQDN.

Then run the `sfconfig` command to deploy a SoftwareFactory resource, a companion Gerrit service
preconfigured to host the deployment's config repository and a demo repository, and a companion
Prometheus:
Then run the `sfconfig` command:

```sh
go run ./cli/sfconfig
5 changes: 5 additions & 0 deletions doc/operator/getting_started.md
@@ -85,6 +85,11 @@ logservers.sf.softwarefactory-project.io
softwarefactories.sf.softwarefactory-project.io
```

Note that the SF-Operator OLM package depends on the following operators:

* [cert-manager](https://cert-manager.io)
* [prometheus-operator](https://prometheus-operator.dev)

Congratulations, the SF Operator is now running on your cluster!

## Troubleshooting
