Commit 6daa569

update-readm

1 parent e1e3093 commit 6daa569

1 file changed: README.md (31 additions, 28 deletions)
```diff
@@ -18,13 +18,14 @@
 - [`Absent` function](#absent-function)
 - [Useful links](#useful-links)
 - [Alert routing](#alert-routing)
-  - [Opsgenie routing](#opsgenie-routing)
+  - [PagerDuty routing](#pagerduty-routing)
 - [Inhibitions](#inhibitions)
 - [Recording rules](#recording-rules)
 - [Mixins management](#mixins)
   - [kubernetes-mixins](#kubernetes-mixins)
   - [mimir-mixins](#mimir-mixins)
   - [loki-mixins](#loki-mixins)
+  - [tempo-mixins](#tempo-mixins)
 - [Testing](#testing)
   - [Prometheus rules unit tests](#prometheus-rules-unit-tests)
     - [Test syntax](#test-syntax)
```
```diff
@@ -146,10 +147,12 @@ Log-based alerts are processed differently in the observability platform but app
 We follow standardized practices for organizing our alerts using PrometheusRule custom resources.
 
 #### Mandatory annotations
+
 - `description`: Detailed explanation of what happened and what the alert is monitoring
 - `runbook_url`: Link to a runbook page with incident management instructions
 
 #### Recommended annotations
+
 - `summary`: Brief overview of what the alert detected
 - `__dashboardUid__`: Unique identifier of the relevant dashboard
 - `__panelId__`: Specific panel ID within the referenced dashboard
```
```diff
@@ -158,24 +161,26 @@ We follow standardized practices for organizing our alerts using PrometheusRule
 
 ##### Dashboard URL construction
 
-Alertmanager generates dashboard URLs for Opsgenie and Slack alerts using these rules:
+Alertmanager generates dashboard URLs for PagerDuty and Slack alerts using these rules:
 
 1. With only `__dashboardUid__`: `https://grafana.domain/__dashboardUid__`
 2. With both `__dashboardUid__` and `dashboardQueryParams`: `https://grafana.domain/__dashboardUid__?dashboardQueryParams`
 3. If `dashboardExternalUrl` is set: Uses the exact URL provided
 
 #### Mandatory labels
+
 - `area`: Functional area (e.g., platform, apps)
 - `team`: Responsible team identifier
-- `severity`: Alert severity level (page, notify)
+- `severity`: Alert severity level (page, notify, ticket)
 
 #### Optional labels
+
 - `cancel_if_*`: Labels used for alert inhibitions
-- `all_pipelines: "true"`: Ensures the alert is sent to Opsgenie regardless of installation's pipeline
+- `all_pipelines: "true"`: Ensures the alert is sent to PagerDuty regardless of the installation's pipeline
 
 #### `Absent` function
 
-If you want to make sure a metrics exists on one cluster, you can't just use the `absent` function anymore.
+If you want to make sure a metric exists on one cluster, you can't just use the `absent` function anymore.
 With `mimir` we have metrics for all the clusters in a single database, which makes detecting the absence of one metric on one cluster much harder.
 
 To achieve such a test, you should do as the [`MimirToGrafanaCloudExporterMissingData`](https://github.com/giantswarm/prometheus-rules/blob/d06a84e8369f4d0bafdf0d48f18120de15c8e18a/helm/prometheus-rules/templates/platform/atlas/alerting-rules/grafana-cloud.rules.yml#L33) alert does.
```
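Put together, the annotation and label conventions above can be sketched as a single hypothetical PrometheusRule. The alert name, metric names, runbook URL, and dashboard UID below are all illustrative, and the per-cluster absence expression shows one common `unless`-join pattern (comparing against a metric expected to exist on every cluster) rather than the exact query the linked alert uses:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-alerting-rules        # hypothetical
  namespace: monitoring
spec:
  groups:
    - name: example
      rules:
        - alert: ExampleMetricMissing  # hypothetical
          # Per-cluster absence: absent() cannot be scoped per cluster when all
          # clusters write to one Mimir database, so instead join against a
          # metric that should exist on every cluster.
          expr: count(kube_node_info) by (cluster) unless count(example_metric) by (cluster)
          for: 30m
          labels:
            area: platform             # mandatory
            team: atlas                # mandatory
            severity: page             # mandatory: page, notify, or ticket
            cancel_if_cluster_status_creating: "true"  # optional inhibition label (illustrative)
          annotations:
            description: "example_metric is missing on cluster {{ $labels.cluster }}."  # mandatory
            runbook_url: "https://example.com/runbooks/example-metric-missing"          # mandatory; illustrative URL
            summary: "A metric is missing on one cluster."                              # recommended
            __dashboardUid__: "abcd1234"                                                # recommended; illustrative
```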
```diff
@@ -188,20 +193,18 @@ To achieve such a test, you should do like [`MimirToGrafanaCloudExporterMissingD
 
 ### Alert routing
 
-Alertmanager does the routing based on the labels menitoned above.
+Alertmanager does the routing based on the labels mentioned above.
 You can see the routing rules in alertmanager's config (opsctl open `alertmanager`, then go to `Status`), section `route:`.
 
-* are sent to opsgenie:
-  * all `severity=page` alerts
-* are sent to slack team-specific channels:
-  * `severity=page` or `severity=notify`
-  * `team` defines which channel to route to.
-
-#### Opsgenie routing
-
-Opsgenie routing is defined in the `Teams` section of the Opsgenie application.
+**Alerts are routed as follows:**
 
-Opsgenie route alerts based on the `team` label.
+* **Sent to PagerDuty:**
+  * All `severity=page` alerts
+* **Sent to GitHub Issues (via alertmanager-to-github):**
+  * All `severity=ticket` alerts
+* **Sent to Slack team-specific channels:**
+  * `severity=page` or `severity=notify` alerts
+  * The `team` label defines which channel to route to
 
 ### Inhibitions
 
```
```diff
@@ -219,29 +222,29 @@ Official documentation for inhibit rules can be found here: https://www.promethe
 
 The recording rules are located in `helm/prometheus-rules/templates/<area>/<team>/recording-rules` in the specific area/team to which they belong.
 
-### Mixins management
+## Mixins management
 
 #### kubernetes-mixins
 
-To Update `kubernetes-mixins` recording rules:
+To update `kubernetes-mixins` recording rules:
 
 * Follow the instructions in [giantswarm-kubernetes-mixin](https://github.com/giantswarm/giantswarm-kubernetes-mixin)
-* Run `./scripts/sync-kube-mixin.sh (?my-fancy-branch-or-tag)` to updated the `helm/prometheus-rules/templates/shared/recording-rules/kubernetes-mixins.rules.yml` folder.
-* make sure to update [grafana dashboards](https://github.com/giantswarm/dashboards/tree/master/helm/dashboards/dashboards/mixin)
+* Run `./scripts/sync-kube-mixin.sh (?my-fancy-branch-or-tag)` to update the `helm/prometheus-rules/templates/shared/recording-rules/kubernetes-mixins.rules.yml` file
+* Make sure to update [grafana dashboards](https://github.com/giantswarm/dashboards/tree/master/helm/dashboards/dashboards/mixin)
 
 #### mimir-mixins
 
 To update `mimir-mixins` recording rules:
 
 * Run `./mimir/update.sh`
-* make sure to update [grafana dashboards](https://github.com/giantswarm/dashboards)
+* Make sure to update [grafana dashboards](https://github.com/giantswarm/dashboards)
 
 #### loki-mixins
 
 To update `loki-mixins` recording rules:
 
 * Run `./loki/update.sh`
-* make sure to update [grafana dashboards](https://github.com/giantswarm/dashboards)
+* Make sure to update [grafana dashboards](https://github.com/giantswarm/dashboards)
 
 #### tempo-mixins
 
```
```diff
@@ -253,7 +256,7 @@ To update `tempo-mixins` alerting rules:
 
 You can run all tests by running `make test`.
 
-There are 4 different types tests implemented:
+There are 4 different types of tests implemented:
 
 - [Prometheus rules unit tests](#prometheus-rules-unit-tests)
 - [Alertmanager inhibition dependency check](#alertmanager-inhibition-dependency-check)
```
```diff
@@ -318,7 +321,7 @@ This is a good example of an input series for testing a `range` query.
 
 #### Test templating
 
-In order to reduce the need for provider-specific test files, you can use `$provider` in your test file and our tooling will replace it with the provider name.
+To reduce the need for provider-specific test files, you can use `$provider` in your test file and the tooling will replace it with the provider name.
 
 #### Test exceptions
```
```diff
@@ -372,7 +375,7 @@ make test-rules rules_type=loki
 
 #### Test "no data" case
 
-* It can be nice to test what happens when serie does not exist.
+* It can be nice to test what happens when a series does not exist.
 * For instance, you can have your first 60 iterations with no data like this: `_x60`
 
 #### Useful links
```
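As an illustration, a promtool unit-test file combining the `$provider` placeholder and the `_x60` "no data" syntax could look like this sketch. The rule file, metric, and alert names are made up:

```yaml
rule_files:
  - example.rules.yml                   # hypothetical rule file

tests:
  - interval: 1m
    input_series:
      # No data for the first 60 samples, then the metric holds value 1.
      - series: 'example_metric{cluster="test", provider="$provider"}'
        values: "_x60 1+0x60"
    alert_rule_test:
      - alertname: ExampleMetricMissing  # hypothetical alert
        eval_time: 90m                   # after data has appeared
        exp_alerts: []                   # the alert should not fire once data exists
```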
```diff
@@ -396,10 +399,10 @@ This is possible thanks to the alertmanager config file stored in the [observabi
 
 This is what we call the inhibition dependency chain.
 
-One can check whether inhibition labels (mostly "cancel_if_" prefixed ones) are well defined and triggered by a corresponding label in the alerting rules by running the `make test-inhibitions` command at the projet's root directory.
+You can check whether inhibition labels (mostly "cancel_if_" prefixed ones) are well defined and triggered by a corresponding label in the alerting rules by running the `make test-inhibitions` command at the project's root directory.
 
-This command will output the list of missing labels. Each of them will need to be defined in either the alerting rules or the alertmanager config file depending on its nature : either an inhibition label or its source label.
-If there is no labels outputed, this means tests passed and did not find missing inhibition labels.
+This command will output the list of missing labels. Each of them will need to be defined in either the alerting rules or the alertmanager config file depending on its nature: either an inhibition label or its source label.
+If no labels are output, this means tests passed and did not find missing inhibition labels.
 
 ![inhibition-graph](assets/inhibition-graph.png)
```
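The `cancel_if_*` mechanism maps onto Alertmanager `inhibit_rules`: a source alert carrying a status label silences target alerts that opted in via the matching `cancel_if_*` label. A minimal sketch, with illustrative label names (the real chain lives in the alertmanager config file mentioned above):

```yaml
inhibit_rules:
  # While an alert with cluster_status_down: "true" fires, any alert that
  # carries cancel_if_cluster_down: "true" is inhibited for the same cluster.
  - source_matchers:
      - cluster_status_down = "true"
    target_matchers:
      - cancel_if_cluster_down = "true"
    equal:
      - cluster
```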
