
Investigation: Migrate the existing alerts to mimir alertmanager #3746

Open · 3 tasks · Tracked by #3743
Rotfuks opened this issue Oct 29, 2024 · 9 comments
Labels: team/atlas Team Atlas

Rotfuks commented Oct 29, 2024

Motivation

We already have a large set of alerts that are managed by the Prometheus Alertmanager. We need to make sure they will also work in the Mimir Alertmanager.

Todo

  • Investigate what we need to do (see the sketch below this list)
    • Potentially just configure the ruler to use the new alertmanager
    • Potentially we need to add a tenant label
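
A minimal sketch of how the ruler-side change could look in the Mimir helm values; the service name, port and structuredConfig nesting are assumptions based on the mimir-distributed chart defaults, not confirmed values:

    mimir:
      structuredConfig:
        ruler:
          # Point the ruler at Mimir's multi-tenant Alertmanager instead of the
          # Prometheus alertmanager-operated service in the monitoring namespace.
          alertmanager_url: http://mimir-alertmanager.mimir.svc:8080/alertmanager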

Outcome

  • we know how to migrate the existing alerts to the new alertmanager
github-project-automation bot moved this to Inbox 📥 in Roadmap Oct 29, 2024
Rotfuks added the team/atlas Team Atlas label Oct 29, 2024
Rotfuks changed the title from "Migrate the existing alerts to mimir alertmanager" to "Investigation: Migrate the existing alerts to mimir alertmanager" Oct 29, 2024
hervenicol self-assigned and then unassigned this Nov 8, 2024
TheoBrigitte self-assigned this Nov 13, 2024
TheoBrigitte commented

I did manage to deploy Mimir's Alertmanager and configure it in Grafana, but I have not yet been able to load alerts in it.

Here are the steps taken so far:

  • Enabled Mimir's Alertmanager in the chart
  • Configured object storage
  • Updated the ruler alertmanager url
  • Configured the datasource in Grafana

Mimir values diff

--- values-mimir.original.yaml	2024-11-13 10:59:57.206365528 +0100
+++ values-mimir.yaml	2024-11-14 09:13:24.947664850 +0100
@@ -1,3 +1,4 @@
+USER-SUPPLIED VALUES:
 hpa:
   distributor:
     enabled: true
@@ -36,6 +37,8 @@
     enabled: true
     image:
       repository: gsoci.azurecr.io/giantswarm/mimir-continuous-test
+  alertmanager:
+    enabled: true
   distributor:
     replicas: 1
     resources:
@@ -102,6 +105,36 @@
         value: golem
       - key: name
         value: giantswarm-golem-mimir-ruler
+  - apiVersion: objectstorage.giantswarm.io/v1alpha1
+    kind: Bucket
+    metadata:
+      annotations:
+        meta.helm.sh/release-name: mimir
+        meta.helm.sh/release-namespace: mimir
+      labels:
+        app.kubernetes.io/instance: mimir-common
+        app.kubernetes.io/managed-by: Helm
+        app.kubernetes.io/name: mimir-common
+        application.giantswarm.io/team: atlas
+      name: giantswarm-golem-mimir-common
+      namespace: mimir
+    spec:
+      accessRole:
+        extraBucketNames:
+        - giantswarm-golem-mimir
+        roleName: giantswarm-golem-mimir
+        serviceAccountName: mimir
+        serviceAccountNamespace: mimir
+      expirationPolicy:
+        days: 100
+      name: giantswarm-golem-mimir-common
+      tags:
+      - key: app
+        value: mimir
+      - key: installation
+        value: golem
+      - key: name
+        value: giantswarm-golem-mimir-common
   gateway:
     autoscaling:
       enabled: true
@@ -186,6 +219,7 @@
         storage:
           backend: s3
           s3:
+            bucket_name: giantswarm-golem-mimir-common
             endpoint: s3.eu-west-2.amazonaws.com
             region: eu-west-2
       distributor:
@@ -208,7 +242,7 @@
         ruler_max_rule_groups_per_tenant: 0
         ruler_max_rules_per_rule_group: 0
       ruler:
-        alertmanager_url: http://alertmanager-operated.monitoring:9093
+        alertmanager_url: "http://mimir-alertmanager.mimir.svc:8080/alertmanager"
       ruler_storage:
         s3:
           bucket_name: giantswarm-golem-mimir-ruler

Grafana values diff

--- values-grafana.original.yaml	2024-11-13 09:36:10.332740268 +0100
+++ values-grafana.yaml	2024-11-14 10:52:40.833635135 +0100
@@ -72,11 +72,14 @@
         - name: Mimir Alertmanager
           type: alertmanager
           uid: mimir-alertmanager
-          url: http://mimir-alertmanager.mimir.svc/alertmanager
+          url: http://mimir-alertmanager.mimir.svc:8080/alertmanager
           access: proxy
           jsonData:
             handleGrafanaManagedAlerts: false
-            implementation: mimir
+            implementation: prometheus
+            httpHeaderName1: X-Scope-OrgID
+          secureJsonData:
+            httpHeaderValue1: 1
     kind: ConfigMap
     metadata:
       annotations:
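
One thing the datasource change alone does not cover is getting an Alertmanager configuration into Mimir for the tenant; without one, Grafana has no contact points or notification policies to display. A hedged sketch of how that could be done with mimirtool (the address, config file and tenant id below are assumptions based on the values above):

    # Upload an Alertmanager configuration for tenant "1" (hypothetical file and tenant id).
    mimirtool alertmanager load ./alertmanager.yaml \
      --address=http://mimir-alertmanager.mimir.svc:8080 \
      --id=1

    # Check what Mimir has stored for that tenant.
    mimirtool alertmanager get \
      --address=http://mimir-alertmanager.mimir.svc:8080 \
      --id=1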

QuantumEnigmaa commented Nov 25, 2024

We decided to use a dedicated service account for mimir-alertmanager and thus also updated the structuredConfig in the values with the following:

    alertmanager_storage:
      s3:
        bucket_name: 'giantswarm-golem-mimir-alertmanager'

... as well as the alertmanager field, like so:

  alertmanager:
    enabled: true
    serviceAccount:
      create: true
      name: "mimir-alertmanager"
      annotations:
        # We use arn:aws-cn:iam for china and arn:aws:iam for the rest
        eks.amazonaws.com/role-arn: arn:aws:iam::<aws-account-id>:role/giantswarm-golem-mimir-alertmanager

However we encountered 2 issues:

In order for us to be able to continue in that direction without too many workarounds, we'll need to wait for the Mimir helm chart's next release.
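
For completeness, a sketch of the extra Bucket object that a dedicated alertmanager bucket would need, mirroring the mimir-common Bucket from the values diff above; all names follow the golem naming used there and are assumptions, not applied values:

    - apiVersion: objectstorage.giantswarm.io/v1alpha1
      kind: Bucket
      metadata:
        name: giantswarm-golem-mimir-alertmanager
        namespace: mimir
      spec:
        accessRole:
          roleName: giantswarm-golem-mimir-alertmanager
          serviceAccountName: mimir-alertmanager
          serviceAccountNamespace: mimir
        expirationPolicy:
          days: 100
        name: giantswarm-golem-mimir-alertmanager
        tags:
        - key: app
          value: mimir
        - key: installation
          value: golem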

QuentinBisson commented

Do we need a custom bucket for this? I think we could use the ruler bucket right?


QuantumEnigmaa commented Nov 27, 2024

I think we could use the ruler bucket right?

The issue is the same: we need the next Mimir release, because the ability to choose the serviceAccount for the mimir-alertmanager was only added recently and is not yet released. This means that with the Mimir version we currently use, mimir-alertmanager can only use the default mimir service account and the associated bucket.

Do we need a custom bucket for this?

I think it's better to have some data segregation, both on a logical level and from a security point of view.

Concerning the tests done on golem, I managed to have the mimir-alertmanager run without any errors as a statefulset by creating its dedicated service account from the extraObjects section of mimir's values (see the sketch below) and manually editing the sts.
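
A sketch of what that extraObjects entry might look like; the names and the IRSA role annotation mirror the config quoted earlier in this issue and the account id stays a placeholder:

    extraObjects:
      - apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: mimir-alertmanager
          namespace: mimir
          annotations:
            # We use arn:aws-cn:iam for china and arn:aws:iam for the rest
            eks.amazonaws.com/role-arn: arn:aws:iam::<aws-account-id>:role/giantswarm-golem-mimir-alertmanager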

Unfortunately, even though the pod runs flawlessly, no notification policies nor contact points are displayed in Grafana for the mimir-alertmanager. The ruler has been redirected towards the mimir-alertmanager and isn't logging any errors, so I'm not sure what's blocking here.


QuentinBisson commented

Did you check the datasource works? Maybe the Grafana logs can help?

QuantumEnigmaa commented

So after checking in the Grafana UI, the mimir-alertmanager datasource isn't working, so I manually created one that works. However, even with this working datasource, there are still no contact points or notification policies associated with it, and neither the mimir-alertmanager pod nor the grafana one gives useful insight on why.

QuentinBisson commented

This is not going to work :D

    ruler:
      alertmanager_url: http://alertmanager-operated.monitoring:9093
      enable_api: true
      rule_path: /data

Now it makes sense that we have no contact points, as we currently do not have an alertmanager fallback config configured.

See:

    alertmanager_fallback_config.yaml: |
      receivers:
        - name: default-receiver
      route:
        receiver: default-receiver

QuantumEnigmaa commented

Yeah, I noticed that in the meantime and moved the ruler.alertmanager_url under the structuredConfig section.
However, I'm struggling with the fallback config as there are almost no examples or docs explaining how to write it :/

QuentinBisson commented

It's a default Alertmanager config, so the one from the old alertmanager should work.
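
For reference, a minimal sketch of what that fallback config could look like, extending the structure quoted above; the Slack webhook URL and channel are placeholders, not values taken from the old alertmanager:

    alertmanager_fallback_config.yaml: |
      route:
        receiver: default-receiver
        group_by: [alertname]
        group_wait: 30s
        repeat_interval: 4h
      receivers:
        - name: default-receiver
          slack_configs:
            - api_url: https://hooks.slack.com/services/REPLACE/ME
              channel: '#placeholder-alerts'
              send_resolved: true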
