
Investigation: Migrate the existing alerts to mimir alertmanager #3746

Open · 3 tasks · Tracked by #3743
Rotfuks opened this issue Oct 29, 2024 · 9 comments
Labels: team/atlas Team Atlas

Rotfuks commented Oct 29, 2024

Motivation

We already have a large set of alerts that are managed by the Prometheus Alertmanager. We need to make sure they will also work in the Mimir Alertmanager.

Todo

  • Investigate what we need to do (see the sketch below this list)
    • Potentially just configure the ruler to use the new alertmanager
    • Potentially we need to add a tenant label
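
A minimal sketch of how the ruler-side change could look in the Mimir helm values; the service name, port and structuredConfig nesting are assumptions based on the mimir-distributed chart defaults, not confirmed values:

    mimir:
      structuredConfig:
        ruler:
          # Point the ruler at Mimir's multi-tenant Alertmanager instead of the
          # Prometheus alertmanager-operated service in the monitoring namespace.
          alertmanager_url: http://mimir-alertmanager.mimir.svc:8080/alertmanager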

Outcome

  • we know how to migrate the existing alerts to the new alertmanager
github-project-automation bot moved this to Inbox 📥 in Roadmap Oct 29, 2024
Rotfuks added the team/atlas Team Atlas label Oct 29, 2024
Rotfuks changed the title from "Migrate the existing alerts to mimir alertmanager" to "Investigation: Migrate the existing alerts to mimir alertmanager" Oct 29, 2024
hervenicol self-assigned and then unassigned this Nov 8, 2024
TheoBrigitte self-assigned this Nov 13, 2024
TheoBrigitte commented

I did manage to deploy Mimir's Alertmanager and configure it in Grafana, but I have not yet been able to load alerts in it.

Here are the steps taken so far:

  • Enabled Mimir's Alertmanager in the chart
  • Configured object storage
  • Updated the ruler alertmanager url
  • Configured the datasource in Grafana

Mimir values diff

--- values-mimir.original.yaml	2024-11-13 10:59:57.206365528 +0100
+++ values-mimir.yaml	2024-11-14 09:13:24.947664850 +0100
@@ -1,3 +1,4 @@
+USER-SUPPLIED VALUES:
 hpa:
   distributor:
     enabled: true
@@ -36,6 +37,8 @@
     enabled: true
     image:
       repository: gsoci.azurecr.io/giantswarm/mimir-continuous-test
+  alertmanager:
+    enabled: true
   distributor:
     replicas: 1
     resources:
@@ -102,6 +105,36 @@
         value: golem
       - key: name
         value: giantswarm-golem-mimir-ruler
+  - apiVersion: objectstorage.giantswarm.io/v1alpha1
+    kind: Bucket
+    metadata:
+      annotations:
+        meta.helm.sh/release-name: mimir
+        meta.helm.sh/release-namespace: mimir
+      labels:
+        app.kubernetes.io/instance: mimir-common
+        app.kubernetes.io/managed-by: Helm
+        app.kubernetes.io/name: mimir-common
+        application.giantswarm.io/team: atlas
+      name: giantswarm-golem-mimir-common
+      namespace: mimir
+    spec:
+      accessRole:
+        extraBucketNames:
+        - giantswarm-golem-mimir
+        roleName: giantswarm-golem-mimir
+        serviceAccountName: mimir
+        serviceAccountNamespace: mimir
+      expirationPolicy:
+        days: 100
+      name: giantswarm-golem-mimir-common
+      tags:
+      - key: app
+        value: mimir
+      - key: installation
+        value: golem
+      - key: name
+        value: giantswarm-golem-mimir-common
   gateway:
     autoscaling:
       enabled: true
@@ -186,6 +219,7 @@
         storage:
           backend: s3
           s3:
+            bucket_name: giantswarm-golem-mimir-common
             endpoint: s3.eu-west-2.amazonaws.com
             region: eu-west-2
       distributor:
@@ -208,7 +242,7 @@
         ruler_max_rule_groups_per_tenant: 0
         ruler_max_rules_per_rule_group: 0
       ruler:
-        alertmanager_url: http://alertmanager-operated.monitoring:9093
+        alertmanager_url: "http://mimir-alertmanager.mimir.svc:8080/alertmanager"
       ruler_storage:
         s3:
           bucket_name: giantswarm-golem-mimir-ruler

Grafana values diff

--- values-grafana.original.yaml	2024-11-13 09:36:10.332740268 +0100
+++ values-grafana.yaml	2024-11-14 10:52:40.833635135 +0100
@@ -72,11 +72,14 @@
         - name: Mimir Alertmanager
           type: alertmanager
           uid: mimir-alertmanager
-          url: http://mimir-alertmanager.mimir.svc/alertmanager
+          url: http://mimir-alertmanager.mimir.svc:8080/alertmanager
           access: proxy
           jsonData:
             handleGrafanaManagedAlerts: false
-            implementation: mimir
+            implementation: prometheus
+            httpHeaderName1: X-Scope-OrgID
+          secureJsonData:
+            httpHeaderValue1: 1
     kind: ConfigMap
     metadata:
       annotations:
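
One thing the datasource change alone does not cover is getting an Alertmanager configuration into Mimir for the tenant; without one, Grafana has no contact points or notification policies to display. A hedged sketch of how that could be done with mimirtool (the address, config file and tenant id below are assumptions based on the values above):

    # Upload an Alertmanager configuration for tenant "1" (hypothetical file and tenant id).
    mimirtool alertmanager load ./alertmanager.yaml \
      --address=http://mimir-alertmanager.mimir.svc:8080 \
      --id=1

    # Check what Mimir has stored for that tenant.
    mimirtool alertmanager get \
      --address=http://mimir-alertmanager.mimir.svc:8080 \
      --id=1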

QuantumEnigmaa commented Nov 25, 2024

We decided to use a dedicated service account for mimir-alertmanager and thus also updated the structuredConfig in the values with the following:

    alertmanager_storage:
      s3:
        bucket_name: 'giantswarm-golem-mimir-alertmanager'

... as well as the alertmanager field, like so:

  alertmanager:
    enabled: true
    serviceAccount:
      create: true
      name: "mimir-alertmanager"
      annotations:
        # We use arn:aws-cn:iam for china and arn:aws:iam for the rest
        eks.amazonaws.com/role-arn: arn:aws:iam::<aws-account-id>:role/giantswarm-golem-mimir-alertmanager

However we encountered 2 issues:

In order for us to be able to continue in that direction without too many workarounds, we'll need to wait for the Mimir helm chart's next release.
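
For completeness, a sketch of the extra Bucket object that a dedicated alertmanager bucket would need, mirroring the mimir-common Bucket from the values diff above; all names follow the golem naming used there and are assumptions, not applied values:

    - apiVersion: objectstorage.giantswarm.io/v1alpha1
      kind: Bucket
      metadata:
        name: giantswarm-golem-mimir-alertmanager
        namespace: mimir
      spec:
        accessRole:
          roleName: giantswarm-golem-mimir-alertmanager
          serviceAccountName: mimir-alertmanager
          serviceAccountNamespace: mimir
        expirationPolicy:
          days: 100
        name: giantswarm-golem-mimir-alertmanager
        tags:
        - key: app
          value: mimir
        - key: installation
          value: golem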

QuentinBisson commented

Do we need a custom bucket for this? I think we could use the ruler bucket right?


QuantumEnigmaa commented Nov 27, 2024

I think we could use the ruler bucket right?

The issue is the same: we need the next Mimir release, because the ability to choose the serviceAccount for the mimir-alertmanager was only added recently and is not yet released. This means that with the Mimir version we currently use, mimir-alertmanager can only use the default mimir service account and the associated bucket.

Do we need a custom bucket for this?

I think it's better to have some data segregation, both on a logical level and from a security point of view.

Concerning the tests done on golem, I managed to have the mimir-alertmanager run without any errors as a statefulset by creating its dedicated service account from the extraObjects section of mimir's values (see the sketch below) and manually editing the sts.
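
A sketch of what that extraObjects entry might look like; the names and the IRSA role annotation mirror the config quoted earlier in this issue and the account id stays a placeholder:

    extraObjects:
      - apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: mimir-alertmanager
          namespace: mimir
          annotations:
            # We use arn:aws-cn:iam for china and arn:aws:iam for the rest
            eks.amazonaws.com/role-arn: arn:aws:iam::<aws-account-id>:role/giantswarm-golem-mimir-alertmanager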

Unfortunately, even though the pod runs flawlessly, no notification policies nor contact points are displayed in Grafana for the mimir-alertmanager. The ruler has been redirected towards the mimir-alertmanager and isn't logging any errors, so I'm not sure what's blocking here.


QuentinBisson commented

Did you check the datasource works? Maybe the Grafana logs can help?

QuantumEnigmaa commented

So after checking in the Grafana UI, the mimir-alertmanager datasource isn't working, so I manually created one that works. However, even with this working datasource, there are still no contact points or notification policies associated with it, and neither the mimir-alertmanager pod nor the grafana one gives useful insight on why.

QuentinBisson commented

This is not going to work :D

    ruler:
      alertmanager_url: http://alertmanager-operated.monitoring:9093
      enable_api: true
      rule_path: /data

Now it makes sense that we have no contact points, as we currently do not have an alertmanager fallback config configured.

See:

    alertmanager_fallback_config.yaml: |
      receivers:
        - name: default-receiver
      route:
        receiver: default-receiver

QuantumEnigmaa commented

Yeah, I noticed that in the meantime and moved the ruler.alertmanager_url under the structuredConfig section.
However, I'm struggling with the fallback config as there are almost no examples or docs explaining how to write it :/

QuentinBisson commented

It's a default Alertmanager config, so the one from the old alertmanager should work.
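
For reference, a minimal sketch of what that fallback config could look like, extending the structure quoted above; the Slack webhook URL and channel are placeholders, not values taken from the old alertmanager:

    alertmanager_fallback_config.yaml: |
      route:
        receiver: default-receiver
        group_by: [alertname]
        group_wait: 30s
        repeat_interval: 4h
      receivers:
        - name: default-receiver
          slack_configs:
            - api_url: https://hooks.slack.com/services/REPLACE/ME
              channel: '#placeholder-alerts'
              send_resolved: true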
