
When using Alloy clustering, got a lot of err-mimir-duplicate-label-names errors #2323

Open
kwkevinchan opened this issue Jan 2, 2025 · 0 comments
Labels
bug Something isn't working

Comments


kwkevinchan commented Jan 2, 2025

What's wrong?

I am migrating our metrics platform from Prometheus to Mimir + Alloy (deployment).
Initially, after deploying Alloy into our EKS cluster, everything seemed to work great.
However, after about 30 minutes, I started seeing a lot of err-mimir-duplicate-label-names errors for random metrics and label names.
Additionally, some metrics have significantly different values compared to our existing Prometheus system's output.

Also, I found a similar issue (#1006) and tried two ways to resolve my problem, but both failed.

After these tests, I switched back to our existing Prometheus with remote_write into Mimir, and Mimir's metrics have looked good so far. Therefore, I believe Mimir's configuration is correct.
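
For reference, that fallback is essentially just a remote_write block in prometheus.yml pointing at the same Mimir gateway. A minimal sketch (the URL mirrors the one used in the Alloy config below; the exact stanza and service name are assumptions, not the config actually deployed):

  # prometheus.yml -- fallback setup, writing directly to the Mimir gateway
  remote_write:
    - url: http://mimir-gateway.metrics.svc.cluster.local/api/v1/push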

Steps to reproduce

  • Deploy the alloy and mimir-distributed Helm charts into EKS (commands sketched below).
  • Configure Alloy to scrape metrics from the Prometheus Operator CRDs (ServiceMonitors and PodMonitors).
  • After running for about 30 minutes, err-mimir-duplicate-label-names errors start appearing in the logs.
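
For context, both charts come from the upstream Grafana Helm repository. A minimal sketch of the deployment commands (release names and the metrics namespace are assumptions, not the exact commands used):

  helm repo add grafana https://grafana.github.io/helm-charts
  helm install mimir grafana/mimir-distributed --namespace metrics --create-namespace
  helm install alloy grafana/alloy --namespace metrics -f alloy-values.yaml   # values shown under "Configuration" below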

System information

EKS 1.29

Software version

Alloy v1.4.3 & Alloy v1.5.1 / Mimir 2.14.0

Configuration

alloy:
  configMap:
    # -- Create a new ConfigMap for the config file.
    create: true
    # -- Content to assign to the new ConfigMap.  This is passed into `tpl` allowing for templating from values.
    content: |-
      logging {
        level  = "warn"
        format = "json"
      }

      prometheus.remote_write "mimir" {
        // Send metrics to a Mimir instance
        endpoint {
          url = "http://_http-metrics._tcp.mimir-gateway.metrics.svc.cluster.local/api/v1/push"

          queue_config {
            sample_age_limit = "5m"
          }
        }
      }

      // import the service monitor
      prometheus.operator.servicemonitors "services" {
        forward_to = [prometheus.remote_write.mimir.receiver]

        // this is the default scrape interval for all service monitors
        // decreasing this value will increase the load on the Mimir write path
        scrape {
          default_scrape_interval = "60s"
        }

        clustering {
          enabled = true
        }
      }

      // import the pod monitor
      prometheus.operator.podmonitors "pods" {
        forward_to = [prometheus.remote_write.mimir.receiver]

        // this is the default scrape interval for all pod monitors
        // decreasing this value will increase the load on the Mimir write path
        scrape {
          default_scrape_interval = "60s"
        }

        clustering {
          enabled = true
        }
      }

      // import the prometheus rules
      mimir.rules.kubernetes "rules" {
          address = "http://_http-metrics._tcp.mimir-gateway.metrics.svc.cluster.local/"
      }


  clustering:
    # -- Deploy Alloy in a cluster to allow for load distribution.
    enabled: true

  extraEnv: 
    - name: "GOMEMLIMT"
      value: "1.8GiB"
    - name: "GOGC"
      value: "95"

  resources: 
    requests:
      cpu: "200m"
      memory: "3Gi"
    limits:
      cpu: "1"
      memory: "3Gi"

image:
  # -- Grafana Alloy image registry (defaults to docker.io)
  registry: "docker.io"
  # -- Grafana Alloy image repository.
  repository: grafana/alloy
  # -- (string) Grafana Alloy image tag. When empty, the Chart's appVersion is
  # used.
  tag: v1.5.1

controller:
  # -- Type of controller to use for deploying Grafana Alloy in the cluster.
  # Must be one of 'daemonset', 'deployment', or 'statefulset'.
  type: 'deployment'

  # -- Number of pods to deploy. Ignored when controller.type is 'daemonset'.
  replicas: 4

  # -- PodDisruptionBudget configuration.
  podDisruptionBudget:
    # -- Whether to create a PodDisruptionBudget for the controller.
    enabled: true
    # -- Maximum number of pods that can be unavailable during a disruption.
    # Note: Only one of minAvailable or maxUnavailable should be set.
    maxUnavailable: 1

Logs

server returned HTTP status 400 Bad Request: received a series with duplicate label name, label: 'status' series: 'nginx_ingress_controller_bytes_sent_sum{container=\"controller\", controller_class=\"k8s.io/ingress-nginx\", controller_namespace=\"ingress-nginx\", controller_pod=\"ingress-nginx-service-controller-****\", ' (err-mimir-duplicate-label-names)
server returned HTTP status 400 Bad Request: received a series with duplicate label name, label: 'zone' series: 'coredns_dns_request_size_bytes_count{container="node-cache", endpoint="metrics", instance="******:9253", job="node-local-dns", namespace="kube-system", pod="node-local-dns-***", proto="udp", ' (err-mimir-duplicate-label-names)
kwkevinchan added the bug label on Jan 2, 2025