
When using Alloy clustering, got a lot of err-mimir-duplicate-label-names errors #2323

Open
kwkevinchan opened this issue Jan 2, 2025 · 0 comments
Labels
bug Something isn't working

Comments


kwkevinchan commented Jan 2, 2025

What's wrong?

I am migrating our metrics platform from Prometheus to Mimir + Alloy (deployment).
Initially, after deploying Alloy into our EKS cluster, everything seemed to work great.
However, after about 30 minutes, I started seeing a lot of err-mimir-duplicate-label-names errors for random metrics and label names.
Additionally, some metrics have significantly different values compared to our existing Prometheus system's output.

Also, I found a similar issue (#1006) and tried two ways to resolve my problem, but both failed.

After these tests, I switched back to our existing Prometheus with remote_write into Mimir, and Mimir's metrics have looked good so far. Therefore, I believe Mimir's configuration is correct.
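
For reference, that fallback is essentially just a remote_write block in prometheus.yml pointing at the same Mimir gateway. A minimal sketch (the URL mirrors the one used in the Alloy config below; the exact stanza and service name are assumptions, not the config actually deployed):

  # prometheus.yml -- fallback setup, writing directly to the Mimir gateway
  remote_write:
    - url: http://mimir-gateway.metrics.svc.cluster.local/api/v1/push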

Steps to reproduce

  • Deploy the alloy and mimir-distributed Helm charts into EKS (commands sketched below).
  • Configure Alloy to scrape metrics from the Prometheus Operator CRDs (ServiceMonitors and PodMonitors).
  • After running for about 30 minutes, err-mimir-duplicate-label-names errors start appearing in the logs.
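
For context, both charts come from the upstream Grafana Helm repository. A minimal sketch of the deployment commands (release names and the metrics namespace are assumptions, not the exact commands used):

  helm repo add grafana https://grafana.github.io/helm-charts
  helm install mimir grafana/mimir-distributed --namespace metrics --create-namespace
  helm install alloy grafana/alloy --namespace metrics -f alloy-values.yaml   # values shown under "Configuration" below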

System information

EKS 1.29

Software version

Alloy v1.4.3 & Alloy v1.5.1 / Mimir 2.14.0

Configuration

alloy:
  configMap:
    # -- Create a new ConfigMap for the config file.
    create: true
    # -- Content to assign to the new ConfigMap.  This is passed into `tpl` allowing for templating from values.
    content: |-
      logging {
        level  = "warn"
        format = "json"
      }

      prometheus.remote_write "mimir" {
        // Send metrics to a Mimir instance
        endpoint {
          url = "http://_http-metrics._tcp.mimir-gateway.metrics.svc.cluster.local/api/v1/push"

          queue_config {
            sample_age_limit = "5m"
          }
        }
      }

      // import the service monitor
      prometheus.operator.servicemonitors "services" {
        forward_to = [prometheus.remote_write.mimir.receiver]

        // this is the default scrape interval for all service monitors
        // decreasing this value will increase the load on the Mimir write path
        scrape {
          default_scrape_interval = "60s"
        }

        clustering {
          enabled = true
        }
      }

      // import the pod monitor
      prometheus.operator.podmonitors "pods" {
        forward_to = [prometheus.remote_write.mimir.receiver]

        // this is the default scrape interval for all pod monitors
        // decreasing this value will increase the load on the Mimir write path
        scrape {
          default_scrape_interval = "60s"
        }

        clustering {
          enabled = true
        }
      }

      // import the prometheus rules
      mimir.rules.kubernetes "rules" {
          address = "http://_http-metrics._tcp.mimir-gateway.metrics.svc.cluster.local/"
      }


  clustering:
    # -- Deploy Alloy in a cluster to allow for load distribution.
    enabled: true

  extraEnv: 
    - name: "GOMEMLIMT"
      value: "1.8GiB"
    - name: "GOGC"
      value: "95"

  resources: 
    requests:
      cpu: "200m"
      memory: "3Gi"
    limits:
      cpu: "1"
      memory: "3Gi"

image:
  # -- Grafana Alloy image registry (defaults to docker.io)
  registry: "docker.io"
  # -- Grafana Alloy image repository.
  repository: grafana/alloy
  # -- (string) Grafana Alloy image tag. When empty, the Chart's appVersion is
  # used.
  tag: v1.5.1

controller:
  # -- Type of controller to use for deploying Grafana Alloy in the cluster.
  # Must be one of 'daemonset', 'deployment', or 'statefulset'.
  type: 'deployment'

  # -- Number of pods to deploy. Ignored when controller.type is 'daemonset'.
  replicas: 4

  # -- PodDisruptionBudget configuration.
  podDisruptionBudget:
    # -- Whether to create a PodDisruptionBudget for the controller.
    enabled: true
    # -- Maximum number of pods that can be unavailable during a disruption.
    # Note: Only one of minAvailable or maxUnavailable should be set.
    maxUnavailable: 1

Logs

server returned HTTP status 400 Bad Request: received a series with duplicate label name, label: 'status' series: 'nginx_ingress_controller_bytes_sent_sum{container=\"controller\", controller_class=\"k8s.io/ingress-nginx\", controller_namespace=\"ingress-nginx\", controller_pod=\"ingress-nginx-service-controller-****\", ' (err-mimir-duplicate-label-names)
server returned HTTP status 400 Bad Request: received a series with duplicate label name, label: 'zone' series: 'coredns_dns_request_size_bytes_count{container="node-cache", endpoint="metrics", instance="******:9253", job="node-local-dns", namespace="kube-system", pod="node-local-dns-***", proto="udp", ' (err-mimir-duplicate-label-names)
kwkevinchan added the bug label on Jan 2, 2025