Tracking: Address Clustering Issues #784
Comments
I am noticing this behaviour on a k8s cluster (~1800 pods), with an Alloy cluster of 3, Istio present, and pod autodiscovery enabled. We're experiencing:
I noticed an issue today where one of the pods fell out of the clustering: it's present in discovery, but none of the Alloy pods actually scrape it. This didn't go away over a long period of time, so I am not sure if it's related to #1 mentioned.
This is my Helm configuration for my deployment of Alloy (24 nodes, so 24 Alloy pods) when I enable clustering:
alloy:
  configMap:
    content: |-
      prometheus.remote_write "default" {
        endpoint {
          url = "http://mimir-gateway.monitoring.svc:80/api/v1/push"
        }
      }

      prometheus.operator.servicemonitors "services" {
        forward_to = [prometheus.remote_write.default.receiver]
        clustering {
          enabled = true
        }
      }

      prometheus.operator.podmonitors "pods" {
        forward_to = [prometheus.remote_write.default.receiver]
        clustering {
          enabled = true
        }
      }
  clustering:
    enabled: false
  resources:
    requests:
      cpu: 100m
      memory: 2Gi
    limits:
      cpu: 1.5
      memory: 12Gi
configReloader:
  resources:
    requests:
      cpu: "1m"
      memory: "5Mi"
    limits:
      cpu: 10m
      memory: 10Mi
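One thing worth noting about the values above: chart-level clustering is disabled (alloy.clustering.enabled: false) while the prometheus.operator.* components have their clustering blocks enabled. The component-level clustering block only distributes targets when Alloy itself runs in clustered mode, so if the goal is to spread scraping across the 24 pods, the chart-level switch likely needs to be on as well. A minimal sketch of that override for the grafana/alloy chart (the controller settings are illustrative assumptions, not taken from this issue):

```yaml
# Sketch only: enable chart-level clustering so the component-level
# `clustering { enabled = true }` blocks can distribute targets across peers.
alloy:
  clustering:
    enabled: true
controller:
  type: statefulset # assumption: deployment or daemonset also work; match your setup
  replicas: 3       # assumption: illustrative replica count
```

Conversely, if clustering is intentionally off at the chart level, removing the component-level clustering blocks avoids the mismatch.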
I have 3 Alloy agent replicas (CPU: 1000m / memory: 4Gi each). My first question: this feels like a load issue, is that correct?
@itjobs-levi @christopher-wong @gowtham-sundara @diguardiag could you open issues for these and provide clear steps to reproduce? These may need to be looked into separately.
Closing this as done. The stretch goals are left as issues in their own right.
Request
There are a few issues that users report and that we are observing, which can lead to data problems:
1) there can be gaps in metrics under some circumstances when instances join the cluster,
2) there can be elevated errors and alerts when writing to the TSDB in some cases,
3) there can be duplicated metrics in other cases.
The impact so far appears limited, but since these are potential data loss issues, we want to fully understand and address them.
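For context on why joins can cause gaps or duplicates: a clustering-enabled scrape component hashes its discovered targets across the current set of peers, and while membership is still converging a target can briefly be owned by no instance (a gap) or by more than one (duplicates). A minimal sketch of that pattern, shown here with prometheus.scrape rather than the operator components quoted above (component labels are illustrative; the remote_write URL mirrors the quoted config):

```
// Sketch: Kubernetes pod discovery feeding a clustered scrape component.
// With clustering enabled, each Alloy instance only scrapes the subset of
// targets that hashes to it; ownership is recomputed when peers join or leave.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.default.receiver]

  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://mimir-gateway.monitoring.svc:80/api/v1/push"
  }
}
```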
Use case
Data being sent should not be dropped.
Tasks