Tracking: Address Clustering Issues #784
Comments
I am noticing this behaviour on a k8s cluster (~1800 pods), with an Alloy cluster of 3, Istio present, and pod autodiscovery enabled. We're experiencing:
I noticed an issue today where one of the pods fell out of the clustering: it's present in discovery, but none of the Alloy pods actually scrape it. This didn't go away over a long period of time, so I am not sure if it's related to #1 mentioned.
This is my Helm configuration for my deployment of Alloy (24 nodes, so 24 Alloy pods) when I enable clustering:
alloy:
  configMap:
    content: |-
      prometheus.remote_write "default" {
        endpoint {
          url = "http://mimir-gateway.monitoring.svc:80/api/v1/push"
        }
      }

      prometheus.operator.servicemonitors "services" {
        forward_to = [prometheus.remote_write.default.receiver]
        clustering {
          enabled = true
        }
      }

      prometheus.operator.podmonitors "pods" {
        forward_to = [prometheus.remote_write.default.receiver]
        clustering {
          enabled = true
        }
      }
  clustering:
    enabled: false
  resources:
    requests:
      cpu: 100m
      memory: 2Gi
    limits:
      cpu: 1.5
      memory: 12Gi
configReloader:
  resources:
    requests:
      cpu: "1m"
      memory: "5Mi"
    limits:
      cpu: 10m
      memory: 10Mi
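One thing worth noting about the values above: chart-level clustering is disabled (alloy.clustering.enabled: false) while the prometheus.operator.* components have their clustering blocks enabled. The component-level clustering block only distributes targets when Alloy itself runs in clustered mode, so if the goal is to spread scraping across the 24 pods, the chart-level switch likely needs to be on as well. A minimal sketch of that override for the grafana/alloy chart (the controller settings are illustrative assumptions, not taken from this issue):

```yaml
# Sketch only: enable chart-level clustering so the component-level
# `clustering { enabled = true }` blocks can distribute targets across peers.
alloy:
  clustering:
    enabled: true
controller:
  type: statefulset # assumption: deployment or daemonset also work; match your setup
  replicas: 3       # assumption: illustrative replica count
```

Conversely, if clustering is intentionally off at the chart level, removing the component-level clustering blocks avoids the mismatch.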
I have 3 Alloy agent replicas (CPU: 1000m / memory: 4Gi each). My first question: this feels like a load issue, is that correct?
@itjobs-levi @christopher-wong @gowtham-sundara @diguardiag could you open issues for these and provide clear steps to reproduce? These may need to be looked into separately.
Closing this as done. The stretch goals are left as issues in their own right.
Request
There are a few issues that users report and that we are observing, which can lead to data problems:
1) there can be gaps in metrics under some circumstances when instances join the cluster,
2) there can be elevated errors and alerts when writing to the TSDB in some cases,
3) there can be duplicated metrics in other cases.
The impact so far appears limited, but since these are potential data loss issues, we want to fully understand and address them.
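For context on why joins can cause gaps or duplicates: a clustering-enabled scrape component hashes its discovered targets across the current set of peers, and while membership is still converging a target can briefly be owned by no instance (a gap) or by more than one (duplicates). A minimal sketch of that pattern, shown here with prometheus.scrape rather than the operator components quoted above (component labels are illustrative; the remote_write URL mirrors the quoted config):

```
// Sketch: Kubernetes pod discovery feeding a clustered scrape component.
// With clustering enabled, each Alloy instance only scrapes the subset of
// targets that hashes to it; ownership is recomputed when peers join or leave.
discovery.kubernetes "pods" {
  role = "pod"
}

prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.default.receiver]

  clustering {
    enabled = true
  }
}

prometheus.remote_write "default" {
  endpoint {
    url = "http://mimir-gateway.monitoring.svc:80/api/v1/push"
  }
}
```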
Use case
Data being sent should not be dropped.
Tasks