What's wrong?
We're using prometheus.write.queue to gather metrics persistently and send them to remote storage once the device is online (as in, has internet access).
We want an overview of whether Alloy is able to send all of the metrics, and a metric that shows how many samples have been collected but not yet sent to remote storage. Importantly, this should not be a counter of how many samples were gathered and saved to the WAL, but the actual number (or at least an estimate) of samples that are in the WAL but not yet sent, so that the value stays meaningful even when Alloy restarts.
This is crucial for us: we need to know how many samples have been scraped but not yet sent, because we plan to use it as a basis for alerting and observation.
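As a rough illustration of what we're after, the alert would look something like this in PromQL (the gauge is the one the docs mention; the threshold is an arbitrary example value):

# Hypothetical alert expression: fires when samples pile up in the queue
# without being delivered to remote storage. The 100000 threshold is only an
# example; the gauge is the one named in the docs but, as described below, it
# does not currently seem to be exposed for prometheus.write.queue.
prometheus_remote_storage_samples_pending > 100000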
The docs say this metric exists:
prometheus_remote_storage_samples_pending (gauge): The number of samples pending in shards to be sent to remote storage.
and it seems to be exactly what we want, but it is apparently not exposed by Alloy's meta-monitoring (I can't find it in Grafana when using the Prometheus this Alloy writes to as a datasource).
Is it supposed to be there?
If not, I suggest updating the docs to remove it and adding a metric that reports the number of samples that are stored but not yet sent (probably a gauge, since it can also decrease, unlike a counter).
If yes, are there any additional steps needed to enable it? (If so, that should probably be documented.)
Steps to reproduce
Run Alloy with a config similar to the one provided below.
Observe that no prometheus_remote_storage_samples_pending metric is written into Prometheus.
System information
No response
Software version
1.5.0
Configuration
logging {
  level  = "info"
  format = "logfmt"
}

// System metrics, from node_exporter
prometheus.exporter.unix "node_exporter" { }

// Alloy built-in metrics
prometheus.exporter.self "alloy" { }

// CAdvisor
prometheus.exporter.cadvisor "cadvisor" {
  docker_host = "unix:///var/run/docker.sock"
}

// Metrics scrape configuration
prometheus.scrape "node_exporter" {
  targets = array.concat(
    // Scraping node_exporter
    prometheus.exporter.unix.node_exporter.targets,
    // Scraping Alloy built-in metrics
    prometheus.exporter.self.alloy.targets,
    // Scraping CAdvisor metrics
    prometheus.exporter.cadvisor.cadvisor.targets,
  )
  scrape_interval = "60s"
  honor_labels    = true

  // Sending these scraped metrics to remote Prometheus via prometheus.write.queue.
  forward_to = [prometheus.write.queue.default.receiver]
}

prometheus.write.queue "default" {
  endpoint "default" {
    url          = env("PROMETHEUS_HOST")
    bearer_token = env("PROMETHEUS_TOKEN")
  }

  // Keep 1 week of data, in case it wasn't sent.
  // More on WAL and its internals:
  // https://grafana.com/docs/alloy/latest/reference/components/prometheus/prometheus.remote_write/#wal-block
  ttl = "168h"

  persistence {
    batch_interval = "10s"
  }
}
Logs
At the moment this is a documentation issue; I'll submit a PR to remove that metric from the docs. Right now, checking the in vs. out timestamp is likely the best approach. Adding a more bespoke metric probably makes sense, but we'd need to figure out how to not make it too chatty.
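For reference, a rough PromQL sketch of that in-vs-out timestamp comparison could look like the one below. The metric names are assumptions about what prometheus.write.queue exposes for its serializer (incoming) and network (outgoing) sides; check the component's metrics list for the exact names in your Alloy version:

# Approximate send lag in seconds: newest timestamp appended to the WAL minus
# newest timestamp confirmed as sent to the remote endpoint.
# NOTE: metric names are assumed and may differ in your Alloy version.
alloy_queue_series_serializer_incoming_timestamp_seconds
  - alloy_queue_series_network_timestamp_seconds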
Yeah, we'll use the metric you suggested for now, but ideally there should be something that reflects how many samples have not yet been sent to remote storage.
So, just stating that this is a metric we want to be available and need (and we're likely not the only ones who need it), it would be lovely if it were implemented at some point.
Also, I'm not yet sure whether samples are always sent in order (as in, an older sample is always sent before a newer one)? If not, the timestamp-difference metric might report the wrong results, I guess.
The guarantee is that samples for a given series are sent in timestamp order, and in practice the timestamps are close enough. Generally I would consider anything under a minute of lag as good, though you could likely shave that a bit lower depending on your use case.
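Applying that guideline, an alert on the lag from the sketch above (same assumed metric names, with the roughly one-minute threshold mentioned here) might look like:

# Fires when the newest sample written to the WAL is more than a minute ahead
# of the newest sample confirmed as sent. Metric names are assumed, as above.
(alloy_queue_series_serializer_incoming_timestamp_seconds
  - alloy_queue_series_network_timestamp_seconds) > 60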
I concur that it would be good to have, though. If you want to submit a PR, it would be welcome at github.com/grafana/walqueue. I will be unavailable for a good chunk of the rest of the year, but if it's still outstanding when I get back I will code it in.