Metric timestamps should be passed through #24
Comments
This might be one of the cases where exposing a timestamp is legitimate. Client libraries don't support that yet (for reasons…). I filed prometheus/client_golang#187 to consider during my ongoing …
Running into the same issue here. It looks like supporting timestamps in client_golang is not a good choice.
I've read through this issue, prometheus/client_golang#187 and prometheus/prometheus#398, and I'm still unclear on what the problems are when the timestamp is set at the source. Could you point me in the right direction (issues, docs, tweets, whatever you have)? This comment on #187 suggests using the push gateway without timestamps for telemetry data (which collectd data closely resembles, IMHO) and predicts mayhem if timestamps are set – without going into any kind of detail.

I don't think this is a good solution for collectd_exporter: there is buffering in both the write_http plugin and the network plugin (the plugins used to send metrics to collectd_exporter), meaning that some metrics stay in the buffer until the next interval / "scrape". Which metrics those are is nondeterministic, i.e. it could be a different metric each time. It is normal to receive an update for a metric in the first push, then none in the second, and then two updates almost at the same time in the third interval. In other words, it is normal that the time difference between updates for a metric received by the push gateway is between 0% and 200% of the interval at which the metric is read / "scraped".
Users expect it to Just Work, when in reality staleness handling makes it not work out. We have seen numerous users setting timestamps and running into problems (indeed it's one of our standard questions now when a user is getting odd behaviour). The collectd_exporter is only intended as a transition mechanism, and there is no 100% correct way to handle it as there's no correct way to convert a push system to a pull system. If this is a major problem for you, I'd suggest expediting your plans to move on to a full-on standard Prometheus solution.
@brian-brazil, I am wondering why the pushgateway supports timestamps but collectd_exporter doesn't. In my understanding, collectd_exporter should work like the pushgateway; the only difference is that it only speaks the collectd protocol. Right?
There are historical reasons there; it's a little before my time with the project, but it appears it took a while to realise that setting timestamps was a bad idea (plus at least one other anti-pattern). It will be needed in the future for some advanced use cases, though even when staleness is fixed, general users should not push timestamps to the pushgateway.
I am sympathetic to your goal of minimizing the support burden; I know that song well. However, it is not "just working", which is why we're having this discussion.
Can you point me at concrete examples? I'd like to understand the nature and scope of these problems. If this is such a common problem, have you considered writing an FAQ entry? For context, I have a WIP that currently does export the timestamp, and I believe this is the correct approach. I'd like to understand your concerns and potential limitations before going forward with this, though.
Not true, it's not even all that hard. There are some problems when the poll interval and the interval of the (push-based) metrics don't match, but that can easily be worked around by allowing multiple metric updates per scrape. If this is hard, then it's hard due to Prometheus' implementation, not because it's a hard problem. IMHO the core of the problem is that Prometheus assumes it controls the scraping time and rate, which is only true for the simplest use case, namely scraping an instrumented application. As soon as you proxy metrics, e.g. a global Prometheus scraping multiple local Prometheus servers, this assumption breaks down and you're having a bad time. (insert alien skiing teacher)
Hahaha, yeah, not going to happen – I'm happy to contribute to Prometheus every now and then, but that's asking too much. ;-) Tell you what, though: if you take the time to explain those problems to me, I'll take the time to send you a PR adding the answer to the FAQ.
I don't see why a user doing something that's explicitly not recommended and which doesn't then work is a problem.
It's discussed on https://github.com/prometheus/pushgateway/#about-timestamps
The concern is that if we allow that to be possible, it'll cause far more harm than good. Even with us not currently supporting this and actively discouraging it, it's already causing us a continuous stream of problems.
This is a core assumption of Prometheus and a completely reasonable one to make in our architecture for both instrumented and non-instrumented applications.
That's easy in theory, but is actually quite risky in reliability terms. This is a hard problem, and in line with our general principles we have elected for simplicity and to accept a small amount of data loss rather than risk a cascading overload.
The assumption still holds there, but you've other (unsolvable) problems then. Proxying metrics is perfectly fine and supported.
We currently have no general documentation on prometheus.io that even mentions using timestamps (it's only in the spec for the exposition format), and yet users try. I think adding an FAQ entry would make that worse by making more users aware that it's even possible, and then trying to use it misguidedly and failing.
This is the major misuse of timestamps we see with the pushgateway and is not a supported use case. See https://prometheus.io/docs/practices/pushing/. Fundamentally, Prometheus is a pull-based system designed for reliability over 100% correctness. It's easy to convert pull to push, but not the other way around. If this is a major problem for you, you may want to consider a different monitoring system, as you'll be swimming upstream trying to make Prometheus do push.
Hi @brian-brazil, thank you very much for your reply!
This issue is not about users doing something wrong. We're talking about a restart of Prometheus causing incorrect rates, and about whether having collectd_exporter send the metric timestamps could work around this issue. It's about giving users the "it just works" experience.
To (mostly) work around this issue, two features are needed:
Arguably, collectd_exporter should craft the ProtoBufs itself, since a client library geared towards application instrumentation is not a good fit for this use case.

Each metric received from collectd includes the interval in which you can expect updates. Typically this interval is much shorter than 5 minutes and defaults to 10 seconds. When there hasn't been an update for a while (collectd defaults to 2×interval), collectd_exporter can remove the metric from its in-memory store (a rough sketch of this eviction logic follows after this comment). The pushgateway is unable to do this, because it doesn't know in which interval to expect metric updates. I have already made the point that storing the interval alongside the timestamp is a good idea, so I won't press it again.

Example: Assume a metric is received with time …

Caveat: This doesn't work for the (less common) case in which the interval of the metric exceeds the staleness threshold of Prometheus. There are a couple of possible behaviors for collectd_exporter. It could …
All of these require the user to configure collectd_exporter to match the "staleness threshold" of Prometheus, which is unfortunately unavoidable with the current API.

Best regards,
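To make the eviction idea in the comment above concrete, here is a minimal Go sketch. It is not collectd_exporter's actual code; the type and identifier names are invented for illustration. It stores, for each metric, the timestamp and interval reported by collectd, and stops exposing (and evicts) any sample that hasn't been updated within 2×interval.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// sample holds one metric value together with the metadata collectd sends.
type sample struct {
	value    float64
	when     time.Time     // timestamp reported by collectd
	interval time.Duration // update interval reported by collectd
}

type store struct {
	mu      sync.Mutex
	samples map[string]sample
}

func newStore() *store {
	return &store{samples: make(map[string]sample)}
}

// update records a value together with the timestamp and interval collectd sent.
func (s *store) update(key string, v float64, t time.Time, interval time.Duration) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.samples[key] = sample{value: v, when: t, interval: interval}
}

// fresh returns the samples updated within 2×interval and evicts the rest,
// mirroring collectd's own default timeout of twice the interval.
func (s *store) fresh(now time.Time) map[string]sample {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]sample)
	for k, smpl := range s.samples {
		if now.Sub(smpl.when) > 2*smpl.interval {
			delete(s.samples, k) // stale: drop instead of exposing it
			continue
		}
		out[k] = smpl
	}
	return out
}

func main() {
	s := newStore()
	s.update("cpu0/user", 42, time.Now().Add(-25*time.Second), 10*time.Second)
	s.update("cpu0/idle", 99, time.Now(), 10*time.Second)
	// Only cpu0/idle is still fresh; cpu0/user is evicted because 25s > 2×10s.
	fmt.Println(s.fresh(time.Now()))
}
```

On scrape, only the result of `fresh` would be rendered into the exposition format, so a metric that collectd has stopped sending disappears from the `/metrics` page on its own.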
As the main constraint is Prometheus's staleness handling (discussed in prometheus/prometheus#398), I suggest that it should be made explicit by adding a …

Functionally, there's little difference between dropping the data in collectd_exporter and dropping it in Prometheus proper. For clarity of operation, it's important that stale values are dropped from collectd_exporter's own …

If I read the thread correctly, the remaining concerns are: the difficulty of modifying collectd_exporter to add timestamp passthrough given its current implementation choices; and the general danger that any improvement will tend to entrench the idea that collectd_exporter is more than a transitional tool. Did I miss anything?
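For illustration only: the flag name in this sketch (`collectd.timestamp-staleness`) is hypothetical, since the name proposed in the comment above was lost in the quoting. It just shows how such a knob could be wired up with Go's standard flag package so it can be set to match Prometheus's staleness threshold (5 minutes by default).

```go
package main

import (
	"flag"
	"log"
	"time"
)

// Hypothetical flag; the default mirrors Prometheus's 5-minute staleness window.
var staleness = flag.Duration("collectd.timestamp-staleness", 5*time.Minute,
	"Metrics that haven't been updated for longer than this are dropped from the exposition.")

func main() {
	flag.Parse()
	log.Printf("dropping metrics older than %v", *staleness)
}
```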
It already does that, as collectd provides sufficient information to do that automatically (for other exporters like graphite and influx we need the flag).
Sorry for the late reply. Just reached the end of my list of "urgent things to do"... ;)
The client library is meant to support both use cases: instrumenting applications, and writing exporters.

A general wild thought: if collectd is a system that regularly updates metric values, it is not too different from a Prometheus server, and what collectd_exporter is doing could be seen as similar to what a Prometheus server does when another Prometheus federates from it. Just as we set timestamps on federated metrics, we could set timestamps on metrics exported by collectd_exporter, with the premise that stale metrics are no longer exported, just as a Prometheus server doesn't return stale metrics when federated from.
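For reference (not taken from the thread), this is what an explicit timestamp looks like in the Prometheus text exposition format, which is also what federation produces: the optional last field is a Unix timestamp in milliseconds. The metric name and label values here are invented for illustration.

```
# TYPE collectd_cpu counter
collectd_cpu{cpu="0",type="user",instance="host1"} 12345 1472640000000
```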
Collectd supplies POSIX timestamps, and Prometheus understands them in the exposition formats. However, collectd_exporter exports all metrics with the implied "now" timestamp. This leads to calculation errors. For example, I recently had a counter, normally updated once a minute, which was scraped only 52s after the preceding scrape (due to a Prometheus restart). The underlying counter values were time-stamped 1 minute apart in the source collectd, but Prometheus thought they were only 52s apart, so it computed the rate incorrectly. The result was an apparent spike to over 110% CPU.
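A minimal sketch with made-up numbers illustrating the size of the error described above: if the counter really advanced over a 60-second window but Prometheus divides by the 52 seconds between its scrapes, the computed rate is inflated by a factor of 60/52 ≈ 1.15.

```go
package main

import "fmt"

func main() {
	// Made-up numbers roughly matching the incident above: a CPU counter that
	// increased by 60 CPU-seconds over a true 60-second window (100% of a core).
	const increase = 60.0

	trueInterval := 60.0   // seconds between the underlying collectd updates
	scrapeInterval := 52.0 // seconds between the two Prometheus scrapes after the restart

	fmt.Printf("actual rate:   %.0f%% CPU\n", increase/trueInterval*100)   // 100% CPU
	fmt.Printf("computed rate: %.0f%% CPU\n", increase/scrapeInterval*100) // ~115% CPU
}
```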