Reject inconsistent metrics at push time (and create a synthetic metric about it) #202
If this is about inconsistent label dimensions: they should be fine with the new client library, see prometheus/client_golang#417. I'll release a new Pushgateway soon, using the client library with this change. However, if the metric is problematic in other ways, the behavior will still be the same, i.e. the push will succeed, but the scrape will not. That's mostly for performance reasons: checking on each push has O(n²) runtime. In practice, it might not be that bad, but that has to be explored. The change in the code would be quite invasive.
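For context, a typical push with client_golang looks like the minimal sketch below (the Pushgateway address, job name, and metric are hypothetical, not taken from this thread). Under the behavior described above, such a push is accepted as long as it parses, and any inconsistency with already-pushed metrics only surfaces as an error at scrape time.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical metric for a daily backup job.
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "db_backup_last_completion_timestamp_seconds",
		Help: "Timestamp of the last completed database backup.",
	})
	completionTime.SetToCurrentTime()

	// Push replaces all metrics of this group on the Pushgateway.
	if err := push.New("http://localhost:9091", "db_backup").
		Collector(completionTime).
		Push(); err != nil {
		log.Fatal("push failed: ", err)
	}
}
```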
Hi @beorn7, I understand that could be quite invasive, but having the whole application fail to return metrics (maybe some thousands of metrics) because one of hundreds of endpoints sent an invalid one is wrong behaviour. If you or other Pushgateway developers are worried about performance or about the change being too invasive, let users enable it with a command-line flag... I'll test the new version when it's out and let you know anyway :)
First of all, if you have thousands of metrics on your PGW, the odds are you are using it wrong. See https://prometheus.io/docs/practices/pushing/ . The PGW is fairly lightweight. In general, I would recommend using one per team or service that needs it. That way, collisions of metrics are less likely, and a failure won't affect larger groups.

About the O(n²): this is just in the nature of the problem. Checking whether the total exposition the PGW would offer to scrapes is consistent is broadly proportional to the number of metrics pushed, and the check has to be performed on each push. Thus, if you double the number of pushers, you double the number of checks, and each check takes twice as long because twice as many metrics are pushed, so the computational effort increases by a factor of four. One might improve this with some hash tables tracking the already pushed metrics, but that makes the code even more involved.

In any case, a rejected push would need to be represented in the scrape in some way. You must be able to alert on it.
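To make the scaling concrete with a rough back-of-the-envelope model (the symbols here are illustrative, not from the thread): suppose there are $p$ pushers, each pushing $k$ metrics, so the PGW holds about $n \approx p\,k$ metrics, and each of the $p$ pushes in a given time frame triggers a consistency check over all $n$ of them. Then

$$\text{checking cost} \;\propto\; p \cdot n \;=\; p^2 k,$$

so doubling the number of pushers roughly quadruples the cost, which is the quadratic behaviour referred to above.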
Is there any update on this? I tried with the latest version available now (0.9.1) but get the same set of errors. I do not understand the logic at all as to why GET on /metrics has to fail... it is the push that should fail, not the GET. Another option is to provide a GET endpoint; then clients could call GET before attempting the push... (I still don't understand what O(n²) refers to here.) Please do not fail GET on /metrics. Scraping will stop working due to this issue (and it was working a few months ago).
The state is that the problem is well understood and a solution is sketched out in my head.
This is finally a solution to #202 (and as a byproduct to #94).

The original wisdom used to be that checking pushed metrics during the push should not be done because it is inherently an O(n*m) problem: n is the number of metrics in the PGW, and m is the number of pushes in a certain time frame. Upon each push, a check has to run, and that check has a cost linear in the total number of metrics. Thus, if you scale up something by a factor of 2, you can expect twice as many pushers pushing in total twice as many metrics, in twice as many push operations, resulting in 2*2=4 times the cost of checking metrics. Since we didn't want to let pushers wait for a potentially expensive check, we just accepted the push, as long as it was parseable, and deferred the error detection to scrapes.

The result was that a single inconsistent push would result in an error when scraping the PGW. The bright side here is that the problem will be spotted quite easily, and there is a strong incentive to fix it. The problematic part is that a single rogue pusher affects all systems that depend on the same PGW, possibly at inconvenient times.

In sane use cases for the PGW, scrapes happen more often than pushes. Let's say a database backup pushes its success or failure outcome once per day, while scrapes happen a couple of times per minute. While the whole O(n*m) calculation above is not wrong, the reality is that the scrapes will inflict the cost anyway. If the check is too expensive per push, it will be even more so per scrape. Also, pushers shouldn't be too sensitive to slightly enlarged push times. Finally, if a PGW is so heavily loaded with metrics that the check actually takes prohibitively long, we can be quite sure that this PGW is being abused for something it was not meant for (like turning Prometheus into a push-based system).

The solution is conceptually quite easy: on each push, first feed the push into a copy of the current metrics state (which has a certain cost, but that is done anyway whenever somebody looks at the web UI) and then simulate a scrape. In that way, we can reuse the validation power of prometheus/client_golang. We don't even need to re-architect the whole framework of queueing pushes and deletions. We only need to talk back via an optional channel. Pushes will now get a 200 or a 400 rather than the "(almost) always 202" response we had before. Easy, isn't it?

Unfortunately, the easy change became _way_ more invasive than I anticipated. Here is an overview of the changes in this commit (also see the changes in README.md, which explain the usage side of things):

- Previously, the PUT request was executed in two parts: first, delete all metrics in the group, then add the newly pushed metrics to the group. That's now a problem. If the second part fails, it will leave behind an empty group rather than the state before the whole PUT request, as the user will certainly expect. Thus, the operation had to be made atomic. To accomplish that, the `storage.WriteRequest` now has an additional field `Replace` to mark replacement of the whole group if true. This field is in addition to the back channel mentioned above (called `Done` in the `WriteRequest`).

- To enable alerting on failed pushes, there is now a new metric `push_failure_time_seconds` with the timestamp of the last failed push (or 0 if there never was a failed push). Ideally, we would rename the existing metric `push_time_seconds` to `push_success_time_seconds`, but that would break all existing uses of that metric. Thus, I left the name as is, although `push_time_seconds` is only updated upon a successful push. (This allows alerting on `push_failure_time_seconds > push_time_seconds` and at the same time tracks the last successful push.) There are some subtle aspects of implementing this correctly, mostly around the problem that a successful push should leave a pre-existing failure timestamp intact and vice versa.

- Previously, the push handler code already had a bunch of validation and sanitation code (block timestamps, sanitize labels) and also added the `push_time_seconds` metric. Now that the storage layer is doing the consistency check for the pushed metrics, it is much cleaner to have all the other validation and sanitation code there as well. (Arguably, it was already wrong to have that much logic in the push handler.) Also, the decision needed to properly set the `push_time_seconds` and `push_failure_time_seconds` metrics can now only be made after the consistency check, so those metrics have to be set in the storage layer anyway. This change of responsibility also changed the tests quite a bit. The push handler tests now test only that the right method calls happen to the storage, while the many tests for validation, consistency checks, adding the push timestamp metrics, and proper sanitation are now part of the storage tests. (Digression: one might argue that the cleanest solution is a separate validation layer in front of a dumb storage layer. Since the only implementation of the storage for now is the disk storage (and a mock storage for testing), I decided not to make that split prematurely.)

- Logging of rejected pushes happens at error level now. Strictly speaking, a rejected push is just a user error and not an error of the PGW itself. But experience tells us that users first check the logs, and they usually don't run on debug or even info level. I hope this will avoid a whole lot of user support questions.

Groups where the last push has failed are also marked in the web UI.

Signed-off-by: beorn7 <[email protected]>
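To illustrate the "feed the push into a copy of the current state and simulate a scrape" idea described in the commit message above, here is a minimal, hypothetical sketch (not the actual Pushgateway code; the function and variable names are made up) that reuses the consistency checking built into prometheus/client_golang's `Gatherers`:

```go
package sketch

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// checkPush merges the metric families the Pushgateway currently exposes
// with the newly pushed ones and runs a simulated scrape over the result.
// Gatherers.Gather reports duplicate metrics and metric families with
// mismatching types or help strings, i.e. the kind of inconsistency that
// previously only surfaced when the real scrape failed.
func checkPush(current, pushed []*dto.MetricFamily) error {
	merged := prometheus.Gatherers{
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) { return current, nil }),
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) { return pushed, nil }),
	}
	_, err := merged.Gather()
	return err // nil: accept the push (200); non-nil: reject it (400)
}
```

If the simulated scrape fails, the push can be answered with a 400 (for example by sending the error back over a `Done`-style channel, as the commit message describes) and the stored state stays untouched.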
Thanks so much. I very much appreciate it!
Fix is in master. Will be released once #287 is done. But please feel free to build from master and try it out. It's a fairly invasive change, and some pre-release feedback will be appreciated.
and instead it should log the malformed data that it has received and then discard it..
there is already an issue in the collectd exporter, and I've already discussed it there:
prometheus/collectd_exporter#56 (comment)
I have no skill in Go and zero time to help, sorry :(