
Reject inconsistent metrics at push time (and create a synthetic metric about it) #202

Closed
fvigotti opened this issue Sep 13, 2018 · 8 comments


@fvigotti

and instead it should log the malformed data it has received and then discard it.
There is already an issue about this in the collectd exporter, and I've already discussed it there:
prometheus/collectd_exporter#56 (comment)

I have no Go skills and zero time to help, sorry :(

@beorn7
Member

beorn7 commented Sep 13, 2018

If this is about inconsistent label dimensions: they should be fine with the new client library, see prometheus/client_golang#417 . I'll release a new Pushgateway soon, using the client library with this change.

However, if the metric is problematic in other ways, the behavior will still be the same, i.e. the push will succeed, but the scrape will not. That's mostly for performance reasons: checking on each push has O(n²) runtime. In practice, it might not be that bad, but that has to be explored. The change in the code would be quite invasive.

@fvigotti
Author

fvigotti commented Sep 14, 2018

Hi @beorn7, I understand that it could be quite invasive, but having the whole application fail to return metrics (maybe thousands of metrics) because one of hundreds of endpoints sent an invalid one is the wrong behaviour. If you or the other Pushgateway developers are worried about performance or about the change being too invasive, let users enable it with a command-line flag...
I don't know the internals, but O(n²) seems a lot to me. Is there no cheaper way to check the validity of incoming data?

I'll test the new version when it's out and let you know anyway :)

@beorn7
Member

beorn7 commented Sep 14, 2018

First of all, if you have thousands of metrics on your PGW, the odds are you are using it wrong. See https://prometheus.io/docs/practices/pushing/ .

The PGW is fairly light-weight. In general, I would recommend using one per team or service that needs it. That way, collisions of metrics are less likely, and a failure won't affect larger groups.

About the O(n²): this is just in the nature of the problem. Checking whether the total exposition the PGW would offer to a scrape is consistent takes time roughly proportional to the number of metrics pushed, and the check has to be performed on each push. Thus, if you double the number of pushers, you double the number of checks, and each check takes twice as long because twice as many metrics are pushed; the computational effort increases by a factor of four. One might improve this with hash tables tracking the already pushed metrics, but that makes the code even more involved.

In any case, a rejected push would need to be represented in the scrape in some way. You must be able to alert on it.

@beorn7 beorn7 changed the title problem : pushgateway return error if some metrics is wrongly formatted.. Reject inconsistent metrics at push time (and create a synthetic metric about it) Sep 14, 2018
@beorn7
Member

beorn7 commented Sep 14, 2018

#94 is related but is more concerned with how to handle the case where inconsistent metrics have already made it into the Pushgateway. If this issue results in not letting that happen in the first place, #94 would be obsolete.

@ananth-at-camphor-networks

Is there any update on this? I tried with the latest version available now (0.9.1) but get the same set of errors. I do not understand why the GET on /metrics has to fail at all; it is the push that should fail, not the GET. Another option would be to provide a GET endpoint; if so, clients could call GET before attempting the push... (I still don't understand what O(n²) refers to here.) Please do not fail GET on /metrics. Scraping stops working because of this issue (and it was working a few months ago).

@beorn7
Member

beorn7 commented Aug 12, 2019

The state is that the problem is well understood and a solution is sketched out in my head.
Actually implementing it is pretty high up on my list. This may mean it will happen this week, but I cannot promise anything.

beorn7 added a commit that referenced this issue Sep 19, 2019
This is finally a solution to #202 (and as a byproduct to #94).

The original wisdom used to be that checking pushed metrics during the
push should not be done because it is inherently an O(n*m) problem: n
is the number of metrics in the PGW, and m is the number of pushes in
a certain time-frame. Upon each push, a check has to run, and that
check has a cost linear with the total number of metrics. Thus, if you
scale up something by a factor of 2, you can expect twice as many
pushers pushing in total twice as many metrics, in twice as many push
operations, resulting in 2*2=4 times the cost of checking
metrics. Since we didn't want to let pushers wait for a potentially
expensive check, we just accepted the push, as long as it was
parseable, and deferred the error detection to scrapes.

The result was that a single inconsistent push would result in an
error when scraping the PGW. The bright side here is that the problem
will be spotted quite easily, and there is a strong incentive to fix
it. The problematic part is that a single rogue pusher is affecting
all systems that depend on the same PGW, possibly at inconvenient
times.

In sane use cases for the PGW, scrapes are happening more often than
pushes. Let's say a database backup pushes its success or failure
outcome once per day. Scrapes happen a couple of times per
minute. While the whole O(n*m) calculation above is not wrong, the
reality is that the scrapes will inflict the cost anyway. If the check
is too expensive per push, it will be even more so per scrape. Also,
pushers shouldn't be too sensitive to slightly enlarged push
times. Finally, if a PGW is so heavily loaded with metrics that the
check actually takes prohibitively long, we can be quite sure that
this PGW is abused for something it was not meant for (like turning
Prometheus into a push-based system).

The solution is conceptually quite easy: On each push, first feed the
push into a copy of the current metrics state (which has a certain
cost, but that is done anyway whenever somebody looks at the web UI)
and then simulate a scrape. In that way, we can reuse the validation
power of prometheus/client_golang. We don't even need to re-architect
the whole framework of queueing pushes and deletions. We only need to
talk back via an optional channel. Pushes will now get a 200 or a 400
rather than the "(almost) always 202" response we had before.

Easy, isn't it?

Unfortunately, the easy change became _way_ more invasive than I
anticipated. Here is an overview of the changes in this commit (also
see the changes in README.md, which explain the usage side of things):

- Previously, the PUT request was executed in two parts: First, delete
  all metrics in the group, then add the newly pushed metrics to the
  group. That's now a problem: If the second part fails, it will leave
  behind an empty group rather than the state before the whole PUT
  request, as the user will certainly expect. Thus, the operation had
  to be made atomic. To accomplish that, the `storage.WriteRequest`
  now has an additional field `Replace` to mark replacement of the
  whole group if true. This field is in addition to the back channel
  mentioned above (called `Done` in the `WriteRequest`); a rough
  sketch of the resulting struct is shown after this commit message.

- To enable alerting on failed pushes, there is now a new metric
  `push_failure_time_seconds` with the timestamp of the last failed
  push (or 0 if there never was a failed push). Ideally, we would
  rename the existing metric `push_time_seconds` to
  `push_success_time_seconds`, but that would break all existing uses
  of that metric. Thus, I left the name as is, although
  `push_time_seconds` is only updated upon a successful push. (This
  allows alerting on `push_failure_time_seconds > push_time_seconds`
  and tracks at the same time the last successful push.) There are
  some subtle aspects of implementing this correctly, mostly around
  the problem that a successful push should leave a pre-existing
  failure timestamp intact and vice versa.

- Previously, the push handler code already had a bunch of validation
  and sanitation code (block timestamps, sanitize labels) and also
  added the `push_time_seconds` metric. Now that the storage layer is
  doing the consistency check for the pushed metric, it is way cleaner
  to have all the other validation and sanitation code there as
  well. (Arguably, it was already wrong to have too much logic in the
  push handler.) Also, the decision needed to properly set the
  `push_time_seconds` and `push_failure_time_seconds` metrics can now
  only be made after the consistency check, so those metrics have to
  be set in the storage layer anyway. This change of responsibility
  also changed the tests quite a bit. The push handler tests now test
  only that the right method calls happen to the storage, while the
  many tests for validation, consistency checks, adding the push
  timestamp metrics, and proper sanitation are now part of the storage
  tests. (Digression: One might argue that the cleanest solution is a
  separate validation layer in front of a dumb storage layer. Since
  the only implementation of the storage for now is the disk storage
  (and a mock storage for testing), I decided to not make that split
  prematurely.)

- Logging of rejected pushes happens at error level now. Strictly
  speaking, if it happens, it is just a user error and not an error of
  the PGW itself. But experience tells us that users first check the
  logs, and they usually don't run on debug or even info level. I hope
  this will avoid a whole lot of user support questions.
  Groups where the last push has failed are also marked in the web UI.

Signed-off-by: beorn7 <[email protected]>
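
The "simulate a scrape" approach described in the commit message can be illustrated with the public prometheus/client_golang API. The following is only a minimal sketch, not the actual Pushgateway code path; `checkConsistency` and the metric name are invented for illustration. It relies on `prometheus.Gatherers`, whose `Gather` method merges several gatherers and reports duplicate or inconsistent metrics via its error return:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// checkConsistency simulates a scrape over the combination of the current
// state and the newly pushed metrics. Inconsistencies (e.g. the same metric
// name registered with conflicting types) are reported via the error.
func checkConsistency(current, pushed prometheus.Gatherer) error {
	_, err := prometheus.Gatherers{current, pushed}.Gather()
	return err
}

func main() {
	current := prometheus.NewRegistry()
	pushed := prometheus.NewRegistry()

	// The same metric name with conflicting types provokes a rejection.
	current.MustRegister(prometheus.NewCounter(prometheus.CounterOpts{
		Name: "job_duration_seconds", Help: "Duration of the job.",
	}))
	pushed.MustRegister(prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "job_duration_seconds", Help: "Duration of the job.",
	}))

	if err := checkConsistency(current, pushed); err != nil {
		fmt.Println("push would be rejected:", err) // the real PGW responds with 400 here
	}
}
```

In the Pushgateway itself, the merged state is built from the stored groups plus the pushed group, and the push is only applied if the simulated scrape succeeds.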
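
The atomic-PUT plumbing from the first bullet point could be pictured roughly as below. This is a hypothetical sketch of the `storage.WriteRequest` shape, not the actual source; all field names other than `Replace` and `Done` are assumptions:

```go
package storage // hypothetical package layout, for illustration only

import (
	"time"

	dto "github.com/prometheus/client_model/go"
)

// WriteRequest is a sketch of a push handed to the storage layer.
// Replace and Done correspond to the fields described in the commit
// message above; the remaining fields are assumptions.
type WriteRequest struct {
	Labels         map[string]string            // grouping labels of the push
	Timestamp      time.Time                    // when the push arrived
	MetricFamilies map[string]*dto.MetricFamily // the pushed metrics
	Replace        bool                         // true for PUT: atomically replace the whole group
	Done           chan error                   // back channel: nil on success, non-nil on rejection
}
```

In the push handler, receiving from `Done` then decides between the 200 and 400 responses mentioned above.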
@ananth-at-camphor-networks

Thanks so much. I very much appreciate it!

beorn7 added a commit that referenced this issue Sep 20, 2019
@beorn7
Member

beorn7 commented Sep 20, 2019

The fix is in master. It will be released once #287 is done. But please feel free to build from master and try it out. It's a fairly invasive change, and some pre-release feedback would be appreciated.

@beorn7 beorn7 closed this as completed Sep 20, 2019