Reject inconsistent metrics at push time (and create a synthetic metric about it) #202
If this is about inconsistent label dimensions: they should be fine with the new client library, see prometheus/client_golang#417. I'll release a new Pushgateway soon, using the client library with this change. However, if the metric is problematic in other ways, the behavior will still be the same, i.e. the push will succeed, but the scrape will not. That's mostly for performance reasons: checking on each push has O(n²) runtime. In practice, it might not be that bad, but that has to be explored. The change in the code would be quite invasive.
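For context, a typical push with client_golang looks like the minimal sketch below (the Pushgateway address, job name, and metric are hypothetical, not taken from this thread). Under the behavior described above, such a push is accepted as long as it parses, and any inconsistency with already-pushed metrics only surfaces as an error at scrape time.

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Hypothetical metric for a daily backup job.
	completionTime := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "db_backup_last_completion_timestamp_seconds",
		Help: "Timestamp of the last completed database backup.",
	})
	completionTime.SetToCurrentTime()

	// Push replaces all metrics of this group on the Pushgateway.
	if err := push.New("http://localhost:9091", "db_backup").
		Collector(completionTime).
		Push(); err != nil {
		log.Fatal("push failed: ", err)
	}
}
```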
Hi @beorn7, I understand that could be quite invasive, but having the whole application fail to return metrics (maybe some thousands of metrics) because one of hundreds of endpoints sent an invalid one is wrong behaviour. If you or other Pushgateway developers are worried about performance or about the change being too invasive, let users enable it with a command-line flag... I'll test the new version when it's out and let you know anyway :)
First of all, if you have thousands of metrics on your PGW, the odds are you are using it wrong. See https://prometheus.io/docs/practices/pushing/ . The PGW is fairly lightweight. In general, I would recommend using one per team or service that needs it. That way, collisions of metrics are less likely, and a failure won't affect larger groups.

About the O(n²): this is just in the nature of the problem. Checking whether the total exposition the PGW would offer to scrapes is consistent is broadly proportional to the number of metrics pushed, and the check has to be performed on each push. Thus, if you double the number of pushers, you double the number of checks, and each check takes twice as long because twice as many metrics are pushed, so the computational effort increases by a factor of four. One might improve this with some hash tables tracking the already pushed metrics, but that makes the code even more involved.

In any case, a rejected push would need to be represented in the scrape in some way. You must be able to alert on it.
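To make the scaling concrete with a rough back-of-the-envelope model (the symbols here are illustrative, not from the thread): suppose there are $p$ pushers, each pushing $k$ metrics, so the PGW holds about $n \approx p\,k$ metrics, and each of the $p$ pushes in a given time frame triggers a consistency check over all $n$ of them. Then

$$\text{checking cost} \;\propto\; p \cdot n \;=\; p^2 k,$$

so doubling the number of pushers roughly quadruples the cost, which is the quadratic behaviour referred to above.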
Is there any update on this? I tried with the latest version available now (0.9.1) but get the same set of errors. I do not understand the logic at all as to why GET on /metrics has to fail... it is the push that should fail, not the GET. Another option is to provide a GET endpoint; then clients could call GET before attempting the push... (I still don't understand what O(n²) refers to here.) Please do not fail GET on /metrics. Scraping will stop working due to this issue (and it was working a few months ago).
The state is that the problem is well understood and a solution is sketched out in my head.
This is finally a solution to #202 (and as a byproduct to #94).

The original wisdom used to be that checking pushed metrics during the push should not be done because it is inherently an O(n*m) problem: n is the number of metrics in the PGW, and m is the number of pushes in a certain time frame. Upon each push, a check has to run, and that check has a cost linear in the total number of metrics. Thus, if you scale up something by a factor of 2, you can expect twice as many pushers pushing in total twice as many metrics, in twice as many push operations, resulting in 2*2=4 times the cost of checking metrics. Since we didn't want to let pushers wait for a potentially expensive check, we just accepted the push, as long as it was parseable, and deferred the error detection to scrapes.

The result was that a single inconsistent push would result in an error when scraping the PGW. The bright side here is that the problem will be spotted quite easily, and there is a strong incentive to fix it. The problematic part is that a single rogue pusher affects all systems that depend on the same PGW, possibly at inconvenient times.

In sane use cases for the PGW, scrapes happen more often than pushes. Let's say a database backup pushes its success or failure outcome once per day, while scrapes happen a couple of times per minute. While the whole O(n*m) calculation above is not wrong, the reality is that the scrapes will inflict the cost anyway. If the check is too expensive per push, it will be even more so per scrape. Also, pushers shouldn't be too sensitive to slightly enlarged push times. Finally, if a PGW is so heavily loaded with metrics that the check actually takes prohibitively long, we can be quite sure that this PGW is being abused for something it was not meant for (like turning Prometheus into a push-based system).

The solution is conceptually quite easy: on each push, first feed the push into a copy of the current metrics state (which has a certain cost, but that is done anyway whenever somebody looks at the web UI) and then simulate a scrape. In that way, we can reuse the validation power of prometheus/client_golang. We don't even need to re-architect the whole framework of queueing pushes and deletions. We only need to talk back via an optional channel. Pushes will now get a 200 or a 400 rather than the "(almost) always 202" response we had before. Easy, isn't it?

Unfortunately, the easy change became _way_ more invasive than I anticipated. Here is an overview of the changes in this commit (also see the changes in README.md, which explain the usage side of things):

- Previously, the PUT request was executed in two parts: first, delete all metrics in the group, then add the newly pushed metrics to the group. That's now a problem. If the second part fails, it will leave behind an empty group rather than the state before the whole PUT request, as the user will certainly expect. Thus, the operation had to be made atomic. To accomplish that, the `storage.WriteRequest` now has an additional field `Replace` to mark replacement of the whole group if true. This field is in addition to the back channel mentioned above (called `Done` in the `WriteRequest`).

- To enable alerting on failed pushes, there is now a new metric `push_failure_time_seconds` with the timestamp of the last failed push (or 0 if there never was a failed push). Ideally, we would rename the existing metric `push_time_seconds` to `push_success_time_seconds`, but that would break all existing uses of that metric. Thus, I left the name as is, although `push_time_seconds` is only updated upon a successful push. (This allows alerting on `push_failure_time_seconds > push_time_seconds` and at the same time tracks the last successful push.) There are some subtle aspects of implementing this correctly, mostly around the problem that a successful push should leave a pre-existing failure timestamp intact and vice versa.

- Previously, the push handler code already had a bunch of validation and sanitation code (block timestamps, sanitize labels) and also added the `push_time_seconds` metric. Now that the storage layer is doing the consistency check for the pushed metrics, it is much cleaner to have all the other validation and sanitation code there as well. (Arguably, it was already wrong to have that much logic in the push handler.) Also, the decision needed to properly set the `push_time_seconds` and `push_failure_time_seconds` metrics can now only be made after the consistency check, so those metrics have to be set in the storage layer anyway. This change of responsibility also changed the tests quite a bit. The push handler tests now test only that the right method calls happen to the storage, while the many tests for validation, consistency checks, adding the push timestamp metrics, and proper sanitation are now part of the storage tests. (Digression: one might argue that the cleanest solution is a separate validation layer in front of a dumb storage layer. Since the only implementation of the storage for now is the disk storage (and a mock storage for testing), I decided not to make that split prematurely.)

- Logging of rejected pushes happens at error level now. Strictly speaking, a rejected push is just a user error and not an error of the PGW itself. But experience tells us that users first check the logs, and they usually don't run on debug or even info level. I hope this will avoid a whole lot of user support questions.

Groups where the last push has failed are also marked in the web UI.

Signed-off-by: beorn7 <[email protected]>
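To illustrate the "feed the push into a copy of the current state and simulate a scrape" idea described in the commit message above, here is a minimal, hypothetical sketch (not the actual Pushgateway code; the function and variable names are made up) that reuses the consistency checking built into prometheus/client_golang's `Gatherers`:

```go
package sketch

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// checkPush merges the metric families the Pushgateway currently exposes
// with the newly pushed ones and runs a simulated scrape over the result.
// Gatherers.Gather reports duplicate metrics and metric families with
// mismatching types or help strings, i.e. the kind of inconsistency that
// previously only surfaced when the real scrape failed.
func checkPush(current, pushed []*dto.MetricFamily) error {
	merged := prometheus.Gatherers{
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) { return current, nil }),
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) { return pushed, nil }),
	}
	_, err := merged.Gather()
	return err // nil: accept the push (200); non-nil: reject it (400)
}
```

If the simulated scrape fails, the push can be answered with a 400 (for example by sending the error back over a `Done`-style channel, as the commit message describes) and the stored state stays untouched.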
Thanks so much. I very much appreciate it!
Fix is in master. Will be released once #287 is done. But please feel free to build from master and try it out. It's a fairly invasive change, and some pre-release feedback will be appreciated.
and instead it should log the malformed data that it has received and then discard it..
there is already an issue in the collectd exporter, and I've already discussed it there:
prometheus/collectd_exporter#56 (comment)
I have no skill in Go and zero time to help, sorry :(