Check pushed metrics immediately and reject them if inconsistent
This is finally a solution to #202 (and, as a byproduct, to #94).

The original wisdom used to be that checking pushed metrics during the push should not be done because it is inherently an O(n*m) problem: n is the number of metrics in the PGW, and m is the number of pushes in a certain time frame. Upon each push, a check has to run, and that check has a cost linear in the total number of metrics. Thus, if you scale up something by a factor of 2, you can expect twice as many pushers pushing in total twice as many metrics, in twice as many push operations, resulting in 2*2=4 times the cost of checking metrics. Since we didn't want to let pushers wait for a potentially expensive check, we just accepted the push, as long as it was parseable, and deferred the error detection to scrapes.

The result was that a single inconsistent push caused an error when scraping the PGW. The bright side here is that the problem is spotted quite easily, and there is a strong incentive to fix it. The problematic part is that a single rogue pusher affects all systems that depend on the same PGW, possibly at inconvenient times.

In sane use cases for the PGW, scrapes happen more often than pushes. Let's say a database backup pushes its success or failure outcome once per day, while scrapes happen a couple of times per minute. So while the O(n*m) calculation above is not wrong, in reality the scrapes inflict the cost anyway: if the check is too expensive per push, it is even more so per scrape. Also, pushers shouldn't be too sensitive to slightly increased push times. Finally, if a PGW is so heavily loaded with metrics that the check actually takes prohibitively long, we can be quite sure that this PGW is being abused for something it was not meant for (like turning Prometheus into a push-based system).

The solution is conceptually quite easy: On each push, first feed the push into a copy of the current metrics state (which has a certain cost, but one that is incurred anyway whenever somebody looks at the web UI) and then simulate a scrape. In that way, we can reuse the validation power of prometheus/client_golang. We don't even need to re-architect the whole framework of queueing pushes and deletions. We only need to talk back via an optional channel. Pushes now get a 200 or a 400 rather than the "(almost) always 202" response we had before. Easy, isn't it?
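Conceptually, the check amounts to exposing the copied current state plus the newly pushed metric families as Gatherers and letting client_golang's `Gatherers.Gather` run the same consistency checks a real scrape would. A minimal sketch (the helper `checkPush` and its signature are made up for illustration; the actual check lives in the storage layer and may differ in detail):

```go
// Sketch: validate a push by merging it into a copy of the current state
// and simulating a scrape via prometheus/client_golang.
package storage

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// checkPush is a hypothetical helper: it wraps the copied current state and
// the newly pushed metric families as two Gatherers and lets Gatherers.Gather
// perform the usual scrape-time consistency checks (e.g. duplicate metrics,
// conflicting types or help strings).
func checkPush(current, pushed []*dto.MetricFamily) error {
	g := prometheus.Gatherers{
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) {
			return current, nil
		}),
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) {
			return pushed, nil
		}),
	}
	_, err := g.Gather() // A non-nil error means the push gets rejected with a 400.
	return err
}
```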
Unfortunately, the easy change became _way_ more invasive than I anticipated. Here is an overview of the changes in this commit (also see the changes in README.md, which explain the usage side of things):

- Previously, the PUT request was executed in two parts: First, delete all metrics in the group, then add the newly pushed metrics to the group. That is now a problem: if the second part fails, it leaves behind an empty group rather than the state before the whole PUT request, which is what the user will certainly expect. Thus, the operation had to be made atomic. To accomplish that, the `storage.WriteRequest` now has an additional field `Replace` to mark replacement of the whole group if true. This field is in addition to the back channel mentioned above (called `Done` in the `WriteRequest`); a sketch of the extended `WriteRequest` follows after this list.

- To enable alerting on failed pushes, there is now a new metric `push_failure_time_seconds` with the timestamp of the last failed push (or 0 if there never was a failed push). Ideally, we would rename the existing metric `push_time_seconds` to `push_success_time_seconds`, but that would break all existing uses of that metric. Thus, I left the name as is, although `push_time_seconds` is only updated upon a successful push. (This allows alerting on `push_failure_time_seconds > push_time_seconds` and at the same time tracks the last successful push.) There are some subtle aspects of implementing this correctly, mostly around the requirement that a successful push should leave a pre-existing failure timestamp intact and vice versa.

- Previously, the push handler code already had a bunch of validation and sanitation code (block timestamps, sanitize labels) and also added the `push_time_seconds` metric. Now that the storage layer is doing the consistency check for the pushed metrics, it is way cleaner to have all the other validation and sanitation code there as well. (Arguably, it was already wrong to have that much logic in the push handler.) Also, the decision needed to properly set the `push_time_seconds` and `push_failure_time_seconds` metrics can only be made after the consistency check, so those metrics have to be set in the storage layer anyway. This change of responsibility also changed the tests quite a bit: the push handler tests now only test that the right method calls happen on the storage, while the many tests for validation, consistency checks, adding the push timestamp metrics, and proper sanitation are now part of the storage tests. (Digression: One might argue that the cleanest solution is a separate validation layer in front of a dumb storage layer. Since the only storage implementation for now is the disk storage (and a mock storage for testing), I decided not to make that split prematurely.)

- Logging of rejected pushes happens at error level now. Strictly speaking, a rejected push is just a user error and not an error of the PGW itself. But experience tells us that users first check the logs, and they usually don't run on debug or even info level. I hope this will avoid a whole lot of user support questions.

Groups where the last push has failed are also marked in the web UI.
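For illustration, the extended `WriteRequest` from the first bullet could look roughly like the following sketch. Only `Replace` and `Done` are new per the description above; the remaining fields and the exact type of `Done` are assumptions here, not a verbatim copy of the code:

```go
// Sketch of the extended WriteRequest (abridged, field details assumed).
package storage

import (
	"time"

	dto "github.com/prometheus/client_model/go"
)

type WriteRequest struct {
	Labels         map[string]string            // Grouping labels of the push.
	Timestamp      time.Time                    // Time of the push.
	MetricFamilies map[string]*dto.MetricFamily // Pushed metrics, keyed by name.

	// Replace marks replacement of the whole group if true, turning the
	// former "delete group, then add metrics" PUT handling into one
	// atomic operation.
	Replace bool

	// Done is the optional back channel (assumed to be chan error here):
	// if non-nil, the storage layer reports the outcome of the consistency
	// check, so the push handler can answer 200 (accepted) or 400
	// (inconsistent push).
	Done chan error
}
```

On the handler side, a push can then block on `Done` and translate the result into the HTTP status, roughly along the lines of `if err := <-done; err != nil { http.Error(w, err.Error(), http.StatusBadRequest); return }`.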
Signed-off-by: beorn7 <[email protected]>