Check pushed metrics immediately and reject them if inconsistent
This is finally a solution to #202 (and, as a byproduct, to #94).

The original wisdom used to be that checking pushed metrics during the push should not be done because it is inherently an O(n*m) problem: n is the number of metrics in the PGW, and m is the number of pushes in a certain time frame. Upon each push, a check has to run, and that check has a cost linear in the total number of metrics. Thus, if you scale up something by a factor of 2, you can expect twice as many pushers pushing in total twice as many metrics, in twice as many push operations, resulting in 2*2=4 times the cost of checking metrics. Since we didn't want to let pushers wait for a potentially expensive check, we just accepted the push, as long as it was parseable, and deferred the error detection to scrapes.

The result was that a single inconsistent push caused an error when scraping the PGW. The bright side here is that the problem is spotted quite easily, and there is a strong incentive to fix it. The problematic part is that a single rogue pusher affects all systems that depend on the same PGW, possibly at inconvenient times.

In sane use cases for the PGW, scrapes happen more often than pushes. Let's say a database backup pushes its success or failure outcome once per day, while scrapes happen a couple of times per minute. So while the O(n*m) calculation above is not wrong, in reality the scrapes inflict the cost anyway: if the check is too expensive per push, it is even more so per scrape. Also, pushers shouldn't be too sensitive to slightly increased push times. Finally, if a PGW is so heavily loaded with metrics that the check actually takes prohibitively long, we can be quite sure that this PGW is being abused for something it was not meant for (like turning Prometheus into a push-based system).

The solution is conceptually quite easy: On each push, first feed the push into a copy of the current metrics state (which has a certain cost, but one that is incurred anyway whenever somebody looks at the web UI) and then simulate a scrape. In that way, we can reuse the validation power of prometheus/client_golang. We don't even need to re-architect the whole framework of queueing pushes and deletions. We only need to talk back via an optional channel. Pushes now get a 200 or a 400 rather than the "(almost) always 202" response we had before. Easy, isn't it?
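Conceptually, the check amounts to exposing the copied current state plus the newly pushed metric families as Gatherers and letting client_golang's `Gatherers.Gather` run the same consistency checks a real scrape would. A minimal sketch (the helper `checkPush` and its signature are made up for illustration; the actual check lives in the storage layer and may differ in detail):

```go
// Sketch: validate a push by merging it into a copy of the current state
// and simulating a scrape via prometheus/client_golang.
package storage

import (
	"github.com/prometheus/client_golang/prometheus"
	dto "github.com/prometheus/client_model/go"
)

// checkPush is a hypothetical helper: it wraps the copied current state and
// the newly pushed metric families as two Gatherers and lets Gatherers.Gather
// perform the usual scrape-time consistency checks (e.g. duplicate metrics,
// conflicting types or help strings).
func checkPush(current, pushed []*dto.MetricFamily) error {
	g := prometheus.Gatherers{
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) {
			return current, nil
		}),
		prometheus.GathererFunc(func() ([]*dto.MetricFamily, error) {
			return pushed, nil
		}),
	}
	_, err := g.Gather() // A non-nil error means the push gets rejected with a 400.
	return err
}
```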
Unfortunately, the easy change became _way_ more invasive than I anticipated. Here is an overview of the changes in this commit (also see the changes in README.md, which explain the usage side of things):

- Previously, the PUT request was executed in two parts: First, delete all metrics in the group, then add the newly pushed metrics to the group. That is now a problem: if the second part fails, it leaves behind an empty group rather than the state before the whole PUT request, which is what the user will certainly expect. Thus, the operation had to be made atomic. To accomplish that, the `storage.WriteRequest` now has an additional field `Replace` to mark replacement of the whole group if true. This field is in addition to the back channel mentioned above (called `Done` in the `WriteRequest`); a sketch of the extended `WriteRequest` follows after this list.

- To enable alerting on failed pushes, there is now a new metric `push_failure_time_seconds` with the timestamp of the last failed push (or 0 if there never was a failed push). Ideally, we would rename the existing metric `push_time_seconds` to `push_success_time_seconds`, but that would break all existing uses of that metric. Thus, I left the name as is, although `push_time_seconds` is only updated upon a successful push. (This allows alerting on `push_failure_time_seconds > push_time_seconds` and at the same time tracks the last successful push.) There are some subtle aspects of implementing this correctly, mostly around the requirement that a successful push should leave a pre-existing failure timestamp intact and vice versa.

- Previously, the push handler code already had a bunch of validation and sanitation code (block timestamps, sanitize labels) and also added the `push_time_seconds` metric. Now that the storage layer is doing the consistency check for the pushed metrics, it is way cleaner to have all the other validation and sanitation code there as well. (Arguably, it was already wrong to have that much logic in the push handler.) Also, the decision needed to properly set the `push_time_seconds` and `push_failure_time_seconds` metrics can only be made after the consistency check, so those metrics have to be set in the storage layer anyway. This change of responsibility also changed the tests quite a bit: the push handler tests now only test that the right method calls happen on the storage, while the many tests for validation, consistency checks, adding the push timestamp metrics, and proper sanitation are now part of the storage tests. (Digression: One might argue that the cleanest solution is a separate validation layer in front of a dumb storage layer. Since the only storage implementation for now is the disk storage (and a mock storage for testing), I decided not to make that split prematurely.)

- Logging of rejected pushes happens at error level now. Strictly speaking, a rejected push is just a user error and not an error of the PGW itself. But experience tells us that users first check the logs, and they usually don't run on debug or even info level. I hope this will avoid a whole lot of user support questions.

Groups where the last push has failed are also marked in the web UI.
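For illustration, the extended `WriteRequest` from the first bullet could look roughly like the following sketch. Only `Replace` and `Done` are new per the description above; the remaining fields and the exact type of `Done` are assumptions here, not a verbatim copy of the code:

```go
// Sketch of the extended WriteRequest (abridged, field details assumed).
package storage

import (
	"time"

	dto "github.com/prometheus/client_model/go"
)

type WriteRequest struct {
	Labels         map[string]string            // Grouping labels of the push.
	Timestamp      time.Time                    // Time of the push.
	MetricFamilies map[string]*dto.MetricFamily // Pushed metrics, keyed by name.

	// Replace marks replacement of the whole group if true, turning the
	// former "delete group, then add metrics" PUT handling into one
	// atomic operation.
	Replace bool

	// Done is the optional back channel (assumed to be chan error here):
	// if non-nil, the storage layer reports the outcome of the consistency
	// check, so the push handler can answer 200 (accepted) or 400
	// (inconsistent push).
	Done chan error
}
```

On the handler side, a push can then block on `Done` and translate the result into the HTTP status, roughly along the lines of `if err := <-done; err != nil { http.Error(w, err.Error(), http.StatusBadRequest); return }`.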
Signed-off-by: beorn7 <[email protected]>