Check pushed metrics immediately and reject them if inconsistent
This is finally a solution to #202 (and as a byproduct to #94).

The original wisdom used to be that checking pushed metrics during the
push should not be done because it is inherently an O(n*m) problem: n
is the number of metrics in the PGW, and m is the number of pushes in
a certain time-frame. Upon each push, a check has to run, and that
check has a cost linear with the total number of metrics. Thus, if you
scale up something by a factor of 2, you can expect twice as many
pushers pushing in total twice as many metrics, in twice as many push
operations, resulting in 2*2=4 times the cost of checking
metrics. Since we didn't want to let pushers wait for a potentially
expensive check, we just accepted the push, as long as it was
parseable, and deferred the error detection to scrapes.

The result was that a single inconsistent push would result in an
error when scraping the PGW. The bright side here is that the problem
will be spotted quite easily, and there is a strong incentive to fix
it. The problematic part is that a single rogue pusher is affecting
all systems that depend on the same PGW, possibly at inconvenient
times.

In sane use cases for the PGW, scrapes are happening more often than
pushes. Let's say a database backup pushes its success or failure
outcome once per day. Scrapes happen a couple of times per
minute. While the whole O(n*m) calculation above is not wrong, the
reality is that the scrapes will inflict the cost anyway. If the check
is too expensive per push, it will be even more so per scrape. Also,
pushers shouldn't be too sensitive to slightly enlarged push
times. Finally, if a PGW is so heavily loaded with metrics that the
check actually takes prohibitively long, we can be quite sure that
this PGW is abused for something it was not meant for (like turning
Prometheus into a push-based system).
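
The cost argument above can be put into numbers. The following is only a
back-of-the-envelope sketch: the helper names, the figure of 1000 metrics, and
the reading of "a couple of times per minute" as two scrapes per minute are all
assumptions made for illustration.

```go
package main

import "fmt"

// checkCost models one consistency check as linear in the number of metrics
// currently held by the PGW (the "n" of the O(n*m) argument).
func checkCost(metrics int) int { return metrics }

// totalCost is the cost of running `checks` checks over `metrics` metrics,
// i.e. the n*m product.
func totalCost(metrics, checks int) int { return checkCost(metrics) * checks }

func main() {
	const metrics = 1000              // assumed size of the PGW state
	const pushesPerDay = 1            // e.g. the daily database backup
	const scrapesPerDay = 2 * 60 * 24 // assuming two scrapes per minute

	// Doubling both the metrics and the pushes quadruples the total cost.
	factor := totalCost(2*metrics, 2*pushesPerDay) / totalCost(metrics, pushesPerDay)
	fmt.Println("cost factor after scaling up by 2:", factor)

	// The same check done on every scrape dwarfs the check done on every push.
	fmt.Println("check cost per day caused by pushes: ", totalCost(metrics, pushesPerDay))
	fmt.Println("check cost per day caused by scrapes:", totalCost(metrics, scrapesPerDay))
}
```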

The solution is conceptually quite easy: On each push, first feed the
push into a copy of the current metrics state (which has a certain
cost but it is anyway done whenever somebody looks at the web UI) and
then simulate a scrape. In that way, we can reuse the validation power
of prometheus/client_golang. We don't even need to rearchitect the
whole framework of queueing pushes and deletions. We only need to talk
back via an optional channel. Pushes will now get a 200 or a 400
rather than the "(almost) always 202" response we had before.
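
A minimal sketch of that flow, using only the standard library. All types and
function names here (`family`, `state`, `writeRequest`, `process`, `check`) are
hypothetical stand-ins and not the actual Pushgateway code. In particular,
`check` is a toy replacement for the simulated scrape: it only verifies that a
metric name has the same type in every group, whereas the real validation is
delegated to prometheus/client_golang.

```go
package main

import (
	"errors"
	"fmt"
)

// family is a stripped-down metric family: only a name and a type.
type family struct {
	Name string
	Type string // e.g. "counter" or "gauge"
}

// state maps a grouping key to the metric families pushed to that group.
type state map[string][]family

// writeRequest is a simplified queued push. Done is the optional back
// channel; if it is nil, nobody is waiting for the result.
type writeRequest struct {
	Group    string
	Families []family
	Done     chan error
}

// copyState returns a copy, so a rejected push never touches the real state.
func copyState(s state) state {
	c := make(state, len(s))
	for g, fs := range s {
		c[g] = append([]family(nil), fs...)
	}
	return c
}

// check stands in for the simulated scrape.
func check(s state) error {
	seen := map[string]string{}
	for _, fs := range s {
		for _, f := range fs {
			if t, ok := seen[f.Name]; ok && t != f.Type {
				return errors.New("inconsistent type for metric " + f.Name)
			}
			seen[f.Name] = f.Type
		}
	}
	return nil
}

// process applies the push to a copy, checks the copy, reports the result on
// Done, and returns the new state only if the check passed. For brevity, the
// pushed families simply replace the whole group.
func process(s state, wr writeRequest) state {
	candidate := copyState(s)
	candidate[wr.Group] = wr.Families
	err := check(candidate)
	if wr.Done != nil {
		wr.Done <- err
	}
	if err != nil {
		return s
	}
	return candidate
}

// statusFor maps the check result to the HTTP status sent to the pusher.
func statusFor(err error) int {
	if err != nil {
		return 400
	}
	return 200
}

func main() {
	s := state{"job=a": {{Name: "up_total", Type: "counter"}}}
	done := make(chan error, 1) // buffered, so process does not block
	s = process(s, writeRequest{
		Group:    "job=b",
		Families: []family{{Name: "up_total", Type: "gauge"}},
		Done:     done,
	})
	fmt.Println("status:", statusFor(<-done), "groups:", len(s))
}
```

In this sketch the handler would block on `Done` before answering the request,
which is what makes the immediate 200 or 400 possible.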

Easy, isn't it?

Unfortunately, the easy change became _way_ more invasive than I
anticipated. Here is an overview of the changes in this commit (also
see the changes in README.md, which explain the usage side of things):

- Previously, the PUT request was executed in two parts: First, delete
  all metrics in the group, then add the newly pushed metrics to the
  group. That's now a problem. If the 2nd part fails, it will leave
  behind an empty group rather than the state before the whole PUT
  request, as the user will certainly expect. Thus, the operation had
  to be made atomic. To accomplish that, the `storage.WriteRequest`
  now has an additional field `Replace` to mark replacement of the
  whole group if true. This field is in addition to the back channel
  mentioned above (called `Done` in the `WriteRequest`).

- To enable alerting on failed pushes, there is now a new metric
  `push_failure_time_seconds` with the timestamp of the last failed
  push (or 0 if there never was a failed push). Ideally, we would
  rename the existing metric `push_time_seconds` to
  `push_success_time_seconds`, but that would break all existing uses
  of that metric. Thus, I left the name as is, although
  `push_time_seconds` is only updated upon a successful push. (This
  allows alerting on `push_failure_time_seconds > push_time_seconds`
  and tracks at the same time the last successful push.) There are
  some subtle aspects of implementing this correctly, mostly around
  the problem that a successful push should leave a pre-existing
  failure timestamp intact and vice versa.

- Previously, the push handler code already had a bunch of validation
  and sanitization code (block timestamps, sanitize labels) and also
  added the `push_time_seconds` metric. Now that the storage layer is
  doing the consistency check for the pushed metrics, it is way cleaner
  to have all the other validation and sanitization code there as
  well. (Arguably, it was already wrong to have too much logic in the
  push handler.) Also, the decision needed to properly set the
  `push_time_seconds` and `push_failure_time_seconds` metrics can now
  only be made after the consistency check, so those metrics have to
  be set in the storage layer anyway. This change of responsibility
  also changed the tests quite a bit. The push handler tests now test
  only that the right method calls happen to the storage, while the
  many tests for validation, consistency checks, adding the push
  timestamp metrics, and proper sanitization are now part of the storage
  tests. (Digression: One might argue that the cleanest solution is a
  separate validation layer in front of a dumb storage layer. Since
  the only implementation of the storage for now is the disk storage
  (and a mock storage for testing), I decided to not make that split
  prematurely.)

- Logging of rejected pushes happens at error level now. Strictly
  speaking, a rejected push is just a user error and not an error of
  the PGW itself. But experience tells us that users first check the
  logs, and they usually don't run on debug or even info level. I hope
  this will avoid a whole lot of user support questions.
  Groups where the last push has failed are also marked in the web UI.
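
The atomic replacement and the timestamp handling described in the list above
can be sketched as follows. This is an illustration under simplifying
assumptions, not the storage code of this commit: the `group` type, the `apply`
function, and its parameters are hypothetical, and the outcome of the
consistency check is passed in as a plain boolean.

```go
package main

import "fmt"

// group is a simplified metric group: pushed values by metric name plus the
// two timestamps added by the Pushgateway.
type group struct {
	Metrics         map[string]float64
	PushTime        float64 // push_time_seconds, last successful push
	PushFailureTime float64 // push_failure_time_seconds, 0 if no push ever failed
}

// apply handles one push in a single step. replace corresponds to a PUT
// (replace the whole group), valid is the outcome of the consistency check,
// and now is the current Unix timestamp.
func apply(old group, pushed map[string]float64, replace, valid bool, now float64) group {
	if !valid {
		// Rejected push: metrics and success timestamp stay as they were,
		// only the failure timestamp moves.
		old.PushFailureTime = now
		return old
	}
	merged := map[string]float64{}
	if !replace {
		for name, v := range old.Metrics {
			merged[name] = v
		}
	}
	for name, v := range pushed {
		merged[name] = v
	}
	// Successful push: a pre-existing failure timestamp is left intact.
	return group{Metrics: merged, PushTime: now, PushFailureTime: old.PushFailureTime}
}

// pushFailing mirrors the alert condition
// push_failure_time_seconds > push_time_seconds.
func pushFailing(g group) bool { return g.PushFailureTime > g.PushTime }

func main() {
	g := apply(group{}, map[string]float64{"a": 1, "b": 2}, true, true, 100)
	g = apply(g, map[string]float64{"c": 3}, true, false, 200) // rejected PUT
	fmt.Println(len(g.Metrics), g.PushTime, g.PushFailureTime, pushFailing(g))
	g = apply(g, map[string]float64{"c": 3}, true, true, 300) // accepted PUT
	fmt.Println(len(g.Metrics), g.PushTime, g.PushFailureTime, pushFailing(g))
}
```

Because the rejected PUT returns the old group untouched apart from the failure
timestamp, it cannot leave an empty group behind, which is the failure mode of
the previous two-step delete-then-add approach.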

Signed-off-by: beorn7 <[email protected]>
beorn7 committed Sep 20, 2019
1 parent 16ec84f commit a112ae5
Showing 11 changed files with 1,199 additions and 395 deletions.
58 changes: 41 additions & 17 deletions README.md
@@ -138,7 +138,7 @@ Examples:

curl -X DELETE http://pushgateway.example.org:9091/metrics/job/some_job
* Delete all metrics in all groups (requires to enable the admin api`--web.enable-admin-api`):
* Delete all metrics in all groups (requires to enable the admin API via the command line flag `--web.enable-admin-api`):

curl -X PUT http://pushgateway.example.org:9091/api/v1/admin/wipe

@@ -202,10 +202,12 @@ timestamps.
If you think you need to push a timestamp, please see [When To Use The
Pushgateway](https://prometheus.io/docs/practices/pushing/).

In order to make it easier to alert on pushers that have not run recently, the
Pushgateway will add in a metric `push_time_seconds` with the Unix timestamp
of the last `POST`/`PUT` to each group. This will override any pushed metric by
that name.
In order to make it easier to alert on failed pushers or those that have not
run recently, the Pushgateway will add in the metrics `push_time_seconds` and
`push_failure_time_seconds` with the Unix timestamp of the last successful and
failed `POST`/`PUT` to each group. This will override any pushed metric by that
name. A value of zero for either metric implies that the group has never seen a
successful or failed `POST`/`PUT`.

## API

@@ -277,20 +279,24 @@ header. (Use the value `application/vnd.google.protobuf;
proto=io.prometheus.client.MetricFamily; encoding=delimited` for protocol
buffers, otherwise the text format is tried as a fall-back.)

The response code upon success is always 202 (even if the same
grouping key has never been used before, i.e. there is no feedback to
the client if the push has replaced an existing group of metrics or
created a new one).
The response code upon success is either 200 or 400. A 200 response implies a
successful push, either replacing an existing group of metrics or creating a
new one. A 400 response can happen if the request is malformed or if the pushed
metrics are inconsistent with metrics pushed to other groups or collide with
metrics of the Pushgateway itself. An explanation is returned in the body of
the response and logged on error level.

In rare cases, it is possible that the Pushgateway ends up with an inconsistent
set of metrics already pushed. In that case, new pushes are also rejected as
inconsistent even if the culprit is metrics that were pushed earlier. Delete
the offending metrics to get out of that situation.

_If using the protobuf format, do not send duplicate MetricFamily
proto messages (i.e. more than one with the same name) in one push, as
they will overwrite each other._

A successfully finished request means that the pushed metrics are
queued for an update of the storage. Scraping the push gateway may
still yield the old results until the queued update is
processed. Neither is there a guarantee that the pushed metrics are
persisted to disk. (A server crash may cause data loss. Or the push
Note that the Pushgateway doesn't provide any strong guarantees that the pushed
metrics are persisted to disk. (A server crash may cause data loss. Or the push
gateway is configured to not persist to disk at all.)

A `PUT` request with an empty body effectively deletes all metrics with the
@@ -351,12 +357,14 @@ The default port the Pushgateway is listening to is 9091. The path looks like:
The Pushgateway exposes the following metrics via the configured
`--web.telemetry-path` (default: `/metrics`):
- The pushed metrics.
- For each pushed group, a metric `push_time_seconds` as explained above.
- For each pushed group, a metric `push_time_seconds` and
`push_failure_time_seconds` as explained above.
- The usual metrics provided by the [Prometheus Go client library](https://github.com/prometheus/client_golang), i.e.:
- `process_...`
- `go_...`
- `promhttp_metric_handler_requests_...`
- A number of metrics specific to the Pushgateway, as documented by the example scrape below.
- A number of metrics specific to the Pushgateway, as documented by the example
scrape below.

```
# HELP pushgateway_build_info A metric with a constant '1' value labeled by version, revision, branch, and goversion from which pushgateway was built.
@@ -385,7 +393,23 @@ pushgateway_http_requests_total{code="202",handler="push",method="post"} 6
pushgateway_http_requests_total{code="400",handler="push",method="post"} 2
```


### Alerting on failed pushes

It is in general a good idea to alert on `push_time_seconds` being much farther
behind than expected. This will catch both failed pushes as well as pushers
being down completely.

To detect failed pushes much earlier, alert on `push_failure_time_seconds >
push_time_seconds`.

Pushes can also fail because they are malformed. In this case, they never reach
any metric group and therefore won't set any `push_failure_time_seconds`
metrics. Those pushes are still counted as
`pushgateway_http_requests_total{code="400",handler="push"}`. You can alert on
the `rate` of this metric, but you have to inspect the logs to identify the
offending pusher.

## Development

The normal binary embeds the web files in the `resources` directory.
8 changes: 4 additions & 4 deletions asset/assets_vfsdata.go

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion go.mod
@@ -15,7 +15,7 @@ require (
github.com/shurcooL/httpfs v0.0.0-20190707220628-8d4bc4ba7749 // indirect
github.com/shurcooL/vfsgen v0.0.0-20181202132449-6a9ea43bcacd
golang.org/x/sys v0.0.0-20190909082730-f460065e899a // indirect
golang.org/x/tools v0.0.0-20190731214159-1e85ed8060aa // indirect
golang.org/x/tools v0.0.0-20190919031856-7460b8e10b7e // indirect
gopkg.in/alecthomas/kingpin.v2 v2.2.6
)

11 changes: 9 additions & 2 deletions go.sum
@@ -25,6 +25,7 @@ github.com/go-logfmt/logfmt v0.4.0 h1:MP4Eh7ZCb31lleYCFuwm0oe4/YGak+5l1vA2NOE80n
github.com/go-logfmt/logfmt v0.4.0/go.mod h1:3RMwSq7FuexP4Kalkev3ejPJsZTpXXBr9+V4qmtdjCk=
github.com/go-stack/stack v1.8.0 h1:5SgMzNM5HxrEjV0ww2lTmX6E2Izsfxas4+YHWRs3Lsk=
github.com/go-stack/stack v1.8.0/go.mod h1:v0f6uXyyMGvRgIKkXu+yp6POWl0qKG85gN/melR3HDY=
github.com/gogo/protobuf v1.1.1 h1:72R+M5VuhED/KujmZVcIquuo8mBgX4oVda//DQb3PXo=
github.com/gogo/protobuf v1.1.1/go.mod h1:r8qH/GZQm5c6nD/R0oafs1akxWv10x8SbQlK7atdtwQ=
github.com/golang/protobuf v1.2.0 h1:P3YflyNX/ehuJFLhxviNdFxQPkGK5cDcApsge1SqnvM=
github.com/golang/protobuf v1.2.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
@@ -95,10 +96,12 @@ golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2 h1:VklqNMn3ovrHsnt90Pveol
golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2/go.mod h1:djNgcEr1/C05ACkg1iLfiJU5Ep61QUkGW8qpdssI0+w=
golang.org/x/net v0.0.0-20181114220301-adae6a3d119a/go.mod h1:mL1N/T3taQHkDXs73rZJwtUhF3w3ftmwwsq0BUmARs4=
golang.org/x/net v0.0.0-20190613194153-d28f0bde5980/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/net v0.0.0-20190620200207-3b0461eec859 h1:R/3boaszxrf1GEUWTVDzSKVwLmSJpwZ1yqXm8j0v2QI=
golang.org/x/net v0.0.0-20190620200207-3b0461eec859/go.mod h1:z5CRVTTTmAJ677TzLLGU+0bjPO0LkuOLi4/5GtJWs/s=
golang.org/x/sync v0.0.0-20181108010431-42b317875d0f h1:Bl/8QSvNqXvPGPGXa2z5xUTmV7VDcZyvRZ+QQXkXTZQ=
golang.org/x/sync v0.0.0-20181108010431-42b317875d0f/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20181221193216-37e7f081c4d4/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sync v0.0.0-20190423024810-112230192c58 h1:8gQV6CLnAEikrhgkHFbMAEhagSSnXWGV915qUMm9mrU=
golang.org/x/sync v0.0.0-20190423024810-112230192c58/go.mod h1:RxMgew5VJxzue5/jJTE5uejpjVlOe/izrB70Jof72aM=
golang.org/x/sys v0.0.0-20180905080454-ebe1bf3edb33/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
golang.org/x/sys v0.0.0-20181116152217-5ac8a444bdc5/go.mod h1:STP8DvDyc/dI5b8T5hshtkjS+E42TnysNCUPdjciGhY=
@@ -107,10 +110,14 @@ golang.org/x/sys v0.0.0-20190801041406-cbf593c0f2f3 h1:4y9KwBHBgBNwDbtu44R5o1fdO
golang.org/x/sys v0.0.0-20190801041406-cbf593c0f2f3/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/sys v0.0.0-20190909082730-f460065e899a h1:mIzbOulag9/gXacgxKlFVwpCOWSfBT3/pDyyCwGA9as=
golang.org/x/sys v0.0.0-20190909082730-f460065e899a/go.mod h1:h1NjWce9XRLGQEsW7wpKNCjG9DtNlClVuFLEZdDNbEs=
golang.org/x/text v0.3.0 h1:g61tztE5qeGQ89tm6NTjjM9VPIm088od1l6aSorWRWg=
golang.org/x/text v0.3.0/go.mod h1:NqM8EUOU14njkJ3fqMW+pc6Ldnwhi/IjpwHt7yyuwOQ=
golang.org/x/tools v0.0.0-20190731214159-1e85ed8060aa h1:kwa/4M1dbmhZqOIqYiTtbA6JrvPwo1+jqlub2qDXX90=
golang.org/x/tools v0.0.0-20190731214159-1e85ed8060aa/go.mod h1:jcCCGcm9btYwXyDqrUWc6MKQKKGJCWEQ3AfLSRIbEuI=
golang.org/x/tools v0.0.0-20190919031856-7460b8e10b7e h1:DxffoHYXmce3WTEBU/6/5bBSV7wmPSvT+atzBfv8hJI=
golang.org/x/tools v0.0.0-20190919031856-7460b8e10b7e/go.mod h1:b+2E5dAYhXwXZwtnZ6UAqBI28+e2cm9otk0dWdXHAEo=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7 h1:9zdDQZ7Thm29KFXgAX/+yaf3eVbP7djjWp/dXAppNCc=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
gopkg.in/alecthomas/kingpin.v2 v2.2.6 h1:jMFz6MfLP0/4fUyZle81rXUoxOBFi19VUFKVDOQfozc=
gopkg.in/alecthomas/kingpin.v2 v2.2.6/go.mod h1:FMv+mEhP44yOT+4EoQTLFTRgOQ1FBLkstjWtayDeSgw=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v2 v2.2.1 h1:mUhvW9EsL+naU5Q3cakzfE91YhliOondGd6ZrsDBHQE=
gopkg.in/yaml.v2 v2.2.1/go.mod h1:hI93XBmqTisBFMUTm0b8Fm+jr3Dg1NNxqwp+5A1VGuI=