[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744
[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744qinghui-xu wants to merge 2 commits intoapache:masterfrom
Conversation
Previously we are relying on random uuid suffix of `job` label to avoid metric collisions among taskmangers. This makes it difficult to aggregate over `job` label. Now we use the uuid as grouping key instead, while respecting prometheus guide line on `job` label usage.
|
@flinkbot run azure |
| */ | ||
| @PublicEvolving | ||
| public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter implements Scheduled { | ||
| public static final String REPORTER_ID_GROUPPING_KEY = |
There was a problem hiding this comment.
Minor spelling nit: I think we want this to be called `REPORTER_ID_GROUPING_KEY.
| .defaultValue("") | ||
| .withDescription("The job name under which metrics will be pushed"); | ||
|
|
||
| public static final ConfigOption<Boolean> RANDOM_JOB_NAME_SUFFIX = |
There was a problem hiding this comment.
Are we sure we want to remove this configuration entirely? I’d imagine there are existing jobs and/or users already relying on it. This would also require updating the associated documentation and, at a minimum, feels like something that should be deprecated before being removed.
There was a problem hiding this comment.
Good point. Let me deprecate it in the PR. I'm wondering should I also change the default value to false while keeping it. I'll use this flag as a switch: if job is not suffixed, a "flink_prometheus_push_gateway_reporter_id" grouping key will be used.
There was a problem hiding this comment.
Sent a followup commit in this sense.
|
|
||
| GroupingKeyMap(Map<String, String> customGroupingKeys) { | ||
| if (customGroupingKeys.containsKey(REPORTER_ID_GROUPPING_KEY)) { | ||
| throw new IllegalArgumentException( |
There was a problem hiding this comment.
We'll want to add a test here to cover the assertion against the user supplying "flink_prometheus_push_gateway_reporter_id" as an explicit grouping (asserting the expected exception).
- Keep `randomJobNameSuffix` config as deprecated - When `randomJobNameSuffix` is disabled (by default), group metrics using random reporter id
What is the purpose of the change
Previously we are relying on random uuid suffix of
joblabel to avoid metric collisions among taskmangers. This makes it difficult to aggregate overjoblabel.Now we use the uuid as grouping key (
flink_prometheus_push_gateway_reporter_id) instead, while respecting prometheus guide line onjoblabel usage.Brief change log
randomJobNameSuffixflag from the config optionsjoblabel will not be suffixedVerifying this change
Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)Documentation