Skip to content

[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744

Open
qinghui-xu wants to merge 2 commits intoapache:masterfrom
qinghui-xu:FLINK-21309
Open

[FLINK-21309][metrics] Add UUID as grouping key for prom pushgateway#27744
qinghui-xu wants to merge 2 commits intoapache:masterfrom
qinghui-xu:FLINK-21309

Conversation

@qinghui-xu
Copy link
Contributor

What is the purpose of the change

Previously we are relying on random uuid suffix of job label to avoid metric collisions among taskmangers. This makes it difficult to aggregate over job label.
Now we use the uuid as grouping key (flink_prometheus_push_gateway_reporter_id) instead, while respecting prometheus guide line on job label usage.

Brief change log

  • Remove the randomJobNameSuffix flag from the config options
  • job label will not be suffixed
  • Systematically add flink_prometheus_push_gateway_reporter_id = UUID as last grouping key in all case.

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

This change added tests and can be verified as follows:

  • Added unit test to make sure that flink_prometheus_push_gateway_reporter_id = UUID is appended to user provided grouping keys.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)
  • If yes, how is the feature documented? (not applicable)

@flinkbot
Copy link
Collaborator

flinkbot commented Mar 6, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Previously we are relying on random uuid suffix of `job` label to avoid metric collisions among taskmangers.
This makes it difficult to aggregate over `job` label.
Now we use the uuid as grouping key instead, while respecting prometheus guide line on `job` label usage.
@qinghui-xu
Copy link
Contributor Author

@flinkbot run azure

*/
@PublicEvolving
public class PrometheusPushGatewayReporter extends AbstractPrometheusReporter implements Scheduled {
public static final String REPORTER_ID_GROUPPING_KEY =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor spelling nit: I think we want this to be called `REPORTER_ID_GROUPING_KEY.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

.defaultValue("")
.withDescription("The job name under which metrics will be pushed");

public static final ConfigOption<Boolean> RANDOM_JOB_NAME_SUFFIX =
Copy link
Contributor

@rionmonster rionmonster Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure we want to remove this configuration entirely? I’d imagine there are existing jobs and/or users already relying on it. This would also require updating the associated documentation and, at a minimum, feels like something that should be deprecated before being removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Let me deprecate it in the PR. I'm wondering should I also change the default value to false while keeping it. I'll use this flag as a switch: if job is not suffixed, a "flink_prometheus_push_gateway_reporter_id" grouping key will be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sent a followup commit in this sense.


GroupingKeyMap(Map<String, String> customGroupingKeys) {
if (customGroupingKeys.containsKey(REPORTER_ID_GROUPPING_KEY)) {
throw new IllegalArgumentException(
Copy link
Contributor

@rionmonster rionmonster Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to add a test here to cover the assertion against the user supplying "flink_prometheus_push_gateway_reporter_id" as an explicit grouping (asserting the expected exception).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Mar 9, 2026
- Keep `randomJobNameSuffix` config as deprecated
- When `randomJobNameSuffix` is disabled (by default), group metrics using random reporter id
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants