Skip to content

Conversation

@sohurdc
Copy link

@sohurdc sohurdc commented Oct 29, 2025

What is the purpose of the change

This pull request enhances the Prometheus reporter to export the lastCheckpointExternalPath metric as an info-style metric, making it compatible with Prometheus and VictoriaMetrics.

Current Problem:

  • The lastCheckpointExternalPath metric is currently exported as a string-valued Gauge
  • Prometheus and VictoriaMetrics only support numeric values, making it impossible to store checkpoint paths
  • Users must use additional storage systems (e.g., InfluxDB) to track checkpoint paths, increasing operational complexity

Solution:

  • Export lastCheckpointExternalPath as a Prometheus info-style metric with _info suffix
  • Store the checkpoint path in a path label instead of as a metric value
  • Set the metric value to 1.0 (following Prometheus convention for info metrics)

This approach follows Prometheus best practices (similar to node_uname_info from node_exporter) and enables users to:

  1. Store checkpoint paths directly in Prometheus/VictoriaMetrics
  2. Join checkpoint paths with other checkpoint metrics via PromQL
  3. Create monitoring dashboards and alerts based on checkpoint paths

Brief change log

  • Added CHECKPOINT_PATH_METRIC_NAME constant to identify the checkpoint path metric
  • Modified createCollector() method in AbstractPrometheusReporter to detect and handle checkpoint path metrics specially
  • Added CheckpointPathInfoCollector inner class to export checkpoint path as an info-style metric
    • Appends _info suffix to the metric name
    • Stores checkpoint path in a path label
    • Sets metric value to 1.0
    • Handles null and empty path values gracefully
  • Added comprehensive unit tests in CheckpointPathInfoCollectorTest with 4 test cases

Verifying this change

This change added tests and can be verified as follows:

Unit Tests:

  • Added CheckpointPathInfoCollectorTest with 4 test cases:
    • testCheckpointPathExportedAsInfoMetric: Verifies checkpoint path is correctly exported as an info metric with path in label
    • testNullCheckpointPathReturnsEmptyList: Verifies null path values are handled correctly (returns empty list)
    • testEmptyCheckpointPathReturnsEmptyList: Verifies empty string path values are handled correctly
    • testCheckpointPathWithSpecialCharacters: Verifies special characters in paths (e.g., S3 URLs with query parameters) are preserved correctly

Integration Verification:
All existing Prometheus reporter tests pass (27/27 tests):

  • PrometheusReporterTest: 14 tests
  • PrometheusReporterTaskScopeTest: 5 tests
  • PrometheusPushGatewayReporterTest: 4 tests
  • CheckpointPathInfoCollectorTest: 4 tests (new)

Manual Verification:
The change can be manually verified by:

  1. Starting a Flink cluster with Prometheus reporter enabled
  2. Running a job with checkpointing enabled
  3. Querying the Prometheus metrics endpoint (e.g., curl http://localhost:9249/metrics)
  4. Verifying the output contains:
    flink_jobmanager_job_lastCheckpointExternalPath_info{job_id="...",job_name="...",path="hdfs://..."} 1.0
    
  5. Using PromQL to join with other metrics:
    flink_jobmanager_job_lastCheckpointSize 
      * on(job_id) group_left(path) 
      flink_jobmanager_job_lastCheckpointExternalPath_info
    

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no (only affects metric reporting, not checkpoint functionality)
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

Documentation Details:

  • Comprehensive JavaDoc added to CheckpointPathInfoCollector class explaining:
    • Purpose: Export checkpoint path as Prometheus info-style metric
    • Behavior: Path stored in label, value always 1.0
    • Example output format
  • Inline code comments explaining the special handling logic
  • Unit test documentation demonstrating usage patterns

Additional Documentation (if requested):
If the community requires, I can add documentation to docs/content/docs/deployment/metric_reporters.md explaining:

  • The info-style metric format for checkpoint paths
  • PromQL query examples for joining with other metrics
  • Use cases for monitoring and alerting

…etheus info-style metric

This commit enhances the Prometheus reporter to export lastCheckpointExternalPath
as an info-style metric, making it compatible with Prometheus and VictoriaMetrics.

Changes:
- Add CheckpointPathInfoCollector to handle checkpoint path metrics
- Export checkpoint path as label value with metric value 1.0
- Add _info suffix to follow Prometheus naming convention
- Add comprehensive unit tests (4 test cases)

The checkpoint path is now exported as:
flink_jobmanager_job_lastCheckpointExternalPath_info{...,path=\"hdfs://...\"} 1.0

This allows users to:
1. Store checkpoint paths in Prometheus/VictoriaMetrics
2. Join checkpoint path with other metrics via PromQL
3. Create dashboards and alerts based on checkpoint p
@flinkbot
Copy link
Collaborator

flinkbot commented Oct 29, 2025

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

if (scopedMetricName.endsWith(CHECKPOINT_PATH_METRIC_NAME)) {
// Add _info suffix to follow Prometheus naming convention for info metrics
String infoMetricName = scopedMetricName + "_info";
@SuppressWarnings("unchecked")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the reason to suppress unchecked warnings here. would we not want to log out these warnings?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review! In my opinion, the @SuppressWarnings("unchecked") is necessary here because:

  1. Type Safety at Runtime: The Flink metrics system uses Gauge<?> as the generic type, but we need to cast it to Gauge<String> to access the checkpoint path value. This is a safe cast in this specific context because we've already verified that this metric is lastCheckpointExternalPath, which is always a string-valued gauge.

  2. Why Not Log: We don't want to log this as a warning because:

    • This is an expected and intentional cast, not an error condition
    • The metric name check (scopedMetricName.endsWith(CHECKPOINT_PATH_METRIC_NAME)) ensures we only perform this cast for the correct metric type
    • If the cast fails (which shouldn't happen in practice), it will throw a ClassCastException that will be caught and logged by the caller
  3. Alternative Considered: We could use instanceof check, but it would be redundant since the metric name already uniquely identifies this as a string gauge.

If you prefer, I can add a comment explaining this, or we could add an instanceof check for extra safety:

if (scopedMetricName.endsWith(CHECKPOINT_PATH_METRIC_NAME)) {
    if (metric instanceof Gauge) {
        @SuppressWarnings("unchecked")
        Gauge<String> pathGauge = (Gauge<String>) metric;
        // ... rest of the code
    } else {
        log.warn("Expected Gauge for checkpoint path metric, but got: {}", 
                 metric.getClass().getName());
        return null;
    }
}

@github-actions github-actions bot added the community-reviewed PR has been reviewed by the community. label Oct 29, 2025
@sohurdc sohurdc requested a review from davidradl October 31, 2025 08:41
…etheus info-style metric

This commit enhances the Prometheus reporter to export lastCheckpointExternalPath
as an info-style metric, making it compatible with Prometheus and VictoriaMetrics.

Changes:
- Add CheckpointPathInfoCollector to handle checkpoint path metrics
- Export checkpoint path as label value with metric value 1.0
- Add _info suffix to follow Prometheus naming convention
- Add comprehensive unit tests (4 test cases)

The checkpoint path is now exported as:
flink_jobmanager_job_lastCheckpointExternalPath_info{...,path=\"hdfs://...\"} 1.0

This allows users to:
1. Store checkpoint paths in Prometheus/VictoriaMetrics
2. Join checkpoint path with other metrics via PromQL
3. Create dashboards and alerts based on checkpoint p
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-reviewed PR has been reviewed by the community.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants