controller: export wallclock lag metrics also for storage collections #30568

Open, wants to merge 3 commits into base: main

@teskje (Contributor) commented Nov 19, 2024

This PR rewires the existing mz_dataflow_wallclock_lag_seconds metric so that it also includes storage collections. To this end, a new ControllerMetrics type is introduced to define metrics that are exported by both the compute and the storage controller, and the wallclock lag metrics are moved there. The ControllerMetrics type is then passed to both controllers, so they can export wallclock lag metrics for their respective collections.

Note that in contrast to compute collections, the wallclock lag for storage collections is not per replica (as we are comparing with the global persist frontier here), and in some cases not even per cluster (as not all storage collections are associated with clusters). As a result, the replica_id label is always empty for storage collections, and the instance_id label is sometimes empty.
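The label behavior described above can be illustrated with a minimal, std-only sketch. It is not the code from this PR: the real `ControllerMetrics` type wraps the metrics registry, whereas the map-backed store, the `LagLabels` struct, the `collection_id` label, and the helper functions below are all hypothetical stand-ins chosen for illustration. The point it shows is that one cloneable handle is shared by both controllers, and that storage collections leave `replica_id` empty and `instance_id` empty when no cluster is associated.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Label set for one wallclock lag series (hypothetical representation).
/// An empty string stands in for a label that does not apply.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct LagLabels {
    pub instance_id: String,
    pub replica_id: String,
    /// Assumed label; not named in the PR description.
    pub collection_id: String,
}

/// Stand-in for the shared metrics handle passed to both controllers.
/// Cloning shares the same underlying store.
#[derive(Clone, Default)]
pub struct ControllerMetrics {
    lag_seconds: Arc<Mutex<HashMap<LagLabels, u64>>>,
}

impl ControllerMetrics {
    /// Record the latest lag observation for a series.
    pub fn observe_lag(&self, labels: LagLabels, lag_seconds: u64) {
        self.lag_seconds.lock().unwrap().insert(labels, lag_seconds);
    }

    /// Read back the latest lag observation for a series, if any.
    pub fn lag(&self, labels: &LagLabels) -> Option<u64> {
        self.lag_seconds.lock().unwrap().get(labels).copied()
    }
}

/// Compute collections are tracked per cluster and per replica.
pub fn compute_labels(instance_id: &str, replica_id: &str, collection_id: &str) -> LagLabels {
    LagLabels {
        instance_id: instance_id.to_string(),
        replica_id: replica_id.to_string(),
        collection_id: collection_id.to_string(),
    }
}

/// Storage collections never carry a replica, and only sometimes a cluster.
pub fn storage_labels(instance_id: Option<&str>, collection_id: &str) -> LagLabels {
    LagLabels {
        instance_id: instance_id.unwrap_or("").to_string(),
        replica_id: String::new(),
        collection_id: collection_id.to_string(),
    }
}
```

Anyone querying the metric should therefore expect storage series to appear with an empty `replica_id`, and should not rely on `instance_id` being set for them.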

Motivation

  • This PR adds a known-desirable feature.

Part of https://github.com/MaterializeInc/database-issues/issues/8235

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

In preparation for having the storage controller export wallclock lag
metrics too, this commit factors out the common infrastructure from
`mz-compute-client` and moves it into `mz-cluster-client`.

This commit makes minor changes to the code structure around
`CollectionState` initialization in the storage controller. This removes
some redundancy, but more importantly makes it easier to attach
`WallclockLagMetrics` to the `CollectionState` in the next commit.

This commit wires the `ControllerMetrics` through to the storage
controller, creates `WallclockLagMetrics` objects in the
`CollectionState`s/`ExportState`s of all storage collections, and uses
those to update the wallclock lag metrics during every maintenance call.

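As a rough sketch of what a per-maintenance update could compute, the lag for one collection is the difference between the current wallclock time and the collection's frontier. The function below is hypothetical and is not taken from this PR: it assumes both values are expressed as milliseconds since the Unix epoch, clamps the lag to zero when the frontier is ahead of the wallclock, and treats a missing (empty) frontier as zero lag.

```rust
use std::time::Duration;

/// Hypothetical helper: lag between wallclock time and a collection's
/// frontier, both given in milliseconds since the Unix epoch.
///
/// Assumptions of this sketch (not confirmed by the PR):
/// - a frontier ahead of the wallclock yields zero lag, not an error;
/// - `None` (no frontier left to compare against) yields zero lag.
pub fn wallclock_lag(now_ms: u64, frontier_ms: Option<u64>) -> Duration {
    match frontier_ms {
        Some(frontier) => Duration::from_millis(now_ms.saturating_sub(frontier)),
        None => Duration::ZERO,
    }
}
```

For example, `wallclock_lag(10_000, Some(7_500))` yields a lag of 2.5 seconds.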
@teskje teskje marked this pull request as ready for review November 21, 2024 17:04
@teskje teskje requested a review from a team as a code owner November 21, 2024 17:04

shepherdlybot bot commented Nov 21, 2024

Risk Score: 81 / 100, Bug Hotspots: 2, Resilience Coverage: 0%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review
  • (Required) Feature Flag
  • (Required) Integration Test
  • (Required) Observability
  • (Required) QA Review
  • (Required) Run Nightly Tests
  • Unit Test

Risk Summary:

The pull request has a high-risk score of 81, driven by the predictors "Sum Bug Reports Of Files" and "Delta of Executable Lines." Historically, pull requests with these predictors are 115% more likely to cause a bug compared to the repository baseline. The repository's observed bug trend remains steady.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

File                         Percentile
../src/lib.rs                98
../controller/instance.rs    99
