Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: add read holds design doc #23158

Merged
merged 4 commits into from
Nov 29, 2023

Conversation

maddyblue
Copy link
Contributor

See MaterializeInc/database-issues#6895, MaterializeInc/database-issues#5094, #22976

Motivation

  • This PR adds a known-desirable feature.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • n/a

doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
If `TO <time>` is not specified, the new time is the least common physical read frontier of all objects.
The `HOLD` releases its current read hold and aquires a new read hold at the new time.

The `ERROR_IF_OLDER_THAN` option (required, but defaults to `'24h'`) is a duration.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we start with a more aggressive default? If mz's big value prop is O(minutes) freshness, than feels to me like this should also be O(minutes)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 3h: 2h for our documented maintenance window + 1h for users to catch up. I have no opinion here, so am open to suggestions for longer or shorter.

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
Copy link
Contributor

@frankmcsherry frankmcsherry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several in-line comments, but mostly

  1. uncertainty around whether manual advancement of read holds is more or less error prone than configuring an automatically advancing window. The design seems mostly in support of durable subscribes rather than other reasons you might want to modify compaction windows (e.g. to increase the odds of aligning inputs that are different amounts up to date, like matview stacks).
  2. uncertainty around how you control the initial state of these holds, and what the hold-free compaction behavior is. It seems likely that any created object will be compacted up to "now" at which point you can create holds on it, which again seems helpful for new subscribes, but not obviously for other tasks.

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
The `HOLD` releases its current read hold and aquires a new read hold at the new time.

The `ERROR_IF_OLDER_THAN` option (required, but defaults to `'24h'`) is a duration.
If difference between the least common physical read frontier of the objects and the `HOLD` time is greater than `ERROR_IF_OLDER_THAN`, the `HOLD`'s read hold is released and the `HOLD` enters an error state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the least common physical read frontier of the objects" seems important here; what does it mean though? Should this be write frontier, instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If so, it seems more ergonomic to indicate "I would like this hold to lag the write frontier by duration" rather than ask for repeated manual advancement and errors if you don't make it in time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Which is what we call a "custom compaction window" I think!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that "physical read frontier" needs a better definition, I'm working on that. I don't care too much if this is calculated from the write frontier or some read frontier.

I tried to describe why it would be wrong to automatically advance, and convinced myself we should automatically advance. I was trying to identify a situation in which a user would silently not realize they had missed a timestamp. But, if they are subscribing and need to restart, they are already doing an AS OF from their last timestamp, and either an error state or automatic advancement will correctly tell them if their timestamp is too old.

Updated text to say these are automatically advanced.

doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
Comment on lines +129 to +142
- Allow users to configure the compaction window of objects.
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster.
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These problems seem like the exist with read holds as well. A compaction window seems 100x less likely to result in OOMing a cluster than a hold that someone needs to remember to refresh, or when an automated refreshing task experiences an ongoing fault and can't be run. An ongoing automated refreshing task sounds very much like a compaction window we ask someone else to run for us. It seems like the positive, that the amount of history adapts to the use, is specifically tailored to subscribes, but I would want to make sure our compaction controls are general enough for multiple use cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. Changed MAX LAG to automatically advance, but still keeping manual advancement to allow for even more compaction opportunities. Thoughts about also removing manual compaction completely? Does it add too much complication for users, or are its memory benefits worth it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts about also removing manual compaction completely? Does it add too much complication for users, or are its memory benefits worth it?

I think the memory benefits are worth it! But, agreed that not every user will want the complexity of holds. That's why I've been pushing for re-enabling user-configurable compaction windows as part of this bundle of work: https://github.com/MaterializeInc/materialize/issues/16701. Is that still the plan?

IMO, the reason to do holds is so that you can drive advancement faster in the happy case. For users who just want a grace period where they can reconnect and resume where they left off, I think just setting a 5m (15m? 60m?) logical compaction window is a good way to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know user compaction was also desired now. Are we doing both that (at some point) and holds?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I had mentioned to @chaas at some point! I'd love to do holds and user-definable compaction windows roughly simultaneously, as part of the same bundle of work.

Copy link
Contributor

@chaas chaas Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I must've missed/dropped that we wanted to include re-enabling user-configurable compaction windows more generally in this first phase of work. I had originally imagined us releasing read holds first to unblock durable subscribes for the reverse ETL use case, then moving on to general purpose user controlled compaction windows.
Do we know what the product use cases are specifically that we're supporting with them?

But if it's more efficient to design & build both together then I'm supportive.

I'm also fine with supporting only automatically advancing read holds (which is basically the same thing as custom compaction afaiu) as Matt suggested, if that narrows scope. The user will already need to provision their cluster for the worst-case scenario, so perhaps the manual advancement is not that useful?

@benesch I remember chatting about this, but were you thinking the benefit of having both read holds and custom compaction even if they're the same under the hood is that read holds are a more intuitive UX for the durable subscribe case?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know what the product use cases are specifically that we're supporting with them?

Potentially similar product use cases to read holds! It may be that rETL tools would prefer to simply require a custom compaction window of, say, 1hr on all the models they read from, and on their side they will try very hard to recover from any failures in 1hr, accepting responsibility for processing a new snapshot.

I would think of this all as discovery work. We don’t know exactly how users will use these features ahead of time, and they’re in the weeds enough that we can’t meaningfully engage with users until we have working versions. It’s been realllly hard to get anyone to engage with read holds in the abstract. But I’m confident enough that both of these features are sensible and valuable that I’d like to advocate for introducing them both in private preview, and then we can iterate on them behind private preview as users show up with real use cases for each.

FWIW, if we have to choose on ordering, I’d rather re-add support for user configurable compaction windows first. (I was hoping we could do both concurrently, as read holds have some thorny design problems that configurable compaction windows do not have.) Configurable compaction windows are substantially simpler than read holds, and could potentially unblock rETL tools faster than building full support for read holds. The rETL tools could get started building resumable SUBSCRIBEs based on relations with, say, 1hr compaction windows, and then later, once we’ve shipped read holds, could learn to use read holds to substantially reduce resource requirements when things are keeping up.

I'm also fine with supporting only automatically advancing read holds (which is basically the same thing as custom compaction afaiu) as Matt suggested, if that narrows scope. The user will already need to provision their cluster for the worst-case scenario, so perhaps the manual advancement is not that useful?

The raison d’etre of read holds is that they support manual advancement! If we don’t support manual advancement of read holds, IMO we should not build read holds, and just do user-configurable compaction windows. At least, I can’t see what read holds that only support manual advancement offer over user-configurable compaction policies.

That said: I think read holds are an incredibly valuable tool, and one I’m very bullish on building. I’d like to ultimately support them with ~no restriction on the maximum value for MAX LAG. You wouldn’t actually need to provision for the worst case scenario, I don’t think. You’d provision for the steady state, and then if your app had an issue that meant it failed to keep up with the upstream system and a bunch of historical revisions piled up, after bringing your app back online, you could temporarily size up your cluster to get through the backlog, then size it back down.

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
The `HOLD` releases its current read hold and aquires a new read hold at the new time.

The `ERROR_IF_OLDER_THAN` option (required, but defaults to `'24h'`) is a duration.
If difference between the least common physical read frontier of the objects and the `HOLD` time is greater than `ERROR_IF_OLDER_THAN`, the `HOLD`'s read hold is released and the `HOLD` enters an error state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Which is what we call a "custom compaction window" I think!)

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
Copy link
Member

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excited about this!

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
Comment on lines +129 to +142
- Allow users to configure the compaction window of objects.
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster.
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts about also removing manual compaction completely? Does it add too much complication for users, or are its memory benefits worth it?

I think the memory benefits are worth it! But, agreed that not every user will want the complexity of holds. That's why I've been pushing for re-enabling user-configurable compaction windows as part of this bundle of work: https://github.com/MaterializeInc/materialize/issues/16701. Is that still the plan?

IMO, the reason to do holds is so that you can drive advancement faster in the happy case. For users who just want a grace period where they can reconnect and resume where they left off, I think just setting a 5m (15m? 60m?) logical compaction window is a good way to go.

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
Comment on lines +129 to +142
- Allow users to configure the compaction window of objects.
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster.
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not know user compaction was also desired now. Are we doing both that (at some point) and holds?

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
@maddyblue
Copy link
Contributor Author

Thanks for the feedback all! This is feeling really good now.

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
Comment on lines +129 to +142
- Allow users to configure the compaction window of objects.
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster.
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I had mentioned to @chaas at some point! I'd love to do holds and user-definable compaction windows roughly simultaneously, as part of the same bundle of work.

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
Copy link
Contributor

@jkosh44 jkosh44 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from my perspective.

Copy link
Contributor Author

@maddyblue maddyblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that addresses all comments! Current plan is: merge this without discussing user compaction, and do user compaction in another PR/design doc.

doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
doc/developer/design/20231102_read_holds.md Outdated Show resolved Hide resolved
@benesch
Copy link
Member

benesch commented Nov 28, 2023

Nice work! 🙇🏽 This looks really good.

@maddyblue maddyblue merged commit 96754ba into MaterializeInc:main Nov 29, 2023
9 checks passed
@maddyblue maddyblue deleted the read-holds-design branch November 29, 2023 04:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants