-
Notifications
You must be signed in to change notification settings - Fork 468
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
doc: add read holds design doc #23158
Conversation
If `TO <time>` is not specified, the new time is the least common physical read frontier of all objects. | ||
The `HOLD` releases its current read hold and aquires a new read hold at the new time. | ||
|
||
The `ERROR_IF_OLDER_THAN` option (required, but defaults to `'24h'`) is a duration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we start with a more aggressive default? If mz's big value prop is O(minutes) freshness, than feels to me like this should also be O(minutes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 3h: 2h for our documented maintenance window + 1h for users to catch up. I have no opinion here, so am open to suggestions for longer or shorter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several in-line comments, but mostly
- uncertainty around whether manual advancement of read holds is more or less error prone than configuring an automatically advancing window. The design seems mostly in support of durable subscribes rather than other reasons you might want to modify compaction windows (e.g. to increase the odds of aligning inputs that are different amounts up to date, like matview stacks).
- uncertainty around how you control the initial state of these holds, and what the hold-free compaction behavior is. It seems likely that any created object will be compacted up to "now" at which point you can create holds on it, which again seems helpful for new subscribes, but not obviously for other tasks.
The `HOLD` releases its current read hold and aquires a new read hold at the new time. | ||
|
||
The `ERROR_IF_OLDER_THAN` option (required, but defaults to `'24h'`) is a duration. | ||
If difference between the least common physical read frontier of the objects and the `HOLD` time is greater than `ERROR_IF_OLDER_THAN`, the `HOLD`'s read hold is released and the `HOLD` enters an error state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"the least common physical read frontier of the objects" seems important here; what does it mean though? Should this be write frontier, instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If so, it seems more ergonomic to indicate "I would like this hold to lag the write frontier by duration
" rather than ask for repeated manual advancement and errors if you don't make it in time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Which is what we call a "custom compaction window" I think!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that "physical read frontier" needs a better definition, I'm working on that. I don't care too much if this is calculated from the write frontier or some read frontier.
I tried to describe why it would be wrong to automatically advance, and convinced myself we should automatically advance. I was trying to identify a situation in which a user would silently not realize they had missed a timestamp. But, if they are subscribing and need to restart, they are already doing an AS OF from their last timestamp, and either an error state or automatic advancement will correctly tell them if their timestamp is too old.
Updated text to say these are automatically advanced.
- Allow users to configure the compaction window of objects. | ||
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster. | ||
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These problems seem like the exist with read holds as well. A compaction window seems 100x less likely to result in OOMing a cluster than a hold that someone needs to remember to refresh, or when an automated refreshing task experiences an ongoing fault and can't be run. An ongoing automated refreshing task sounds very much like a compaction window we ask someone else to run for us. It seems like the positive, that the amount of history adapts to the use, is specifically tailored to subscribes, but I would want to make sure our compaction controls are general enough for multiple use cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree. Changed MAX LAG
to automatically advance, but still keeping manual advancement to allow for even more compaction opportunities. Thoughts about also removing manual compaction completely? Does it add too much complication for users, or are its memory benefits worth it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts about also removing manual compaction completely? Does it add too much complication for users, or are its memory benefits worth it?
I think the memory benefits are worth it! But, agreed that not every user will want the complexity of holds. That's why I've been pushing for re-enabling user-configurable compaction windows as part of this bundle of work: https://github.com/MaterializeInc/materialize/issues/16701. Is that still the plan?
IMO, the reason to do holds is so that you can drive advancement faster in the happy case. For users who just want a grace period where they can reconnect and resume where they left off, I think just setting a 5m (15m? 60m?) logical compaction window is a good way to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know user compaction was also desired now. Are we doing both that (at some point) and holds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I had mentioned to @chaas at some point! I'd love to do holds and user-definable compaction windows roughly simultaneously, as part of the same bundle of work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I must've missed/dropped that we wanted to include re-enabling user-configurable compaction windows more generally in this first phase of work. I had originally imagined us releasing read holds first to unblock durable subscribes for the reverse ETL use case, then moving on to general purpose user controlled compaction windows.
Do we know what the product use cases are specifically that we're supporting with them?
But if it's more efficient to design & build both together then I'm supportive.
I'm also fine with supporting only automatically advancing read holds (which is basically the same thing as custom compaction afaiu) as Matt suggested, if that narrows scope. The user will already need to provision their cluster for the worst-case scenario, so perhaps the manual advancement is not that useful?
@benesch I remember chatting about this, but were you thinking the benefit of having both read holds and custom compaction even if they're the same under the hood is that read holds are a more intuitive UX for the durable subscribe case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know what the product use cases are specifically that we're supporting with them?
Potentially similar product use cases to read holds! It may be that rETL tools would prefer to simply require a custom compaction window of, say, 1hr on all the models they read from, and on their side they will try very hard to recover from any failures in 1hr, accepting responsibility for processing a new snapshot.
I would think of this all as discovery work. We don’t know exactly how users will use these features ahead of time, and they’re in the weeds enough that we can’t meaningfully engage with users until we have working versions. It’s been realllly hard to get anyone to engage with read holds in the abstract. But I’m confident enough that both of these features are sensible and valuable that I’d like to advocate for introducing them both in private preview, and then we can iterate on them behind private preview as users show up with real use cases for each.
FWIW, if we have to choose on ordering, I’d rather re-add support for user configurable compaction windows first. (I was hoping we could do both concurrently, as read holds have some thorny design problems that configurable compaction windows do not have.) Configurable compaction windows are substantially simpler than read holds, and could potentially unblock rETL tools faster than building full support for read holds. The rETL tools could get started building resumable SUBSCRIBEs based on relations with, say, 1hr compaction windows, and then later, once we’ve shipped read holds, could learn to use read holds to substantially reduce resource requirements when things are keeping up.
I'm also fine with supporting only automatically advancing read holds (which is basically the same thing as custom compaction afaiu) as Matt suggested, if that narrows scope. The user will already need to provision their cluster for the worst-case scenario, so perhaps the manual advancement is not that useful?
The raison d’etre of read holds is that they support manual advancement! If we don’t support manual advancement of read holds, IMO we should not build read holds, and just do user-configurable compaction windows. At least, I can’t see what read holds that only support manual advancement offer over user-configurable compaction policies.
That said: I think read holds are an incredibly valuable tool, and one I’m very bullish on building. I’d like to ultimately support them with ~no restriction on the maximum value for MAX LAG
. You wouldn’t actually need to provision for the worst case scenario, I don’t think. You’d provision for the steady state, and then if your app had an issue that meant it failed to keep up with the upstream system and a bunch of historical revisions piled up, after bringing your app back online, you could temporarily size up your cluster to get through the backlog, then size it back down.
The `HOLD` releases its current read hold and aquires a new read hold at the new time. | ||
|
||
The `ERROR_IF_OLDER_THAN` option (required, but defaults to `'24h'`) is a duration. | ||
If difference between the least common physical read frontier of the objects and the `HOLD` time is greater than `ERROR_IF_OLDER_THAN`, the `HOLD`'s read hold is released and the `HOLD` enters an error state. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Which is what we call a "custom compaction window" I think!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excited about this!
- Allow users to configure the compaction window of objects. | ||
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster. | ||
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts about also removing manual compaction completely? Does it add too much complication for users, or are its memory benefits worth it?
I think the memory benefits are worth it! But, agreed that not every user will want the complexity of holds. That's why I've been pushing for re-enabling user-configurable compaction windows as part of this bundle of work: https://github.com/MaterializeInc/materialize/issues/16701. Is that still the plan?
IMO, the reason to do holds is so that you can drive advancement faster in the happy case. For users who just want a grace period where they can reconnect and resume where they left off, I think just setting a 5m (15m? 60m?) logical compaction window is a good way to go.
- Allow users to configure the compaction window of objects. | ||
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster. | ||
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know user compaction was also desired now. Are we doing both that (at some point) and holds?
Thanks for the feedback all! This is feeling really good now. |
- Allow users to configure the compaction window of objects. | ||
This has problems: the window could be too short for recovery or too long and overuse resources, OOMing a cluster. | ||
Having users routinely advance a `HOLD` should scale the extra resource use to exactly how far behind a user is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what I had mentioned to @chaas at some point! I'd love to do holds and user-definable compaction windows roughly simultaneously, as part of the same bundle of work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my perspective.
3004ed1
to
8572758
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that addresses all comments! Current plan is: merge this without discussing user compaction, and do user compaction in another PR/design doc.
Nice work! 🙇🏽 This looks really good. |
See MaterializeInc/database-issues#6895, MaterializeInc/database-issues#5094, #22976
Motivation
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.