doc/design: add design doc for Distributed TimestampOracle #21977
Conversation
Force-pushed from e43140f to fc5df7d
LGTM. I didn't see anything about how to prevent the real-time oracle's timestamp from racing ahead of the wall clock. Are we just going to keep the same approach, or not worry about it?
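For context, a minimal sketch of the usual way such an oracle stays tied to the wall clock (this is my reading of the existing approach, assuming millisecond timestamps; it is not necessarily what the doc proposes):

```rust
// Hedged sketch: write timestamps follow the wall clock but must stay
// strictly increasing, so they only race ahead while writes arrive faster
// than the clock's precision can distinguish them.
fn next_write_ts(last_ts: u64, now_millis: u64) -> u64 {
    // Normally `now_millis` dominates; during a burst within a single
    // millisecond we tick past the wall clock by the minimum step.
    std::cmp::max(now_millis, last_ts + 1)
}
```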
> - `apply_write(write_ts)`: Marks a write at `write_ts` as completed. This has
>   implications for what can be returned from the other operations in the future.
You might want to spell out what those implications are here.
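For readers without the doc open, here is a rough sketch of the oracle interface under discussion, with those implications spelled out as comments. Only `apply_write` and `get_read_ts` appear in the excerpts above; the other names and signatures are assumptions, not the actual trait:

```rust
// Hypothetical shape of the oracle interface; a sketch, not the design
// doc's or the codebase's actual trait.
trait TimestampOracle<T> {
    /// Allocates a new timestamp for a write. Assumed to be strictly greater
    /// than any timestamp previously passed to `apply_write`.
    fn write_ts(&mut self) -> T;

    /// Returns a timestamp at which it is safe to read.
    fn get_read_ts(&mut self) -> T;

    /// Marks a write at `write_ts` as completed. The implication: after this
    /// call, `get_read_ts` may return timestamps at or beyond `write_ts`,
    /// and future `write_ts` allocations must be strictly greater than it.
    fn apply_write(&mut self, write_ts: T);
}
```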
> 3. whenever we are not busy, determine a read timestamp for all pending peeks
>    using `get_read_ts()`
One note is that when we start having isolated/distributed serving processes, we'll start to lose some of the benefit of amortization, since the batching will happen per serving process. The same would have been true of `confirm_leadership`, so I don't think it's a big deal; just wanted to point that out.
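To make the amortization concrete, a sketch of the batching step from the excerpt, assuming the `TimestampOracle` shape sketched earlier; `PendingPeek` and the reply channels are illustrative stand-ins:

```rust
use std::sync::mpsc::Sender;

// A stand-in for whatever a queued SELECT carries (plan, session, ...).
struct PendingPeek;

// One `get_read_ts()` round trip serves every peek that queued up while we
// were busy, so the per-peek oracle cost shrinks as load grows.
fn serve_pending_peeks<T: Copy>(
    oracle: &mut impl TimestampOracle<T>,
    pending: &mut Vec<(PendingPeek, Sender<T>)>,
) {
    if pending.is_empty() {
        return;
    }
    let read_ts = oracle.get_read_ts();
    for (_peek, reply) in pending.drain(..) {
        let _ = reply.send(read_ts);
    }
}
```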
> Storing Catalog timestamps in a separate oracle seems like the right
> abstraction but we have to make more oracle operations. The good thing is that
> timestamp operations can be pipelined/we can make requests to both oracles
> concurrently. It _does_ mean, however, that a single read now needs two oracle
> operations: one for getting the Catalog `read_ts` and one for getting the
> real-time `read_ts`.
+1, I agree that this sounds like the right approach. I don't think `SELECT read_ts FROM timestamp_oracle WHERE timeline = $timeline;` is going to be much faster than a query like `SELECT read_ts FROM timestamp_oracle WHERE timeline IN ($real_timeline, $oracle_timeline);`.
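The pipelining mentioned in the excerpt could look roughly like the following; `AsyncOracle` and its `get_read_ts` returning a future are assumptions for illustration, not existing APIs:

```rust
// `AsyncOracle` is a hypothetical handle to one oracle/timeline.
struct AsyncOracle { /* connection pool, timeline name, ... */ }

impl AsyncOracle {
    async fn get_read_ts(&self) -> u64 {
        unimplemented!("one SELECT against the timestamp_oracle table")
    }
}

// Issuing both oracle requests concurrently: total latency is roughly
// max(catalog, real-time) rather than their sum.
async fn read_timestamps(catalog: &AsyncOracle, realtime: &AsyncOracle) -> (u64, u64) {
    futures::join!(catalog.get_read_ts(), realtime.get_read_ts())
}
```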
Currently, a `TimestampOracle` is scoped to a timeline, and that's how the coordinator uses it. We can change that in the future and say that `timeline` is passed in as a parameter, and that would also allow us to add, say, a `get_read_ts_multi(timelines: &[String])`. Then we could use one `SELECT` query to get multiple timestamps.

Is that what you had in mind? I hadn't thought about that before but think it's an excellent idea. I wouldn't change the `TimestampOracle` interface right now, though, because it requires more refactorings in the places where the oracle is used. The data model (using one `timestamp_oracle` table) would be ready for that, though, so it's easy to change the interface in the future. wdyt?
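A sketch of what that extension might look like; `get_read_ts_multi` is hypothetical, and the SQL in the comment is one plausible backing query given the single-table data model described above:

```rust
// Hypothetical multi-timeline read, not part of the current interface.
trait TimestampOracleMulti<T> {
    /// Fetches read timestamps for several timelines in one round trip,
    /// backed by something like:
    ///   SELECT timeline, read_ts FROM timestamp_oracle
    ///   WHERE timeline = ANY ($timelines);
    fn get_read_ts_multi(&mut self, timelines: &[String]) -> Vec<(String, T)>;
}
```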
> Is that what you had in mind?
Yes, that's what I was thinking. If we go with the catalog timeline approach, then I think 100% of the read and write queries will need both a user timeline timestamp and a catalog timeline timestamp. So it probably makes sense to combine the call into a single query. Maybe even a method like `get_timeline_and_catalog_read_ts(timeline: String)` if you wanted something less general. Either way works, though.
> I wouldn't change the `TimestampOracle` interface right now, though, because it requires more refactorings in the places where the oracle is used. The data model (using one `timestamp_oracle` table) would be ready for that, though, so it's easy to change the interface in the future. wdyt?
I think that's a good approach. It makes sense to save it for later since it's only a performance optimization of something that doesn't even exist yet.
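For completeness, the less-general variant floated above might look like this (the method name comes from the comment; everything else is an assumption):

```rust
// Hypothetical convenience method: with the catalog-timeline approach,
// essentially every query needs both timestamps, so fetch them together.
trait CatalogAwareOracle<T> {
    /// One round trip returning (user timeline read_ts, catalog read_ts).
    fn get_timeline_and_catalog_read_ts(&mut self, timeline: String) -> (T, T);
}
```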
Yeah, I like the "catalog is a timeline" approach a lot.

However, it does not provide good separation/doesn't seem like a good abstraction. I'm proposing that we store the Catalog timestamp in a separate `TimestampOracle`, if we decide to use oracles at all.

If that doesn't work out for some reason, IMO doing one query instead of two outweighs the abstraction awkwardness. Two means more tail latency and higher CRDB costs, and the abstraction leakage would be minimal.
@jkosh44 Is that the thing where we use …

Yeah. I think it will continue to work as is, but maybe there's a chance for starvation if we have a ton of concurrent writes?
I have not thought about it explicitly, but yes, I think it could be a problem, though I also think it would be fixable with the proposed oracle. I mention using a higher-precision timestamp as one of the open questions: https://github.com/aljoscha/materialize/blob/adapter-distributed-ts-oracle-design-doc/doc/developer/design/20230921_distributed_ts_oracle.md#what-sql-type-to-use-for-the-readwrite-timestamp. That would give us 1000x more timestamps within a second, but the rest of Materialize would also have to work at that higher precision.

And I think this optimization (which Parker independently discovered) should also help with needing fewer write timestamps: https://github.com/aljoscha/materialize/blob/adapter-distributed-ts-oracle-design-doc/doc/developer/design/20230921_distributed_ts_oracle.md#batching-of-allocate_write_ts-operations.
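Under the assumption that concurrent writes may share a single write timestamp (group commit), the linked batching optimization could be sketched like this, reusing the hypothetical `TimestampOracle` shape from earlier:

```rust
use std::sync::mpsc::Sender;

// One `write_ts()` round trip; every write that queued up while the previous
// allocation was in flight shares the same timestamp, so heavy write load
// consumes timestamps (and oracle round trips) much more slowly.
fn allocate_write_ts_batched<T: Copy>(
    oracle: &mut impl TimestampOracle<T>,
    waiting: &mut Vec<Sender<T>>,
) {
    if waiting.is_empty() {
        return;
    }
    let write_ts = oracle.write_ts();
    for reply in waiting.drain(..) {
        let _ = reply.send(write_ts);
    }
    // `apply_write(write_ts)` still has to be called once the batched writes
    // have actually landed.
}
```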
Force-pushed from 57d83a3 to 9ce5bac

Force-pushed from 9ce5bac to 6f433c2
Rendered: https://github.com/aljoscha/materialize/blob/adapter-distributed-ts-oracle-design-doc/doc/developer/design/20230921_distributed_ts_oracle.md
There is a companion branch which has an implementation of the distributed `TimestampOracle`, along with the required/enabled optimizations (see the doc for what that means): https://github.com/aljoscha/materialize/tree/adapter-distributed-ts-oracle. This is what I used to get the benchmark results for the doc.
Motivation
Part of MaterializeInc/database-issues#6316
Tips for reviewer
Checklist
- If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.