doc/design: add design doc for Distributed TimestampOracle #21977

Conversation

@aljoscha (Contributor) commented Sep 26, 2023

Rendered: https://github.com/aljoscha/materialize/blob/adapter-distributed-ts-oracle-design-doc/doc/developer/design/20230921_distributed_ts_oracle.md

There is a companion branch which has an implementation of the distributed TimestampOracle, along with the required/enabled optimizations (see the doc for what that means): https://github.com/aljoscha/materialize/tree/adapter-distributed-ts-oracle. This is what I used to get the benchmark results for the doc.

Motivation

Part of MaterializeInc/database-issues#6316

Tips for reviewer

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@aljoscha force-pushed the adapter-distributed-ts-oracle-design-doc branch from e43140f to fc5df7d on September 26, 2023 14:01
@jkosh44 (Contributor) left a comment


LGTM. I didn't see anything about how to prevent the real-time oracle timestamp from racing ahead of the wall clock. Are we just going to keep the same approach, or not worry about it?

Comment on lines +85 to +86
- `apply_write(write_ts)`: Marks a write at `write_ts` as completed. This has
implications for what can be returned from the other operations in the future.
Contributor:

You might want to spell out what those implications are here.
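
For reference, spelled out as doc comments on the oracle operations it might read roughly like this (a sketch using the operation names from the doc; the exact signatures and async-trait shape are assumptions):

```rust
// Sketch of the oracle operations from the design doc, with the
// implications of apply_write spelled out. Signatures are
// assumptions for illustration, not the actual interface.
#[async_trait::async_trait]
pub trait TimestampOracle<T> {
    /// Returns a timestamp that is safe to read at.
    async fn get_read_ts(&self) -> T;

    /// Allocates a new timestamp for a write.
    async fn allocate_write_ts(&self) -> T;

    /// Returns the current write timestamp without allocating one.
    async fn peek_write_ts(&self) -> T;

    /// Marks a write at `write_ts` as completed.
    ///
    /// Implications: afterwards, `get_read_ts()` must return a
    /// timestamp >= `write_ts`, so readers observe the write, and
    /// `allocate_write_ts()` must return timestamps > `write_ts`,
    /// so later writes are ordered strictly after it.
    async fn apply_write(&self, write_ts: T);
}
```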

Comment on lines +239 to +238
3. whenever we are not busy, determine a read timestamp for all pending peeks
using `get_read_ts()`
Contributor:

One note: once we have isolated/distributed serving processes, we'll lose some of the benefit of amortization, since the batching will happen per serving process. The same would have been true of `confirm_leadership`, so I don't think it's a big deal; just wanted to point that out.
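
As a sketch of the amortization being discussed (using the trait sketched above; `PendingPeek` is a hypothetical stand-in):

```rust
// Sketch: one get_read_ts() round-trip is amortized over all peeks
// that queued up while we were busy. PendingPeek is a hypothetical
// stand-in, not actual coordinator code.
struct PendingPeek;

impl PendingPeek {
    fn start_at(&self, _read_ts: u64) {
        // Hand the peek off for execution at the chosen timestamp.
    }
}

async fn assign_read_ts(
    oracle: &impl TimestampOracle<u64>,
    pending_peeks: &mut Vec<PendingPeek>,
) {
    // A single oracle call, regardless of how many peeks are pending.
    let read_ts = oracle.get_read_ts().await;
    for peek in pending_peeks.drain(..) {
        peek.start_at(read_ts);
    }
}
```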

Comment on lines +430 to +445
Storing Catalog timestamps in a separate oracle seems like the right
abstraction but we have to make more oracle operations. The good thing is that
timestamp operations can be pipelined/we can make requests to both oracles
concurrently. It _does_ mean, however, that a single read now needs two oracle
operations: one for getting the Catalog `read_ts` and one for getting the
real-time `read_ts`.
Contributor:

+1, I agree that this sounds like the right approach. I don't think `SELECT read_ts FROM timestamp_oracle WHERE timeline = $timeline;` is going to be much faster than a query like `SELECT read_ts FROM timestamp_oracle WHERE timeline IN [$real_timeline, $oracle_timeline];`

@aljoscha (Contributor Author):

Currently, a `TimestampOracle` is scoped to a timeline, and that's how the coordinator uses it. We can change that in the future and say that the timeline is passed in as a parameter, and that would also allow us to add, say, a `get_read_ts_multi(timelines: &[String])`. Then we could use one `SELECT` query to get multiple timestamps.

Is that what you had in mind? I hadn't thought about that before, but I think it's an excellent idea. I wouldn't change the `TimestampOracle` interface right now, though, because it requires more refactoring in the places where the oracle is used. The data model (using one `timestamp_oracle` table) would be ready for it, though, so it's easy to change the interface in the future. wdyt?

Contributor:

> Is that what you had in mind?

Yes, that's what I was thinking. If we go with the catalog-timeline approach, then I think 100% of the read and write queries will need both a user-timeline timestamp and a catalog-timeline timestamp, so it probably makes sense to combine the calls into a single query. Maybe even a method like `get_timeline_and_catalog_read_ts(timeline: String)` if you wanted something less general. Either way works, though.
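
As a sketch, the two shapes side by side (both methods are hypothetical; the backing queries assume the single `timestamp_oracle` table from the design doc):

```rust
// Sketch of the two hypothetical interface shapes discussed here;
// neither exists yet.

/// General shape: one SELECT covering any set of timelines, e.g.
///   SELECT timeline, read_ts FROM timestamp_oracle
///   WHERE timeline = ANY($1);
async fn get_read_ts_multi(_timelines: &[String]) -> Vec<(String, u64)> {
    unimplemented!("sketch only")
}

/// Narrower shape: always pairs a user timeline with the catalog
/// timeline (the "catalog" timeline name is an assumption), e.g.
///   SELECT timeline, read_ts FROM timestamp_oracle
///   WHERE timeline IN ($1, 'catalog');
async fn get_timeline_and_catalog_read_ts(_timeline: String) -> (u64, u64) {
    unimplemented!("sketch only")
}
```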

> I wouldn't change the `TimestampOracle` interface right now, though, because it requires more refactoring in the places where the oracle is used. The data model (using one `timestamp_oracle` table) would be ready for it, though, so it's easy to change the interface in the future. wdyt?

I think that's a good approach. It makes sense to save it for later since it's only a performance optimization of something that doesn't even exist yet.

Contributor:

Yeah, I like the "catalog is a timeline" approach a lot.

> However, it does not provide good separation/doesn't seem a good abstraction.

> I'm proposing that we store the Catalog timestamp in a separate `TimestampOracle`, if we decide to use oracles at all.

If that doesn't work out for some reason, IMO doing one query instead of two outweighs the abstraction awkwardness: two queries mean more tail latency and higher crdb costs, and the abstraction leakage would be minimal.
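
To illustrate the pipelining mentioned in the quoted section, issuing both lookups concurrently keeps the added latency close to a single round-trip (a sketch building on the trait sketched above):

```rust
// Sketch: querying the Catalog oracle and the real-time oracle
// concurrently, so a read pays roughly one round-trip of latency
// instead of two. Builds on the TimestampOracle trait sketched
// earlier in this thread.
use futures::join;

async fn read_timestamps(
    catalog_oracle: &impl TimestampOracle<u64>,
    realtime_oracle: &impl TimestampOracle<u64>,
) -> (u64, u64) {
    // Both requests are in flight at the same time.
    join!(
        catalog_oracle.get_read_ts(),
        realtime_oracle.get_read_ts(),
    )
}
```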

@aljoscha (Contributor Author):

> I didn't see anything about how to prevent the real-time oracle timestamp from racing ahead of the wall clock. Are we just going to keep the same approach, or not worry about it?

@jkosh44 Is that the thing where we use peek_write_ts() and then delay writes while we're ahead?

@jkosh44 (Contributor) commented Sep 26, 2023

> > I didn't see anything about how to prevent the real-time oracle timestamp from racing ahead of the wall clock. Are we just going to keep the same approach, or not worry about it?

> @jkosh44 Is that the thing where we use `peek_write_ts()` and then delay writes while we're ahead?

Yeah. I think it will continue to work as is, but maybe there's a chance for starvation if we have a ton of concurrent writes?
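
For context, the mechanism in question is roughly this (a sketch of the idea, not the actual coordinator code; the millisecond clock is an assumption):

```rust
// Sketch of the "don't race ahead of the wall clock" idea: if the
// next write timestamp is already ahead of real time, delay the
// write until the wall clock catches up. peek_write_ts() only looks
// at the current value and does not allocate.
use std::time::Duration;

async fn delay_write_if_ahead(
    oracle: &impl TimestampOracle<u64>,
    wall_clock_ms: u64,
) {
    let next_write_ts = oracle.peek_write_ts().await;
    if next_write_ts > wall_clock_ms {
        // The starvation concern: with many concurrent writes, each
        // write pushes the timestamp further ahead, so this delay
        // could keep growing.
        tokio::time::sleep(Duration::from_millis(next_write_ts - wall_clock_ms)).await;
    }
}
```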

@aljoscha (Contributor Author):

> Yeah. I think it will continue to work as is, but maybe there's a chance for starvation if we have a ton of concurrent writes?

I have not thought about it explicitly, but yes, I think it could be a problem. I also think it would be fixable with the proposed oracle.

I mention using a higher-precision timestamp as one of the open questions: https://github.com/aljoscha/materialize/blob/adapter-distributed-ts-oracle-design-doc/doc/developer/design/20230921_distributed_ts_oracle.md#what-sql-type-to-use-for-the-readwrite-timestamp. That would give us 1000x more timestamps within a second, but the rest of Materialize would also have to work at that higher precision. Using `timestamp` for the oracle table right now would at least make that migration easier in the future.

And I think this optimization (which Parker independently discovered) should also help with needing fewer write timestamps: https://github.com/aljoscha/materialize/blob/adapter-distributed-ts-oracle-design-doc/doc/developer/design/20230921_distributed_ts_oracle.md#batching-of-allocate_write_ts-operations.
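
As a sketch of that batching (hypothetical `PendingWrite`; the trait is the one sketched above):

```rust
// Sketch of batching allocate_write_ts: all writes that queued up
// while we were busy share a single write timestamp, so N concurrent
// writes consume one timestamp (and one oracle round-trip) instead
// of N. PendingWrite is a hypothetical stand-in.
struct PendingWrite;

impl PendingWrite {
    fn execute_at(&self, _write_ts: u64) {
        // Apply this write's updates at the shared timestamp.
    }
}

async fn assign_write_ts(
    oracle: &impl TimestampOracle<u64>,
    pending_writes: &mut Vec<PendingWrite>,
) {
    let write_ts = oracle.allocate_write_ts().await;
    for write in pending_writes.drain(..) {
        write.execute_at(write_ts);
    }
    // One apply_write for the whole batch.
    oracle.apply_write(write_ts).await;
}
```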

@aljoscha aljoscha added the T-platform-v2 Theme: Platform v2 label Sep 28, 2023
@aljoscha force-pushed the adapter-distributed-ts-oracle-design-doc branch from 57d83a3 to 9ce5bac on October 4, 2023 09:01
@aljoscha force-pushed the adapter-distributed-ts-oracle-design-doc branch from 9ce5bac to 6f433c2 on October 4, 2023 09:13
@aljoscha merged commit fabcc09 into MaterializeInc:main on Oct 4, 2023
@aljoscha deleted the adapter-distributed-ts-oracle-design-doc branch on October 4, 2023 09:26
@aljoscha self-assigned this on Nov 9, 2023
Labels: T-platform-v2 (Theme: Platform v2)