catalog: Deserialize audit log in background #30642

Open · wants to merge 1 commit into main from audit-log-speedup

Conversation

@jkosh44 (Contributor) commented Nov 27, 2024

This commit optimizes the startup process to deserialize the audit log
in the background. Opening the durable catalog involves deserializing
all updates, migrating them, and then storing them in memory. Later,
the in-memory catalog takes each update, generates a builtin table
update, and applies that update in memory. Throughout the startup
process, audit logs are heavily special-cased and are only used to
generate builtin table updates. Additionally, the audit log is by far
the largest catalog collection and can take a long time to deserialize.

This commit creates a new thread at the beginning of the startup
process that deserializes all audit log updates. A handle for the
thread is plumbed throughout startup and then joined only when we
actually need the builtin table updates. By deserializing these updates
in the background we can reduce the total time spent in startup.
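
As a rough illustration, here is the shape of the change (a minimal
sketch; names like `RawAuditLogUpdate`, `deserialize_audit_log`, and
`startup` are hypothetical stand-ins, not the actual Materialize APIs):

```rust
use std::thread::{self, JoinHandle};

// Hypothetical stand-ins for the real catalog types.
struct RawAuditLogUpdate(Vec<u8>);
struct AuditLogUpdate;

fn deserialize_audit_log(raw: Vec<RawAuditLogUpdate>) -> Vec<AuditLogUpdate> {
    // The expensive part: decode every stored audit log update.
    raw.into_iter().map(|_bytes| AuditLogUpdate).collect()
}

fn startup(raw_audit_log: Vec<RawAuditLogUpdate>) {
    // Spawn the deserialization thread at the very start of startup.
    let handle: JoinHandle<Vec<AuditLogUpdate>> =
        thread::spawn(move || deserialize_audit_log(raw_audit_log));

    // ... the rest of startup proceeds here, overlapping with the
    // deserialization work happening on the background thread ...

    // Join as late as possible, i.e. only when the builtin table
    // updates are actually generated.
    let audit_log_updates = handle.join().expect("deserialization thread panicked");
    let _builtin_table_updates: Vec<_> = audit_log_updates; // generate updates here
}
```

The win comes entirely from overlap: the join is effectively free as
long as the rest of startup takes longer than the deserialization.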

To make this work, we have to disallow migrations of audit log
updates, because audit log updates now skip the migration step
entirely. In practice this is OK, because the audit log has its own
versioning scheme that lets us add new audit log variants without a
migration. The only thing we lose is the ability to rewrite old audit
logs.
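
For intuition, the kind of self-versioning this relies on looks
roughly like the following (hypothetical types, not the actual audit
log schema):

```rust
// Hypothetical sketch: each stored event is tagged with its version,
// so readers can decode old and new events side by side.
struct EventV1 {
    user: String,
}
struct EventV2 {
    user: String,
    details: String,
}

enum VersionedEvent {
    V1(EventV1),
    V2(EventV2),
    // Adding a V3 later is a pure extension: V1/V2 events already on
    // disk stay valid and never need an in-place migration. What we
    // give up is the ability to rewrite old events into new shapes.
}
```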

This speedup works only because other parts of startup take long enough
to hide the time spent deserializing the audit log. As those other
parts get faster, we will spend more and more of the deserialization
time blocked on the join. Some additional optimizations we can make in
the future are:

  • Remove the audit log from the catalog.
  • Make the catalog shard queryable and remove the need to generate
    builtin table updates.
  • Now that the audit log is truly append-only, we could lazily
    deserialize the audit log updates in order by ID (how to order them
    is something we'd still need to figure out), and stop once we've
    found the latest audit log update that is already in the builtin
    table; see the sketch after this list. Usually that will be the
    first audit log update we deserialize, unless we crashed after
    committing a catalog transaction but before updating the builtin
    tables.
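
A rough sketch of that last idea, assuming updates carry a
monotonically increasing numeric ID (all names here are hypothetical):

```rust
struct AuditLogUpdate;

fn deserialize(_bytes: &[u8]) -> AuditLogUpdate {
    // Stand-in for the real (expensive) decoding step.
    AuditLogUpdate
}

/// Walk the raw updates from newest to oldest and deserialize only
/// those newer than what the builtin table already contains. In the
/// common case this stops after the very first ID it inspects.
fn missing_updates(
    raw: &[(u64, Vec<u8>)], // (id, serialized bytes), sorted ascending by id
    max_id_in_builtin_table: u64,
) -> Vec<AuditLogUpdate> {
    raw.iter()
        .rev() // newest first
        .take_while(|(id, _)| *id > max_id_in_builtin_table)
        .map(|(_, bytes)| deserialize(bytes))
        .collect()
}
```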

Works towards resolving #MaterializeInc/database-issues/issues/8384

Motivation

This PR adds a known-desirable feature.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@jkosh44 jkosh44 force-pushed the audit-log-speedup branch 4 times, most recently from 5092227 to 8cca855 Compare November 27, 2024 20:26
@jkosh44 jkosh44 changed the title WIP: Run audit log deserialization in background catalog: Deserialize audit log in background Nov 27, 2024
@jkosh44 (Contributor, Author) commented Nov 27, 2024

In my staging env, I'm able to speed up startup times from

22.987157839s
22.062606634s
21.493512429s

to

19.273364726s
20.652808049s
19.888965235s
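
(Averaging the runs, that's a drop from ~22.2s to ~19.9s, i.e. roughly
2 seconds or ~10% of total startup time.)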

@jkosh44 jkosh44 marked this pull request as ready for review November 27, 2024 21:45
@jkosh44 jkosh44 requested a review from a team as a code owner November 27, 2024 21:45
shepherdlybot bot commented Nov 27, 2024

Risk Score: 80/100 · Bug Hotspots: 8 · Resilience Coverage: 33%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test
  • (Required) Observability 🔍 Detected
  • (Required) QA Review
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

This pull request carries a high risk score of 80, primarily driven by predictors such as "Sum Bug Reports Of Files" and "Delta of Executable Lines." Historically, PRs with these predictors are 115% more likely to introduce bugs compared to the repository baseline. Additionally, there are 8 file hotspots involved. While the observed and predicted bug trends for the repository are both decreasing, caution is still advised due to the significant historical risk associated with these predictors.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

File Percentile
../durable/upgrade.rs 97
../src/coord.rs 100
../src/catalog.rs 99
../durable/persist.rs 95
../src/main.rs 98
../src/lib.rs 95
../src/durable.rs 93
../durable/transaction.rs 93

@jkosh44 (Contributor, Author) commented Nov 27, 2024

(FYI @benesch)

@benesch (Member) commented Nov 28, 2024

Neat! So this saves a second or two, if I'm reading your numbers correctly?
