catalog: Deserialize audit log in background #30642

Open · wants to merge 1 commit into main from audit-log-speedup

Conversation

@jkosh44 (Contributor) commented Nov 27, 2024

This commit optimizes the startup process to deserialize the audit log
in the background. Opening the durable catalog involves deserializing
all updates, migrating them, and then storing them in memory. Later,
the in-memory catalog takes each update, generates a builtin table
update, and applies that update in memory. Throughout the startup
process, audit logs are heavily special-cased and are only used to
generate builtin table updates. Additionally, the audit log is by far
the largest catalog collection and can take a long time to deserialize.

This commit creates a new thread at the beginning of the startup
process that deserializes all audit log updates. A handle for the
thread is plumbed throughout startup and then joined only when we
actually need the builtin table updates. By deserializing these updates
in the background we can reduce the total time spent in startup.
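
As a rough illustration, here is the shape of the change (a minimal
sketch; names like `RawAuditLogUpdate`, `deserialize_audit_log`, and
`startup` are hypothetical stand-ins, not the actual Materialize APIs):

```rust
use std::thread::{self, JoinHandle};

// Hypothetical stand-ins for the real catalog types.
struct RawAuditLogUpdate(Vec<u8>);
struct AuditLogUpdate;

fn deserialize_audit_log(raw: Vec<RawAuditLogUpdate>) -> Vec<AuditLogUpdate> {
    // The expensive part: decode every stored audit log update.
    raw.into_iter().map(|_bytes| AuditLogUpdate).collect()
}

fn startup(raw_audit_log: Vec<RawAuditLogUpdate>) {
    // Spawn the deserialization thread at the very start of startup.
    let handle: JoinHandle<Vec<AuditLogUpdate>> =
        thread::spawn(move || deserialize_audit_log(raw_audit_log));

    // ... the rest of startup proceeds here, overlapping with the
    // deserialization work happening on the background thread ...

    // Join as late as possible, i.e. only when the builtin table
    // updates are actually generated.
    let audit_log_updates = handle.join().expect("deserialization thread panicked");
    let _builtin_table_updates: Vec<_> = audit_log_updates; // generate updates here
}
```

The win comes entirely from overlap: the join is effectively free as
long as the rest of startup takes longer than the deserialization.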

To make this work, we have to disallow migrations of audit log
updates, because audit log updates now skip the migration step
entirely. In practice this is OK, because the audit log has its own
versioning scheme that lets us add new audit log variants without a
migration. The only thing we lose is the ability to rewrite old audit
logs.
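
For intuition, the kind of self-versioning this relies on looks
roughly like the following (hypothetical types, not the actual audit
log schema):

```rust
// Hypothetical sketch: each stored event is tagged with its version,
// so readers can decode old and new events side by side.
struct EventV1 {
    user: String,
}
struct EventV2 {
    user: String,
    details: String,
}

enum VersionedEvent {
    V1(EventV1),
    V2(EventV2),
    // Adding a V3 later is a pure extension: V1/V2 events already on
    // disk stay valid and never need an in-place migration. What we
    // give up is the ability to rewrite old events into new shapes.
}
```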

This speedup works only because other parts of startup take long enough
to hide the time spent deserializing the audit log. As those other
parts get faster, we will spend more and more of the deserialization
time blocked on the join. Some additional optimizations we can make in
the future are:

  • Remove the audit log from the catalog.
  • Make the catalog shard queryable and remove the need to generate
    builtin table updates.
  • Now that the audit log is truly append-only, we could lazily
    deserialize the audit log updates in order by ID (how to order them
    is something we'd still need to figure out), and stop once we've
    found the latest audit log update that is already in the builtin
    table; see the sketch after this list. Usually that will be the
    first audit log update we deserialize, unless we crashed after
    committing a catalog transaction but before updating the builtin
    tables.
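
A rough sketch of that last idea, assuming updates carry a
monotonically increasing numeric ID (all names here are hypothetical):

```rust
struct AuditLogUpdate;

fn deserialize(_bytes: &[u8]) -> AuditLogUpdate {
    // Stand-in for the real (expensive) decoding step.
    AuditLogUpdate
}

/// Walk the raw updates from newest to oldest and deserialize only
/// those newer than what the builtin table already contains. In the
/// common case this stops after the very first ID it inspects.
fn missing_updates(
    raw: &[(u64, Vec<u8>)], // (id, serialized bytes), sorted ascending by id
    max_id_in_builtin_table: u64,
) -> Vec<AuditLogUpdate> {
    raw.iter()
        .rev() // newest first
        .take_while(|(id, _)| *id > max_id_in_builtin_table)
        .map(|(_, bytes)| deserialize(bytes))
        .collect()
}
```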

Works towards resolving #MaterializeInc/database-issues/issues/8384

Motivation

This PR adds a known-desirable feature.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@jkosh44 jkosh44 force-pushed the audit-log-speedup branch 4 times, most recently from 5092227 to 8cca855 Compare November 27, 2024 20:26
@jkosh44 jkosh44 changed the title WIP: Run audit log deserialization in background catalog: Deserialize audit log in background Nov 27, 2024
@jkosh44 (Contributor, Author) commented Nov 27, 2024

In my staging env, I'm able to speed up startup times from

22.987157839s
22.062606634s
21.493512429s

to

19.273364726s
20.652808049s
19.888965235s
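
(Averaging the runs, that's a drop from ~22.2s to ~19.9s, i.e. roughly
2 seconds or ~10% of total startup time.)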

@jkosh44 jkosh44 marked this pull request as ready for review November 27, 2024 21:45
@jkosh44 jkosh44 requested a review from a team as a code owner November 27, 2024 21:45
shepherdlybot bot commented Nov 27, 2024

Risk Score: 80/100 · Bug Hotspots: 8 · Resilience Coverage: 33%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test
  • (Required) Observability 🔍 Detected
  • (Required) QA Review
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

This pull request carries a high risk score of 80, primarily driven by predictors such as "Sum Bug Reports Of Files" and "Delta of Executable Lines." Historically, PRs with these predictors are 115% more likely to introduce bugs compared to the repository baseline. Additionally, there are 8 file hotspots involved. While the observed and predicted bug trends for the repository are both decreasing, caution is still advised due to the significant historical risk associated with these predictors.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

File Percentile
../durable/upgrade.rs 97
../src/coord.rs 100
../src/catalog.rs 99
../durable/persist.rs 95
../src/main.rs 98
../src/lib.rs 95
../src/durable.rs 93
../durable/transaction.rs 93

@jkosh44 (Contributor, Author) commented Nov 27, 2024

(FYI @benesch)

@benesch (Member) commented Nov 28, 2024

Neat! So this saves a second or two, if I'm reading your numbers correctly?
