catalog: Deserialize audit log in background #30642
base: main
Conversation
Force-pushed from 5092227 to 8cca855
Force-pushed from 8cca855 to e093826
In my staging env, I'm able to speed up startup times from 22.987157839s to 19.273364726s.
Mitigations: Completing required mitigations increases Resilience Coverage.
Risk Summary: This pull request carries a high risk score of 80, primarily driven by predictors such as "Sum Bug Reports Of Files" and "Delta of Executable Lines." Historically, PRs with these predictors are 115% more likely to introduce bugs compared to the repository baseline. Additionally, there are 8 file hotspots involved. While the observed and predicted bug trends for the repository are both decreasing, caution is still advised due to the significant historical risk associated with these predictors. Note: the risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.
(FYI @benesch)
Neat! So this saves a second or two, if I'm reading your numbers correctly?
This commit optimizes the startup process to deserialize the audit log
in the background. Opening the durable catalog involves deserializing
all updates, migrating them, and then storing them in memory. Later the
in-memory catalog will take each update and generate a builtin table
update and apply that update in memory. Throughout the startup process,
audit logs are heavily special-cased and only used to generate builtin
table updates. Additionally, audit logs are by far the largest catalog
collection and can take a long time to deserialize.
This commit creates a new thread at the beginning of the startup
process that deserializes all audit log updates. A handle for the
thread is plumbed throughout startup and then joined only when we
actually need the builtin table updates. By deserializing these updates
in the background we can reduce the total time spent in startup.
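Very roughly, the shape of the change looks like the sketch below. This is not the actual catalog code: `RawAuditLogEntry`, `AuditLogEvent`, `BuiltinTableUpdate`, and `deserialize_audit_log` are hypothetical stand-ins, and in the real change the handle is plumbed through the startup path rather than living in a single function.

```rust
use std::thread::{self, JoinHandle};

struct RawAuditLogEntry(Vec<u8>);

#[derive(Debug)]
struct AuditLogEvent(String);

struct BuiltinTableUpdate(String);

fn deserialize_audit_log(raw: Vec<RawAuditLogEntry>) -> Vec<AuditLogEvent> {
    // Stand-in for the expensive per-update decoding.
    raw.into_iter()
        .map(|entry| AuditLogEvent(String::from_utf8_lossy(&entry.0).into_owned()))
        .collect()
}

fn main() {
    let raw: Vec<RawAuditLogEntry> = (0..1_000)
        .map(|i| RawAuditLogEntry(format!("event-{i}").into_bytes()))
        .collect();

    // Spawn the deserialization work at the very start of startup...
    let audit_log_handle: JoinHandle<Vec<AuditLogEvent>> =
        thread::spawn(move || deserialize_audit_log(raw));

    // ...run the rest of startup here (open the catalog, run migrations, ...)...

    // ...and join only at the point where the builtin table updates are needed.
    let builtin_table_updates: Vec<BuiltinTableUpdate> = audit_log_handle
        .join()
        .expect("audit log deserialization thread panicked")
        .into_iter()
        .map(|event| BuiltinTableUpdate(event.0))
        .collect();

    println!("generated {} builtin table updates", builtin_table_updates.len());
}
```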
In order for this all to work, we have to disallow migrating audit log
updates, because now the audit log updates skip over migrations. In
practice this is OK, because the audit log has its own versioning
scheme that allows us to add new audit log variants without a
migration. The only thing we lose is the ability to rewrite old audit
logs.
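To illustrate why no migration is needed, the audit log's versioning scheme behaves roughly like the sketch below (the `VersionedEvent` enum and `decode` function are invented for this example, not the real audit log types): adding a new version only adds a new decoding arm, so existing entries never need to be rewritten.

```rust
#[derive(Debug)]
enum VersionedEvent {
    V1 { user: String, action: String },
    // Added later without any catalog migration; decoding of V1 entries is untouched.
    V2 { user: String, action: String, cluster: Option<String> },
}

fn decode(version: u64, fields: &[&str]) -> Option<VersionedEvent> {
    match version {
        1 => Some(VersionedEvent::V1 {
            user: fields.first()?.to_string(),
            action: fields.get(1)?.to_string(),
        }),
        2 => Some(VersionedEvent::V2 {
            user: fields.first()?.to_string(),
            action: fields.get(1)?.to_string(),
            cluster: fields.get(2).map(|c| c.to_string()),
        }),
        _ => None,
    }
}

fn main() {
    // Old entries keep decoding exactly as before...
    println!("{:?}", decode(1, &["alice", "CREATE TABLE"]));
    // ...and new entries use the new variant, with no rewrite of the old ones.
    println!("{:?}", decode(2, &["bob", "CREATE CLUSTER", "quickstart"]));
}
```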
This speedup works only because other parts of startup take long enough
to hide the time spent deserializing the audit log. As other parts of
startup get faster, waiting on the deserialization thread at the join
point will take up a larger share of startup. Some additional
optimizations we can make in the future are:
- Remove the audit log from the catalog.
- Make the catalog shard queryable and remove the need to generate
  builtin table updates.
- Now that the audit log is truly append-only, we could lazily
  deserialize the audit log updates in order by ID (how to order them
  is something we'd need to figure out), and stop once we've found
  the latest audit log update that is already in the builtin table.
  In general that will usually be the first audit log update we
  deserialize, unless we crashed after committing a catalog
  transaction but before updating the builtin tables. (A rough sketch
  of this idea follows the list.)
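A rough sketch of that last idea, assuming updates can be walked newest-first by ID; `missing_audit_events` and the types here are hypothetical, not existing catalog APIs:

```rust
use std::collections::HashSet;

fn missing_audit_events(
    raw_newest_first: impl IntoIterator<Item = (u64, Vec<u8>)>,
    already_in_builtin_table: &HashSet<u64>,
) -> Vec<(u64, String)> {
    let mut missing = Vec::new();
    for (id, bytes) in raw_newest_first {
        if already_in_builtin_table.contains(&id) {
            // Everything from here on down is already reflected in the
            // builtin table, so we can stop deserializing.
            break;
        }
        // The expensive decode only runs for the (usually tiny) missing suffix.
        missing.push((id, String::from_utf8_lossy(&bytes).into_owned()));
    }
    missing
}

fn main() {
    let applied: HashSet<u64> = (1..=98).collect();
    let raw = (1..=100u64).rev().map(|id| (id, format!("event-{id}").into_bytes()));
    // Only events 100 and 99 get deserialized; the loop stops at 98.
    println!("{:?}", missing_audit_events(raw, &applied));
}
```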
Works towards resolving #MaterializeInc/database-issues/issues/8384
Motivation
This PR adds a known-desirable feature.
Checklist
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.