Upkeep service for aggregator & signer #1791

Open · Alenar wants to merge 14 commits into main from djo/1707/wal_not_truncated

Conversation

@Alenar (Collaborator) commented Jun 28, 2024

Content

This PR adds a new service, the UpkeepService, to mithril-aggregator and mithril-signer:

  • Responsible for upkeep tasks that must be done periodically.
  • Runs at epoch change:
    • Aggregator: at the start of the Idle → Ready state machine transition, while the registration round is closed.
    • Signer: at the start of the Unregistered → Registered state machine transition.
  • Runs the following tasks (more can be added, as it is not limited to database upkeep; see the sketch after the note below):
    • Vacuum the main database to compact it and reduce fragmentation.
    • Execute PRAGMA wal_checkpoint(TRUNCATE) on both the main and the Cardano transaction databases to avoid unbounded growth of the WAL files.
  • Database upkeep tasks are skipped if a signed entity type is locked, since there is a high chance that a database is busy (e.g. when the Cardano transactions preloader is running).
  • This supersedes the startup vacuum that was added by Vacuum sqlite database at startup & after transactions preload #1771.

Note: the vacuum is not run on the Cardano transaction database, as it can be so large that vacuuming takes more than an hour on mainnet.
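For illustration, here is a minimal sketch of what such a run could look like. It uses the rusqlite crate and invented names (the UpkeepService struct, its fields, the has_locked_entities flag); the actual services use Mithril's own connection wrapper and SqliteCleaner, so this is a sketch of the behavior, not the real implementation:

use rusqlite::Connection;

/// Hypothetical upkeep runner; all names are illustrative.
pub struct UpkeepService {
    main_db: Connection,
    cardano_tx_db: Connection,
}

impl UpkeepService {
    pub fn run(&self, has_locked_entities: bool) -> rusqlite::Result<()> {
        if has_locked_entities {
            // A locked signed entity type means a database is likely busy
            // (e.g. the Cardano transactions preloader is writing): skip.
            return Ok(());
        }
        // Compact the main database and reduce fragmentation. The Cardano
        // transaction database is deliberately not vacuumed (see note above).
        self.main_db.execute_batch("VACUUM")?;
        // Truncate the WAL file of both databases so it stops growing.
        for db in [&self.main_db, &self.cardano_tx_db] {
            // The pragma returns a result row (busy, log, checkpointed).
            db.query_row("PRAGMA wal_checkpoint(TRUNCATE)", [], |_row| Ok(()))?;
        }
        Ok(())
    }
}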

Pre-submit checklist

  • Branch
    • Tests are provided (if possible)
    • Crates versions are updated (if relevant)
    • CHANGELOG file is updated (if relevant)
    • Commit sequence broadly makes sense
    • Key commits have useful messages
  • PR
    • No clippy warnings in the CI
    • Self-reviewed the diff
    • Useful pull request description
    • Reviewer requested
  • Documentation
    • Update README file (if relevant)
    • Update documentation website (if relevant)
    • Add dev blog post (if relevant)

Issue(s)

Closes #1707


github-actions bot commented Jun 28, 2024

Test Results

    4 files  ±0     51 suites  ±0   8m 36s ⏱️ -14s
1 110 tests +9  1 110 ✅ +9  0 💤 ±0  0 ❌ ±0 
1 258 runs  +9  1 258 ✅ +9  0 💤 ±0  0 ❌ ±0 

Results for commit c990901. ± Comparison against base commit 4409ebb.

This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.
mithril-persistence ‑ sqlite::test::calling_vacuum_on_an_empty_in_memory_db_should_not_fail
mithril-aggregator ‑ services::upkeep::tests::test_cleanup_database
mithril-aggregator ‑ services::upkeep::tests::test_doesnt_cleanup_db_if_any_entity_is_locked
mithril-common ‑ signed_entity_type_lock::tests::has_locked_entities
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_file_without_wal_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_in_memory_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::test_truncate_wal
mithril-persistence ‑ sqlite::cleaner::tests::test_vacuum
mithril-signer ‑ runtime::runner::tests::test_upkeep
mithril-signer ‑ upkeep_service::tests::test_cleanup_database
mithril-signer ‑ upkeep_service::tests::test_doesnt_cleanup_db_if_any_entity_is_locked

♻️ This comment has been updated with latest results.

@jpraynaud (Member) left a comment

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

 pub fn get_sqlite_dir(&self) -> PathBuf {
     let store_dir = &self.data_stores_directory;

-    if !store_dir.exists() {
+    if !store_dir.exists() && self.environment == ExecutionEnvironment::Production {
@jpraynaud (Member):

Why did you restrict the creation to production only?

@Alenar (Collaborator, Author):

To match previous behavior.

Previous code, from the build_sqlite_connection function in dependency_injection/builder.rs:

let connection_builder = match self.configuration.environment {
    ExecutionEnvironment::Production => ConnectionBuilder::open_file(
        &self.configuration.get_sqlite_dir().join(sqlite_file_name),
    ),
    _ if self.configuration.data_stores_directory.to_string_lossy() == ":memory:" => {
        ConnectionBuilder::open_memory()
    }
    _ => ConnectionBuilder::open_file(
        &self
            .configuration
            .data_stores_directory
            .join(sqlite_file_name),
    ),
};

There's a subtle difference between the first (Production) and last (test with file) branches:

  • the Production branch calls get_sqlite_dir, which uses configuration.data_stores_directory and creates the directory if it doesn't exist;
  • the test-with-file branch uses the configuration parameter directly, meaning it doesn't create the directory.

My memory of why we did that is blurry; I think it was related to integration tests.

I can remove this condition and check whether all tests still pass to see if it's still relevant.

mithril-signer/src/runtime/state_machine.rs (resolved review comment)
@@ -460,6 +460,8 @@ impl AggregatorRunnerTrait for AggregatorRunner {
     }

     async fn inform_new_epoch(&self, epoch: Epoch) -> StdResult<()> {
+        self.dependencies.upkeep_service.run().await?;
@jpraynaud (Member):

Maybe it's a better option to clean just after reopening the signer registration, in the try_transition_from_idle_to_ready function? It would also be more readable not to embed this in the inform_new_epoch function.

@Alenar (Collaborator, Author):

Since the upkeep service can lock the main database for a minute (for the vacuum), I thought the best moment to run it was when the aggregator is at its lowest load.
Also, since it runs once per epoch, it made sense to me to put it in inform_new_epoch, but I can create a new method in the runner if you prefer.

@sfauvel (Collaborator) left a comment

LGTM

@Alenar (Collaborator, Author) commented Jul 1, 2024

> I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently, if the repository layer fails because of an error (most likely a SQLITE_BUSY error in this case), the REST API returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would take only StdError as a parameter so it could downcast-ref it to a sqlite crate error, check the error code, see whether it corresponds to SQLITE_BUSY, and if so change the response from a 500 to a 503.

What do you think?
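For what it's worth, a rough sketch of that downcast idea, assuming StdError is an alias for anyhow::Error and using the sqlite crate's Error type (which carries an optional result code); the function name and shape are illustrative:

use warp::http::StatusCode;

const SQLITE_BUSY: isize = 5; // https://www.sqlite.org/rescode.html#busy

// Illustrative rework of internal_server_error: walk the error chain and
// map an underlying SQLITE_BUSY to a 503 instead of the default 500.
pub fn server_error(error: anyhow::Error) -> Box<dyn warp::Reply> {
    let is_busy = error
        .chain()
        .filter_map(|cause| cause.downcast_ref::<sqlite::Error>())
        .any(|e| e.code == Some(SQLITE_BUSY));

    let status = if is_busy {
        StatusCode::SERVICE_UNAVAILABLE // 503: database busy, retry later
    } else {
        StatusCode::INTERNAL_SERVER_ERROR // 500: current default
    };

    Box::new(warp::reply::with_status(error.to_string(), status))
}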

@jpraynaud (Member):
> To do so we need to rework pub fn internal_server_error [...] and if so change the response from a 500 to a 503.

I was thinking about a warp middleware that would return a 503 when the UpkeepService is running the cleanup (and that would be used on the routes which are impacted by the cleanup). But your solution can work as well 👍
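A rough sketch of that middleware idea, assuming the UpkeepService shares an "is running" flag with the HTTP layer; every name here is invented:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use warp::{http::StatusCode, Filter, Rejection, Reply};

#[derive(Debug)]
struct UpkeepInProgress;
impl warp::reject::Reject for UpkeepInProgress {}

// Guard filter: rejects incoming requests while the shared flag is set.
fn reject_during_upkeep(
    upkeep_running: Arc<AtomicBool>,
) -> impl Filter<Extract = (), Error = Rejection> + Clone {
    warp::any()
        .and_then(move || {
            let running = upkeep_running.load(Ordering::Relaxed);
            async move {
                if running {
                    Err(warp::reject::custom(UpkeepInProgress))
                } else {
                    Ok(())
                }
            }
        })
        .untuple_one()
}

// Recover handler that turns the rejection into a 503 response.
async fn handle_upkeep_rejection(err: Rejection) -> Result<impl Reply, Rejection> {
    if err.find::<UpkeepInProgress>().is_some() {
        Ok(warp::reply::with_status(
            "upkeep in progress, retry later",
            StatusCode::SERVICE_UNAVAILABLE,
        ))
    } else {
        Err(err)
    }
}

The impacted routes would then be wrapped as reject_during_upkeep(flag.clone()).and(route).recover(handle_upkeep_rejection).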

@input-output-hk input-output-hk deleted a comment from jpraynaud Jul 1, 2024
Alenar added 14 commits July 1, 2024 18:46
As of now the only upkeep task is a cleanup of the databases by
running a vacuum and/or a WAL checkpoint to reduce their sizes.
* Allow opening an in-memory database only in tests.
* Avoid getting the same path in two different ways, with the only
  difference being the creation of the dir if it doesn't exist.
And when the registration round is closed.
To not block the tokio runtime, since tasks such as the vacuum can take
some time (see the sketch after this commit list).
By checking that the tasks are done using the logs. The actual check
that the tasks do what they claim is done in the `SqliteCleaner` tests.
To make it use the same conventions as the aggregator.

* Rename signer `database::test_utils` to `database::test_helper` and
  move it to a dedicated file.
* Copy the same `TestLogger`.
* Sync aggregator & signer `CardanoTransactionImporter`.
Like the aggregator, for now the only upkeep task is to clean up
the databases by running a vacuum and/or a WAL checkpoint
to reduce their sizes.
Executed at the start of the `unregistered > registered` state machine
transition.

Note: In the integration test the stores have been changed from
independent `MemoryAdapter`s (which use hashmaps) to a `SQLiteAdapter`
with a shared in-memory connection, to make it closer to the production
environment.
* Since the UpkeepService now handles this cleanup task.
* Remove the `Vacuum` option from the ConnectionBuilder, as this was its
  only use.
* Remove the `vacuum_database` function and use `SqliteCleaner`
  everywhere instead, to streamline usage.
Since that would probably mean that some writes are currently occurring
on the database (e.g. when the Cardano transactions preloader is running).
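As an aside, the "not block the tokio runtime" commit above presumably offloads the long-running SQLite work to a blocking thread; a minimal sketch of that pattern, with an invented helper and path:

use tokio::task;

// Blocking helper: opens the database and runs the potentially long VACUUM.
fn vacuum_main_db() -> anyhow::Result<()> {
    let connection = rusqlite::Connection::open("./stores/aggregator.sqlite3")?;
    connection.execute_batch("VACUUM")?;
    Ok(())
}

async fn run_upkeep() -> anyhow::Result<()> {
    // spawn_blocking moves the work to tokio's blocking thread pool so the
    // async executor keeps serving other tasks (e.g. the HTTP routes).
    task::spawn_blocking(vacuum_main_db).await??;
    Ok(())
}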
@Alenar force-pushed the djo/1707/wal_not_truncated branch from bcb059a to c990901 on July 1, 2024 16:47
@Alenar temporarily deployed to testing-preview July 1, 2024 16:54 — with GitHub Actions
@Alenar temporarily deployed to testing-sanchonet July 1, 2024 16:54 — with GitHub Actions
Development

Successfully merging this pull request may close these issues.

SQLite WAL files are not truncated in signer and aggregator