Upkeep service for aggregator & signer #1791

Alenar · 2024-06-28T16:27:03Z

Content

This PR add a new service to mithril-aggregator and mithril-signer, the UpkeepService:

Responsible for upkeep tasks that must be done periodically
Run at epoch change:
- Aggregator: at the start of the Idle → Ready state machine transition, while the registration round is close.
- Signer: at the start of the Unregistered → Registered state machine transition.
It run the following tasks (but more can be added as it should not be limited to database upkeep):
- Vacuum the main database to compact it and reduce fragmentation.
- Execute a pragma wal_checkpoint(TRUNCATE) on both the main and the cardano transaction database to avoid infinite grow of the wal files.
Database upkeep tasks are skipped if a signed entity type is locked since there would be a high chance that a database is busy (like when the cardano transation preloader is running).
This supersede the startup vacuum that has been added by Vacuum sqlite database at startup & after transactions preload #1771.

Note: the vacuum is not run on the cardano transaction database as it can be so large that it takes more than an hour on the mainnet.

Pre-submit checklist

Branch
- Tests are provided (if possible)
- Crates versions are updated (if relevant)
- CHANGELOG file is updated (if relevant)
- Commit sequence broadly makes sense
- Key commits have useful messages
PR
- No clippy warnings in the CI
- Self-reviewed the diff
- Useful pull request description
- Reviewer requested
Documentation
- Update README file (if relevant)
- Update documentation website (if relevant)
- Add dev blog post (if relevant)

Issue(s)

Closes #1707

internal/mithril-persistence/src/sqlite/cleaner.rs

github-actions · 2024-06-28T16:33:53Z

Test Results

4 files ± 0 51 suites ±0 8m 51s ⏱️ -3s
1 114 tests +13 1 114 ✅ +13 0 💤 ±0 0 ❌ ±0
1 262 runs +13 1 262 ✅ +13 0 💤 ±0 0 ❌ ±0

Results for commit 35138ad. ± Comparison against base commit 5816ed4.

This pull request removes 1 and adds 14 tests. Note that renamed tests count towards both.

mithril-persistence ‑ sqlite::test::calling_vacuum_on_an_empty_in_memory_db_should_not_fail

mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_signer_registration_round_not_yet_opened_to_503
mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_std_error_to_500_by_default
mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_wrapped_sqlite_busy_error_to_503
mithril-aggregator ‑ runtime::runner::tests::test_upkeep
mithril-aggregator ‑ services::upkeep::tests::test_cleanup_database
mithril-aggregator ‑ services::upkeep::tests::test_doesnt_cleanup_db_if_any_entity_is_locked
mithril-common ‑ signed_entity_type_lock::tests::has_locked_entities
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_file_without_wal_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_in_memory_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::test_truncate_wal
…

♻️ This comment has been updated with latest results.

jpraynaud

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

internal/mithril-persistence/src/sqlite/cleaner.rs

mithril-aggregator/src/configuration.rs

mithril-signer/src/runtime/state_machine.rs

mithril-aggregator/src/runtime/runner.rs

sfauvel

LGTM

Alenar · 2024-07-01T14:53:02Z

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a SQLITE_BUSY error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs.
It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

jpraynaud · 2024-07-01T15:02:24Z

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a [SQLITE_BUSY](https://www.sqlite.org/rescode.html#busy) error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

I was thinking about a warp middleware that would return a 503 when the UpkeepService is running the cleanup (and that would be used on the routes which are impacted by the cleanup). But your solution can work as well 👍

As of now the only upkeep task is a cleanup of the databases by running a vacuum and/or a wal checkpoint to reduce their sizes.

* Allow to open in memory database only in test. * Avoid getting the same path from two different ways, with the only difference being the creation of the dir if it doesn't exist

And when the registration round is closed.

To not block the tokio runtime since tasks such as the vacuum can take some time.

By checking that the tasks are done using the logs. The actual check that the tasks do what they claims is done in the `SqliteCleaner` tests.

To make it use the same conventions as the aggregator. * Rename signer `database::test_utils` `database::test_helper` and move it to a dedicated file. * Copy the same `TestLogger` * Sync aggregator & signer `CardanoTransactionImporter`

Like the aggregator for now the only upkeep task is to cleanup the databases by running a vacuum and/or a wal checkpoint to reduce their sizes.

Executed at the start of the `unregistered > registered` state machine transition. Note: In the integration test the store have been changed from independent `MemoryAdapter` (that use hashmaps) to a `SQLiteAdapter` with a shared in memory connection to make it closer to the production environment.

* Since the UpkeepService now took this cleanup task. * Remove the `Vacuum` option from the ConnectionBuilder as it was the only use. * Remove `vacuum_database` function and instead use `SqliteCleaner` everywhere to streamline usage.

Since that would probably mean that some writing are currently occuring on the database (like if the cardano transactions preloader is running).

* Make signer run upkeep after registration * Make upkeep a separate task in aggregator runner * Make `Configuration::get_sqlite_dir` always create the directory if it doesn't exist instead of only in Production environment. As this behavior doesn't make tests crash anymore now. * Typo fix

This method will replace replace most usage of the `reply::internal_server_error` as it analyse as it can change the http error code returned based on the error given: * by default the error is `500 - InternalServerError` * For Sqlite Busy errors or RegistrationRoundNotYetOpened error the error code change to `503 - ServiceUnavailable`

Mostly replacing usage of `reply::internal_server_error` as the new function can return varius status code depending on the error.

As it's used as the reply message for multiple server side errors, not just `500 - InternalServerError`.

* Mithril-aggregator from `0.5.32` to `0.5.33` * Mithril-signer from `0.2.155` to `0.2.156` * Mithril-common from `0.4.24` to `0.4.25` * Mithril-persistence from `0.2.13` to `0.2.14`

Alenar requested review from sfauvel, jpraynaud and dlachaume June 28, 2024 16:27

github-advanced-security bot found potential problems Jun 28, 2024

View reviewed changes

internal/mithril-persistence/src/sqlite/cleaner.rs Fixed Show fixed Hide fixed

internal/mithril-persistence/src/sqlite/cleaner.rs Fixed Show fixed Hide fixed

Alenar temporarily deployed to testing-preview June 28, 2024 16:34 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-sanchonet June 28, 2024 16:34 — with GitHub Actions Inactive

Alenar force-pushed the djo/1707/wal_not_truncated branch from fd6fc4e to bcb059a Compare June 28, 2024 16:35

Alenar temporarily deployed to testing-preview June 28, 2024 16:43 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-sanchonet June 28, 2024 16:43 — with GitHub Actions Inactive

jpraynaud reviewed Jul 1, 2024

View reviewed changes

sfauvel approved these changes Jul 1, 2024

View reviewed changes

input-output-hk deleted a comment from jpraynaud Jul 1, 2024

Alenar force-pushed the djo/1707/wal_not_truncated branch from bcb059a to c990901 Compare July 1, 2024 16:47

Alenar temporarily deployed to testing-preview July 1, 2024 16:54 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-sanchonet July 1, 2024 16:54 — with GitHub Actions Inactive

Alenar added 12 commits July 2, 2024 15:46

Add UpkeepService to Aggregator

e7124ea

As of now the only upkeep task is a cleanup of the databases by running a vacuum and/or a wal checkpoint to reduce their sizes.

Wire UpkeepService in aggregator dependency injection

2b327cd

Simplify sqlite paths management in aggregator dependency injection

858fb40

* Allow to open in memory database only in test. * Avoid getting the same path from two different ways, with the only difference being the creation of the dir if it doesn't exist

Run the upkeep service at epoch change

3da8f7b

And when the registration round is closed.

Run the database upkeep task in another thread

a54f09b

To not block the tokio runtime since tasks such as the vacuum can take some time.

Add logs to aggregator upkeep service

4206b16

Move sqlite cleaning part to mithril-persistence so it can be shared

818349f

UpkeepService now use injected connection instead of building them

395ccff

Simplify aggregator UpkeepService tests

33f29c2

By checking that the tasks are done using the logs. The actual check that the tasks do what they claims is done in the `SqliteCleaner` tests.

Streamline signer test tool

d7da37a

To make it use the same conventions as the aggregator. * Rename signer `database::test_utils` `database::test_helper` and move it to a dedicated file. * Copy the same `TestLogger` * Sync aggregator & signer `CardanoTransactionImporter`

Add UpkeepService to Signer

de3dadf

Like the aggregator for now the only upkeep task is to cleanup the databases by running a vacuum and/or a wal checkpoint to reduce their sizes.

Alenar added 2 commits July 2, 2024 15:46

Skip database upkeep if any signed entity type is locked

95b56ed

Since that would probably mean that some writing are currently occuring on the database (like if the cardano transactions preloader is running).

Alenar force-pushed the djo/1707/wal_not_truncated branch from 840941f to 8fdd869 Compare July 2, 2024 13:46

Alenar temporarily deployed to testing-preview July 2, 2024 13:53 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-sanchonet July 2, 2024 13:53 — with GitHub Actions Inactive

Alenar force-pushed the djo/1707/wal_not_truncated branch from 8fdd869 to 1bb16f1 Compare July 2, 2024 14:05

Alenar temporarily deployed to testing-preview July 2, 2024 14:12 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-sanchonet July 2, 2024 14:12 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-preview July 2, 2024 14:21 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-sanchonet July 2, 2024 14:21 — with GitHub Actions Inactive

Alenar force-pushed the djo/1707/wal_not_truncated branch from 03d6fbd to dec18c5 Compare July 2, 2024 14:24

Alenar added 5 commits July 2, 2024 16:26

Use reply::server_error where applicable

db14085

Mostly replacing usage of `reply::internal_server_error` as the new function can return varius status code depending on the error.

Rename InternalServerError struct to ServerError

d7f53d6

As it's used as the reply message for multiple server side errors, not just `500 - InternalServerError`.

Update changelog & upgrade crates versions

35138ad

* Mithril-aggregator from `0.5.32` to `0.5.33` * Mithril-signer from `0.2.155` to `0.2.156` * Mithril-common from `0.4.24` to `0.4.25` * Mithril-persistence from `0.2.13` to `0.2.14`

Alenar force-pushed the djo/1707/wal_not_truncated branch from dec18c5 to 35138ad Compare July 2, 2024 14:26

Alenar temporarily deployed to testing-preview July 2, 2024 14:33 — with GitHub Actions Inactive

Alenar temporarily deployed to testing-sanchonet July 2, 2024 14:33 — with GitHub Actions Inactive

Alenar merged commit 36f9f4c into main Jul 2, 2024
43 of 44 checks passed

Alenar deleted the djo/1707/wal_not_truncated branch July 2, 2024 14:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Upkeep service for aggregator & signer #1791

Upkeep service for aggregator & signer #1791

Alenar commented Jun 28, 2024 •

edited

Loading

github-actions bot commented Jun 28, 2024 •

edited

Loading

jpraynaud left a comment

sfauvel left a comment

Alenar commented Jul 1, 2024 •

edited

Loading

jpraynaud commented Jul 1, 2024

Upkeep service for aggregator & signer #1791

Upkeep service for aggregator & signer #1791

Conversation

Alenar commented Jun 28, 2024 • edited Loading

Content

Pre-submit checklist

Issue(s)

github-actions bot commented Jun 28, 2024 • edited Loading

Test Results

jpraynaud left a comment

Choose a reason for hiding this comment

sfauvel left a comment

Choose a reason for hiding this comment

Alenar commented Jul 1, 2024 • edited Loading

jpraynaud commented Jul 1, 2024

Alenar commented Jun 28, 2024 •

edited

Loading

github-actions bot commented Jun 28, 2024 •

edited

Loading

Alenar commented Jul 1, 2024 •

edited

Loading