Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upkeep service for aggregator & signer #1791

Merged
merged 19 commits into from
Jul 2, 2024
Merged

Conversation

Alenar
Copy link
Collaborator

@Alenar Alenar commented Jun 28, 2024

Content

This PR add a new service to mithril-aggregator and mithril-signer, the UpkeepService:

  • Responsible for upkeep tasks that must be done periodically
  • Run at epoch change:
    • Aggregator: at the start of the Idle → Ready state machine transition, while the registration round is close.
    • Signer: at the start of the Unregistered → Registered state machine transition.
  • It run the following tasks (but more can be added as it should not be limited to database upkeep):
    • Vacuum the main database to compact it and reduce fragmentation.
    • Execute a pragma wal_checkpoint(TRUNCATE) on both the main and the cardano transaction database to avoid infinite grow of the wal files.
  • Database upkeep tasks are skipped if a signed entity type is locked since there would be a high chance that a database is busy (like when the cardano transation preloader is running).
  • This supersede the startup vacuum that has been added by Vacuum sqlite database at startup & after transactions preload #1771.

Note: the vacuum is not run on the cardano transaction database as it can be so large that it takes more than an hour on the mainnet.

Pre-submit checklist

  • Branch
    • Tests are provided (if possible)
    • Crates versions are updated (if relevant)
    • CHANGELOG file is updated (if relevant)
    • Commit sequence broadly makes sense
    • Key commits have useful messages
  • PR
    • No clippy warnings in the CI
    • Self-reviewed the diff
    • Useful pull request description
    • Reviewer requested
  • Documentation
    • Update README file (if relevant)
    • Update documentation website (if relevant)
    • Add dev blog post (if relevant)

Issue(s)

Closes #1707

Copy link

github-actions bot commented Jun 28, 2024

Test Results

    4 files  ± 0     51 suites  ±0   8m 51s ⏱️ -3s
1 114 tests +13  1 114 ✅ +13  0 💤 ±0  0 ❌ ±0 
1 262 runs  +13  1 262 ✅ +13  0 💤 ±0  0 ❌ ±0 

Results for commit 35138ad. ± Comparison against base commit 5816ed4.

This pull request removes 1 and adds 14 tests. Note that renamed tests count towards both.
mithril-persistence ‑ sqlite::test::calling_vacuum_on_an_empty_in_memory_db_should_not_fail
mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_signer_registration_round_not_yet_opened_to_503
mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_std_error_to_500_by_default
mithril-aggregator ‑ http_server::routes::reply::tests::test_server_error_convert_wrapped_sqlite_busy_error_to_503
mithril-aggregator ‑ runtime::runner::tests::test_upkeep
mithril-aggregator ‑ services::upkeep::tests::test_cleanup_database
mithril-aggregator ‑ services::upkeep::tests::test_doesnt_cleanup_db_if_any_entity_is_locked
mithril-common ‑ signed_entity_type_lock::tests::has_locked_entities
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_file_without_wal_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_in_memory_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::test_truncate_wal
…

♻️ This comment has been updated with latest results.

Copy link
Member

@jpraynaud jpraynaud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

internal/mithril-persistence/src/sqlite/cleaner.rs Outdated Show resolved Hide resolved
mithril-aggregator/src/configuration.rs Outdated Show resolved Hide resolved
mithril-signer/src/runtime/state_machine.rs Outdated Show resolved Hide resolved
mithril-aggregator/src/runtime/runner.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@sfauvel sfauvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Alenar
Copy link
Collaborator Author

Alenar commented Jul 1, 2024

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a SQLITE_BUSY error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs.
It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

@jpraynaud
Copy link
Member

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently if the repository layer fails because of an error (most likely a [SQLITE_BUSY](https://www.sqlite.org/rescode.html#busy) error in this case) the REST api returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework the pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would only take StdError as parameter so it could downcast ref it to a sqlite crate error, check the error code, see if it correspond to an SQLITE_BUSY and if yes changes the response from a 500 to a 503.

What do you think ?

I was thinking about a warp middleware that would return a 503 when the UpkeepService is running the cleanup (and that would be used on the routes which are impacted by the cleanup). But your solution can work as well 👍

@input-output-hk input-output-hk deleted a comment from jpraynaud Jul 1, 2024
@Alenar Alenar force-pushed the djo/1707/wal_not_truncated branch from bcb059a to c990901 Compare July 1, 2024 16:47
@Alenar Alenar temporarily deployed to testing-preview July 1, 2024 16:54 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-sanchonet July 1, 2024 16:54 — with GitHub Actions Inactive
Alenar added 12 commits July 2, 2024 15:46
As of now the only upkeep task is a cleanup of the databases by
running a vacuum and/or a wal checkpoint to reduce their sizes.
* Allow to open in memory database only in test.
* Avoid getting the same path from two different ways, with the only
  difference being the creation of the dir if it doesn't exist
And when the registration round is closed.
To not block the tokio runtime since tasks such as the vacuum can take
some time.
By checking that the tasks are done using the logs. The actual check
that the tasks do what they claims is done in the `SqliteCleaner` tests.
To make it use the same conventions as the aggregator.

* Rename signer `database::test_utils` `database::test_helper` and move
  it to a dedicated file.
* Copy the same `TestLogger`
* Sync aggregator & signer `CardanoTransactionImporter`
Like the aggregator for now the only upkeep task is to cleanup
the databases by running a vacuum and/or a wal checkpoint
to reduce their sizes.
Executed at the start of the `unregistered > registered` state machine
transition.

Note: In the integration test the store have been changed from
independent `MemoryAdapter` (that use hashmaps) to a `SQLiteAdapter`
with a shared in memory connection to make it closer to the production
environment.
* Since the UpkeepService now took this cleanup task.
* Remove the `Vacuum` option from the ConnectionBuilder as it was the
  only use.
* Remove `vacuum_database` function and instead use `SqliteCleaner`
  everywhere to streamline usage.
Since that would probably mean that some writing are currently occuring
on the database (like if the cardano transactions preloader is running).
@Alenar Alenar force-pushed the djo/1707/wal_not_truncated branch from 840941f to 8fdd869 Compare July 2, 2024 13:46
@Alenar Alenar temporarily deployed to testing-preview July 2, 2024 13:53 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-sanchonet July 2, 2024 13:53 — with GitHub Actions Inactive
@Alenar Alenar force-pushed the djo/1707/wal_not_truncated branch from 8fdd869 to 1bb16f1 Compare July 2, 2024 14:05
@Alenar Alenar temporarily deployed to testing-preview July 2, 2024 14:12 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-sanchonet July 2, 2024 14:12 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-preview July 2, 2024 14:21 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-sanchonet July 2, 2024 14:21 — with GitHub Actions Inactive
@Alenar Alenar force-pushed the djo/1707/wal_not_truncated branch from 03d6fbd to dec18c5 Compare July 2, 2024 14:24
* Make signer run upkeep after registration
* Make upkeep a separate task in aggregator runner
* Make `Configuration::get_sqlite_dir` always create the directory if it
  doesn't exist instead of only in Production environment. As this
  behavior doesn't make tests crash anymore now.
* Typo fix
This method will replace replace most usage of the
`reply::internal_server_error` as it analyse as it can change the http
error code returned based on the error given:
* by default the error is `500 - InternalServerError`
* For Sqlite Busy errors or RegistrationRoundNotYetOpened error the
  error code change to `503 - ServiceUnavailable`
Mostly replacing usage of `reply::internal_server_error` as the new
function can return varius status code depending on the error.
As it's used as the reply message for multiple server side errors, not
just `500 - InternalServerError`.
* Mithril-aggregator from `0.5.32` to `0.5.33`
* Mithril-signer from `0.2.155` to `0.2.156`
* Mithril-common from `0.4.24` to `0.4.25`
* Mithril-persistence from `0.2.13` to `0.2.14`
@Alenar Alenar force-pushed the djo/1707/wal_not_truncated branch from dec18c5 to 35138ad Compare July 2, 2024 14:26
@Alenar Alenar temporarily deployed to testing-preview July 2, 2024 14:33 — with GitHub Actions Inactive
@Alenar Alenar temporarily deployed to testing-sanchonet July 2, 2024 14:33 — with GitHub Actions Inactive
@Alenar Alenar merged commit 36f9f4c into main Jul 2, 2024
43 of 44 checks passed
@Alenar Alenar deleted the djo/1707/wal_not_truncated branch July 2, 2024 14:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQLite WAL files are not truncated in signer and aggregator
3 participants