Upkeep service for aggregator & signer #1791

Open · Alenar wants to merge 14 commits into main from djo/1707/wal_not_truncated

Conversation

@Alenar (Collaborator) commented Jun 28, 2024

Content

This PR adds a new service, the UpkeepService, to mithril-aggregator and mithril-signer:

  • Responsible for upkeep tasks that must be done periodically.
  • Runs at epoch change:
    • Aggregator: at the start of the Idle → Ready state machine transition, while the registration round is closed.
    • Signer: at the start of the Unregistered → Registered state machine transition.
  • Runs the following tasks (more can be added, as it is not limited to database upkeep; see the sketch after the note below):
    • Vacuum the main database to compact it and reduce fragmentation.
    • Execute PRAGMA wal_checkpoint(TRUNCATE) on both the main and the Cardano transaction databases to avoid unbounded growth of the WAL files.
  • Database upkeep tasks are skipped if a signed entity type is locked, since there is a high chance that a database is busy (e.g. when the Cardano transactions preloader is running).
  • This supersedes the startup vacuum that was added by Vacuum sqlite database at startup & after transactions preload #1771.

Note: the vacuum is not run on the Cardano transaction database, as it can be so large that vacuuming takes more than an hour on mainnet.
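For illustration, here is a minimal sketch of what such a run could look like. It uses the rusqlite crate and invented names (the UpkeepService struct, its fields, the has_locked_entities flag); the actual services use Mithril's own connection wrapper and SqliteCleaner, so this is a sketch of the behavior, not the real implementation:

use rusqlite::Connection;

/// Hypothetical upkeep runner; all names are illustrative.
pub struct UpkeepService {
    main_db: Connection,
    cardano_tx_db: Connection,
}

impl UpkeepService {
    pub fn run(&self, has_locked_entities: bool) -> rusqlite::Result<()> {
        if has_locked_entities {
            // A locked signed entity type means a database is likely busy
            // (e.g. the Cardano transactions preloader is writing): skip.
            return Ok(());
        }
        // Compact the main database and reduce fragmentation. The Cardano
        // transaction database is deliberately not vacuumed (see note above).
        self.main_db.execute_batch("VACUUM")?;
        // Truncate the WAL file of both databases so it stops growing.
        for db in [&self.main_db, &self.cardano_tx_db] {
            // The pragma returns a result row (busy, log, checkpointed).
            db.query_row("PRAGMA wal_checkpoint(TRUNCATE)", [], |_row| Ok(()))?;
        }
        Ok(())
    }
}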

Pre-submit checklist

  • Branch
    • Tests are provided (if possible)
    • Crates versions are updated (if relevant)
    • CHANGELOG file is updated (if relevant)
    • Commit sequence broadly makes sense
    • Key commits have useful messages
  • PR
    • No clippy warnings in the CI
    • Self-reviewed the diff
    • Useful pull request description
    • Reviewer requested
  • Documentation
    • Update README file (if relevant)
    • Update documentation website (if relevant)
    • Add dev blog post (if relevant)

Issue(s)

Closes #1707


github-actions bot commented Jun 28, 2024

Test Results

    4 files  ±0     51 suites  ±0   8m 36s ⏱️ -14s
1 110 tests +9  1 110 ✅ +9  0 💤 ±0  0 ❌ ±0 
1 258 runs  +9  1 258 ✅ +9  0 💤 ±0  0 ❌ ±0 

Results for commit c990901. ± Comparison against base commit 4409ebb.

This pull request removes 1 and adds 10 tests. Note that renamed tests count towards both.
mithril-persistence ‑ sqlite::test::calling_vacuum_on_an_empty_in_memory_db_should_not_fail
mithril-aggregator ‑ services::upkeep::tests::test_cleanup_database
mithril-aggregator ‑ services::upkeep::tests::test_doesnt_cleanup_db_if_any_entity_is_locked
mithril-common ‑ signed_entity_type_lock::tests::has_locked_entities
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_file_without_wal_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::cleanup_empty_in_memory_db_should_not_crash
mithril-persistence ‑ sqlite::cleaner::tests::test_truncate_wal
mithril-persistence ‑ sqlite::cleaner::tests::test_vacuum
mithril-signer ‑ runtime::runner::tests::test_upkeep
mithril-signer ‑ upkeep_service::tests::test_cleanup_database
mithril-signer ‑ upkeep_service::tests::test_doesnt_cleanup_db_if_any_entity_is_locked

♻️ This comment has been updated with latest results.

@jpraynaud (Member) left a comment

I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

 pub fn get_sqlite_dir(&self) -> PathBuf {
     let store_dir = &self.data_stores_directory;

-    if !store_dir.exists() {
+    if !store_dir.exists() && self.environment == ExecutionEnvironment::Production {
@jpraynaud (Member):

Why did you restrict the creation to production only?

@Alenar (Collaborator, Author):

To match previous behavior.

Previous code, from the build_sqlite_connection function in dependency_injection/builder.rs:

let connection_builder = match self.configuration.environment {
    ExecutionEnvironment::Production => ConnectionBuilder::open_file(
        &self.configuration.get_sqlite_dir().join(sqlite_file_name),
    ),
    _ if self.configuration.data_stores_directory.to_string_lossy() == ":memory:" => {
        ConnectionBuilder::open_memory()
    }
    _ => ConnectionBuilder::open_file(
        &self
            .configuration
            .data_stores_directory
            .join(sqlite_file_name),
    ),
};

There's a subtle difference between the first (Production) and last (test with file) branches:

  • the Production branch calls get_sqlite_dir, which uses configuration.data_stores_directory and creates the directory if it doesn't exist;
  • the test-with-file branch uses the configuration parameter directly, meaning it doesn't create the directory.

My memory of why we did that is blurry; I think it was related to integration tests.

I can remove this condition and check whether all tests still pass to see if it's still relevant.

mithril-signer/src/runtime/state_machine.rs (resolved review comment)
@@ -460,6 +460,8 @@ impl AggregatorRunnerTrait for AggregatorRunner {
     }

     async fn inform_new_epoch(&self, epoch: Epoch) -> StdResult<()> {
+        self.dependencies.upkeep_service.run().await?;
@jpraynaud (Member):

Maybe it's a better option to clean just after reopening the signer registration, in the try_transition_from_idle_to_ready function? It would also be more readable not to embed this in the inform_new_epoch function.

@Alenar (Collaborator, Author):

Since the upkeep service can lock the main database for a minute (for the vacuum), I thought the best moment to run it was when the aggregator is at its lowest load.
Also, since it runs once per epoch, it made sense to me to put it in inform_new_epoch, but I can create a new method in the runner if you prefer.

@sfauvel (Collaborator) left a comment

LGTM

@Alenar (Collaborator, Author) commented Jul 1, 2024

> I'm wondering about what will happen during the cleanup for the REST API of the aggregator: will it be locked? Can we handle that situation with a 503 response?

@jpraynaud Currently, if the repository layer fails because of an error (most likely a SQLITE_BUSY error in this case), the REST API returns a 500 response most of the time.

Returning another error code is possible:

To do so we need to rework pub fn internal_server_error<T: Into<InternalServerError>>(message: T) -> Box<dyn warp::Reply> from mithril-aggregator/src/http_server/routes/reply.rs. It would take only StdError as a parameter so it could downcast-ref it to a sqlite crate error, check the error code, see whether it corresponds to SQLITE_BUSY, and if so change the response from a 500 to a 503.

What do you think?
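For what it's worth, a rough sketch of that downcast idea, assuming StdError is an alias for anyhow::Error and using the sqlite crate's Error type (which carries an optional result code); the function name and shape are illustrative:

use warp::http::StatusCode;

const SQLITE_BUSY: isize = 5; // https://www.sqlite.org/rescode.html#busy

// Illustrative rework of internal_server_error: walk the error chain and
// map an underlying SQLITE_BUSY to a 503 instead of the default 500.
pub fn server_error(error: anyhow::Error) -> Box<dyn warp::Reply> {
    let is_busy = error
        .chain()
        .filter_map(|cause| cause.downcast_ref::<sqlite::Error>())
        .any(|e| e.code == Some(SQLITE_BUSY));

    let status = if is_busy {
        StatusCode::SERVICE_UNAVAILABLE // 503: database busy, retry later
    } else {
        StatusCode::INTERNAL_SERVER_ERROR // 500: current default
    };

    Box::new(warp::reply::with_status(error.to_string(), status))
}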

@jpraynaud (Member):
> To do so we need to rework pub fn internal_server_error [...] and if so change the response from a 500 to a 503.

I was thinking about a warp middleware that would return a 503 when the UpkeepService is running the cleanup (and that would be used on the routes which are impacted by the cleanup). But your solution can work as well 👍
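A rough sketch of that middleware idea, assuming the UpkeepService shares an "is running" flag with the HTTP layer; every name here is invented:

use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use warp::{http::StatusCode, Filter, Rejection, Reply};

#[derive(Debug)]
struct UpkeepInProgress;
impl warp::reject::Reject for UpkeepInProgress {}

// Guard filter: rejects incoming requests while the shared flag is set.
fn reject_during_upkeep(
    upkeep_running: Arc<AtomicBool>,
) -> impl Filter<Extract = (), Error = Rejection> + Clone {
    warp::any()
        .and_then(move || {
            let running = upkeep_running.load(Ordering::Relaxed);
            async move {
                if running {
                    Err(warp::reject::custom(UpkeepInProgress))
                } else {
                    Ok(())
                }
            }
        })
        .untuple_one()
}

// Recover handler that turns the rejection into a 503 response.
async fn handle_upkeep_rejection(err: Rejection) -> Result<impl Reply, Rejection> {
    if err.find::<UpkeepInProgress>().is_some() {
        Ok(warp::reply::with_status(
            "upkeep in progress, retry later",
            StatusCode::SERVICE_UNAVAILABLE,
        ))
    } else {
        Err(err)
    }
}

The impacted routes would then be wrapped as reject_during_upkeep(flag.clone()).and(route).recover(handle_upkeep_rejection).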

@input-output-hk input-output-hk deleted a comment from jpraynaud Jul 1, 2024
Alenar added 14 commits July 1, 2024 18:46
As of now the only upkeep task is a cleanup of the databases by
running a vacuum and/or a WAL checkpoint to reduce their sizes.
* Allow opening an in-memory database only in tests.
* Avoid getting the same path in two different ways, with the only
  difference being the creation of the dir if it doesn't exist.
And when the registration round is closed.
To not block the tokio runtime, since tasks such as the vacuum can take
some time (see the sketch after this commit list).
By checking that the tasks are done using the logs. The actual check
that the tasks do what they claim is done in the `SqliteCleaner` tests.
To make it use the same conventions as the aggregator.

* Rename signer `database::test_utils` to `database::test_helper` and
  move it to a dedicated file.
* Copy the same `TestLogger`.
* Sync aggregator & signer `CardanoTransactionImporter`.
Like the aggregator, for now the only upkeep task is to clean up
the databases by running a vacuum and/or a WAL checkpoint
to reduce their sizes.
Executed at the start of the `unregistered > registered` state machine
transition.

Note: In the integration test the stores have been changed from
independent `MemoryAdapter`s (which use hashmaps) to a `SQLiteAdapter`
with a shared in-memory connection, to make it closer to the production
environment.
* Since the UpkeepService now handles this cleanup task.
* Remove the `Vacuum` option from the ConnectionBuilder, as this was its
  only use.
* Remove the `vacuum_database` function and use `SqliteCleaner`
  everywhere instead, to streamline usage.
Since that would probably mean that some writes are currently occurring
on the database (e.g. when the Cardano transactions preloader is running).
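As an aside, the "not block the tokio runtime" commit above presumably offloads the long-running SQLite work to a blocking thread; a minimal sketch of that pattern, with an invented helper and path:

use tokio::task;

// Blocking helper: opens the database and runs the potentially long VACUUM.
fn vacuum_main_db() -> anyhow::Result<()> {
    let connection = rusqlite::Connection::open("./stores/aggregator.sqlite3")?;
    connection.execute_batch("VACUUM")?;
    Ok(())
}

async fn run_upkeep() -> anyhow::Result<()> {
    // spawn_blocking moves the work to tokio's blocking thread pool so the
    // async executor keeps serving other tasks (e.g. the HTTP routes).
    task::spawn_blocking(vacuum_main_db).await??;
    Ok(())
}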
@Alenar force-pushed the djo/1707/wal_not_truncated branch from bcb059a to c990901 on July 1, 2024 16:47
@Alenar temporarily deployed to testing-preview July 1, 2024 16:54 — with GitHub Actions
@Alenar temporarily deployed to testing-sanchonet July 1, 2024 16:54 — with GitHub Actions
Development

Successfully merging this pull request may close these issues.

SQLite WAL files are not truncated in signer and aggregator