-
Notifications
You must be signed in to change notification settings - Fork 277
feature: expire snapshots action #1455
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feature: expire snapshots action #1455
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cmcarthur , thanks for the PR! I really like the overall design!
There may be some tweaks we can use to integrate with the transaction API better. I've put my thoughts in the comments
} | ||
|
||
// update the table metadata to remove the expired snapshots _before_ deleting anything! | ||
// TODO: make this retry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm planning to add retry logic in tx.commit
, so you don't have to retry here
issue: #1387
pub struct ExpireSnapshotsAction { | ||
table: Table, | ||
config: ExpireSnapshotsConfig, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of something like this so it leverages the new Transaction code logic:
pub trait ExpireSnapshots: Send + Sync {
/// Trigger tx.commit, delete files if tx.commit succeeded, and collect ExpireSnapshotsResult
async fn execute(&self, catalog: &dyn Catalog) -> Result<ExpireSnapshotsResult>;
fn table(&self) -> Table; // returns the table that this action operates on so ExpireSnapshotsAction doesn't have to hold a Table
}
impl TransactionAction for ExpireSnapshotsAction {
async fn commit(self: Arc<Self>, table: &Table) -> Result<ActionCommit> {
// does the actual snapshots cleaning
// wraps updates and requirements into ActionCommit for Catalog to update metadata
}
}
This should help avoid using tx.apply
directly (#1455 (comment))
wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I follow, let me give it a shot.
@@ -76,7 +76,7 @@ impl Transaction { | |||
Ok(()) | |||
} | |||
|
|||
fn apply( | |||
pub(crate) fn apply( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new Transaction API, Transaction should only hold actions
(pending actions to commit), updates
and requirements
should be removed after this PR: #1451
Based on this thought, tx.apply
shouldn't be accessible to anything except tx.commit
appreciate the feedback @CTTY -- I'll incorporate these changes and work on adding further tests. thanks! |
ok, thanks again @CTTY for the review. I think this is ready for another look. changes:
I am quite happy with how the result is factored. There are three primary entrypoints:
Next steps I'm considering:
@CTTY can you give this another look? |
In testing this on some larger tables, it's absolutely clear that some parallelism will be required in the initial load step in |
Hi, @cmcarthur This pr is too large to review. I would suggest to split them into several small prs, for example, the |
@liurenjie1024 thanks for the feedback!
agreed, it is large, I will break it into at least two smaller PRs
I hear you, and it makes sense to me to move this procedure out of the core library. But, do you mean that this procedure fits in this repo, but belongs in a different crate? Or that it doesn’t fit this repository? Implementing this in datafusion (say, as a UDF) does make sense, but I don’t think it should be the only way to call this. This procedure should exist with public rust apis for integration into systems that don’t depend on datafusion. How should I proceed? I could add a new crate in this repository, or implement the procedure in a separate repository. Open to your recommendations. |
The procedure implementation relies on a distributed(or parallel) computing engine. You may argue expire snapshot doesn't, but when it comes others like compaction, rewrite data files, we do need a compute engine. The reason I suggest to implement it in datafusion is that, as described in your issue, we will provide a working version for the community. Also it could serve as an example of integrating with the core library for other compute engines.
Ideally we could implement one, but it's difficult in rust. This involves designing an abstraction over different async runtime, memory management, etc. This is a complicated task which requires careful design, and collaboration from different compute engine community. For now, I don't see much value on this. |
Which issue does this PR close?
What changes are included in this PR?
ExpireSnapshotsAction
with extensive test coverageAre these changes tested?
I've added the following unit tests covering the happy path logic in the action:
test_builder_pattern
: unit test on the builder patterntest_collect_files_to_delete_logic
: unit test oncollect_files_to_delete
test_default_configuration
: tests defaults on builder patterntest_dry_run
: asserts that whenexecute
is called in dry run mode, no files are deleted and no transaction is generated or appliedThe remaining tests setup various scenarios and call the full action to ensure the correct commit is generated and applied, and the correct files are deleted: