Skip to content

feature: expire snapshots action #1455

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

cmcarthur
Copy link

Which issue does this PR close?

What changes are included in this PR?

  • Adds ExpireSnapshotsAction with extensive test coverage

Are these changes tested?

I've added the following unit tests covering the happy path logic in the action:

  • test_builder_pattern: unit test on the builder pattern
  • test_collect_files_to_delete_logic: unit test on collect_files_to_delete
  • test_default_configuration: tests defaults on builder pattern
  • test_dry_run: asserts that when execute is called in dry run mode, no files are deleted and no transaction is generated or applied

The remaining tests setup various scenarios and call the full action to ensure the correct commit is generated and applied, and the correct files are deleted:

    maintenance::expire_snapshots::tests::test_empty_snapshot_ids_list
    maintenance::expire_snapshots::tests::test_expire_current_snapshot_error
    maintenance::expire_snapshots::tests::test_expire_readonly_table_error
    maintenance::expire_snapshots::tests::test_expire_snapshots_many_snapshots_retain_last
    maintenance::expire_snapshots::tests::test_expire_snapshots_older_than_timestamp
    maintenance::expire_snapshots::tests::test_expire_snapshots_one_snapshot
    maintenance::expire_snapshots::tests::test_expire_snapshots_zero_snapshots
    maintenance::expire_snapshots::tests::test_expire_specific_snapshot_ids
    maintenance::expire_snapshots::tests::test_max_concurrent_deletes_configuration
    maintenance::expire_snapshots::tests::test_nonexistent_snapshot_ids
    maintenance::expire_snapshots::tests::test_retain_last_minimum_validation

Copy link
Contributor

@CTTY CTTY left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cmcarthur , thanks for the PR! I really like the overall design!

There may be some tweaks we can use to integrate with the transaction API better. I've put my thoughts in the comments

}

// update the table metadata to remove the expired snapshots _before_ deleting anything!
// TODO: make this retry
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm planning to add retry logic in tx.commit, so you don't have to retry here

issue: #1387

pub struct ExpireSnapshotsAction {
table: Table,
config: ExpireSnapshotsConfig,
}
Copy link
Contributor

@CTTY CTTY Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking of something like this so it leverages the new Transaction code logic:

pub trait ExpireSnapshots: Send + Sync {
    /// Trigger tx.commit, delete files if tx.commit succeeded, and collect ExpireSnapshotsResult
    async fn execute(&self, catalog: &dyn Catalog) -> Result<ExpireSnapshotsResult>;
    fn table(&self) -> Table; // returns the table that this action operates on so ExpireSnapshotsAction doesn't have to hold a Table 
}

impl TransactionAction for ExpireSnapshotsAction {
   async fn commit(self: Arc<Self>, table: &Table) -> Result<ActionCommit> {
    // does the actual snapshots cleaning
    // wraps updates and requirements into ActionCommit for Catalog to update metadata
  }
}

This should help avoid using tx.apply directly (#1455 (comment))

wdyt?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I follow, let me give it a shot.

@@ -76,7 +76,7 @@ impl Transaction {
Ok(())
}

fn apply(
pub(crate) fn apply(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new Transaction API, Transaction should only hold actions (pending actions to commit), updates and requirements should be removed after this PR: #1451

Based on this thought, tx.apply shouldn't be accessible to anything except tx.commit

@cmcarthur
Copy link
Author

appreciate the feedback @CTTY -- I'll incorporate these changes and work on adding further tests. thanks!

@cmcarthur cmcarthur requested a review from CTTY June 20, 2025 18:17
@cmcarthur
Copy link
Author

ok, thanks again @CTTY for the review. I think this is ready for another look. changes:

  • Added tests to cover more realistic scenarios where new snapshots reference new and old manifests, ensuring that old manifests are not deleted if they are still referenced.
  • Removed the visibility change to apply in transaction/mod.rs.
  • Refactored ExpireSnapshotAction to impl TransactionAction. Separated out (and renamed) the remaining bit to ExpireSnapshotProcedure

I am quite happy with how the result is factored. There are three primary entrypoints:

  • ExpireSnapshotsProcedureBuilder can be used to build an expire snapshots procedure. It's also callable from Table e.g. table.expire_snapshots()....build(). On build, you get a configured ExpireSnapshotsProcedure
  • You can then call ExpireSnapshotsProcedure.execute(table, catalog) to run the procedure in full
  • There is a lower level ExpireSnapshotsAction that implements TransactionAction and can be used to simply remove snapshots from a table

Next steps I'm considering:

  • Perhaps now that the procedure does not hold a table, it would make sense to remove the table.expire_snapshots() shorthand
  • Should ExpireSnapshotsAction actually be RemoveSnapshotsAction?

@CTTY can you give this another look?

@cmcarthur
Copy link
Author

In testing this on some larger tables, it's absolutely clear that some parallelism will be required in the initial load step in collect_files_to_delete -- calling load_manifest_list in a loop on a table with 10,000+ manifest lists is simply too slow. Parallelizing that step and gathering the results should be easy enough.

@liurenjie1024
Copy link
Contributor

Hi, @cmcarthur This pr is too large to review. I would suggest to split them into several small prs, for example, the ExpireSnapshotAction could be a good start. Also as mentioned in #1453 (comment), I don't think it's a good idea to put procedure into core library. You could try to extend datafusion to do this.

@cmcarthur
Copy link
Author

@liurenjie1024 thanks for the feedback!

This pr is too large to review.

agreed, it is large, I will break it into at least two smaller PRs

I don't think it's a good idea to put procedure into core library. You could try to extend datafusion to do this

I hear you, and it makes sense to me to move this procedure out of the core library. But, do you mean that this procedure fits in this repo, but belongs in a different crate? Or that it doesn’t fit this repository?

Implementing this in datafusion (say, as a UDF) does make sense, but I don’t think it should be the only way to call this. This procedure should exist with public rust apis for integration into systems that don’t depend on datafusion.

How should I proceed? I could add a new crate in this repository, or implement the procedure in a separate repository. Open to your recommendations.

@liurenjie1024
Copy link
Contributor

But, do you mean that this procedure fits in this repo, but belongs in a different crate? Or that it doesn’t fit this repository?

The procedure implementation relies on a distributed(or parallel) computing engine. You may argue expire snapshot doesn't, but when it comes others like compaction, rewrite data files, we do need a compute engine.

The reason I suggest to implement it in datafusion is that, as described in your issue, we will provide a working version for the community. Also it could serve as an example of integrating with the core library for other compute engines.

Implementing this in datafusion (say, as a UDF) does make sense, but I don’t think it should be the only way to call this. This procedure should exist with public rust apis for integration into systems that don’t depend on datafusion.

Ideally we could implement one, but it's difficult in rust. This involves designing an abstraction over different async runtime, memory management, etc. This is a complicated task which requires careful design, and collaboration from different compute engine community. For now, I don't see much value on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maintenance: Expire Snapshots Action
3 participants