
Sync to AWS #368

Open
djmitche opened this issue Dec 26, 2023 · 9 comments

Comments

@djmitche
Collaborator

Similar to the GCP sync implemented in GothenburgBitFactory/taskwarrior#3185, we should be able to sync replicas to AWS's object storage.

The tricky bit here is that, unlike GCS and Azure Blob, S3 does not provide a compare-and-swap operation.

Some reading suggests that the easiest way to accomplish this would be to use DynamoDB as a lock over the "latest" object in the S3 bucket.
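The DynamoDB-as-lock idea could be sketched as follows. This is a hypothetical in-memory model of DynamoDB's conditional `PutItem` semantics (the `MockTable` type and its methods are illustrative, not a real SDK API); a real client would set a `ConditionExpression` such as `attribute_not_exists(pk)` or an equality check on the stored value:

```rust
use std::collections::HashMap;

/// In-memory stand-in for a DynamoDB table (illustrative, not a real SDK
/// type). A real client would call `PutItem` with a `ConditionExpression`;
/// this mock only models those conditional-write semantics.
struct MockTable {
    items: HashMap<String, String>,
}

impl MockTable {
    fn new() -> Self {
        Self { items: HashMap::new() }
    }

    /// Write `value` under `key` only if the current value equals `expected`
    /// (`None` meaning "no item exists"), like a conditional `PutItem` with
    /// `attribute_not_exists(...)` or an equality condition. Returns true if
    /// the write happened.
    fn conditional_put(&mut self, key: &str, expected: Option<&str>, value: &str) -> bool {
        if self.items.get(key).map(String::as_str) == expected {
            self.items.insert(key.to_string(), value.to_string());
            true
        } else {
            false
        }
    }
}
```

A replica would advance the "latest" pointer only through such a conditional write; a concurrent writer's attempt fails the condition and forces it to fetch the newer state before retrying.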

@dathanb
Sponsor Contributor

dathanb commented Dec 27, 2023

For the sake of simplicity, would it make more sense to use DynamoDB as the only store? Unless we're going to run up against the 400KB-per-item size limit for dynamo, it seems compelling to only have to configure a single cloud resource instead of two.

@djmitche
Collaborator Author

I think that limit would be a problem, yes.

@dathanb
Sponsor Contributor

dathanb commented Dec 27, 2023

OK, got it. I haven't checked out the actual objects that get synced. I'll look at your PR for GCP to better understand what all gets sent.

@djmitche
Collaborator Author

Great!

You can find more info on the sync protocol here. Ignore the HTTP bits for the cloud-storage case, but the rest still applies.

There's really no size limit on versions -- if a user is putting big chunks of text into their tasks, such as annotations, they might get quite large. Similarly, if they do not sync very often, an accumulation of small changes might get large. The former case doesn't really permit any technical solution -- nothing prevents a user from putting a 500KB annotation on a task, and that would need to be a single operation and thus included in a single version. Snapshots, too, have no size limit, and are proportional to the total number of tasks (of all statuses) a user has. Most users probably have relatively small task sets, but I'm sure there are people out there with thousands of tasks or even more.

So, I think we need to take advantage of S3's more-or-less unlimited size. The alternative would be to store versions and snapshots in multiple DynamoDB items, but that seems difficult and would introduce some new failure modes.

@djmitche djmitche transferred this issue from GothenburgBitFactory/taskwarrior Apr 21, 2024
@dathanb
Sponsor Contributor

dathanb commented Sep 9, 2024

As of August 20, AWS S3 now supports conditional writes, so I think this should be doable with bare S3 without needing additional coordination via Dynamo.

@djmitche
Collaborator Author

djmitche commented Sep 9, 2024

Wow, new features in an 18-year-old product!

It looks like this only supports checking whether a file exists, not whether it has been changed, so it might require some additional work to figure out how to get the update-if-not-changed behavior we need.

@dathanb
Sponsor Contributor

dathanb commented Sep 9, 2024

I haven't checked the sync protocol -- does it support any sort of pessimistic locking around the sync operation? If so, the AWS integration code could identify an object key that acts as a semaphore, and sync will only succeed if the client can acquire the semaphore / do a conditional put to that object. (Also, as part of the initial setup, a bucket would need to be configured for the semaphore object to live in, and the bucket policy should have a TTL set so the object will automatically expire after some amount of time, to recover from the case where a client fails to release the semaphore / fails to delete the object.)
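The expiring-semaphore idea could be modeled like this. It's a hypothetical in-memory sketch (`S3Lock`, `try_acquire`, and `release` are illustrative names, not a real API), with a simulated clock standing in for S3's lifecycle expiration:

```rust
use std::time::{Duration, Instant};

/// Stand-in for the semaphore object: `try_acquire` models a conditional PUT
/// with `If-None-Match: *` (succeeds only if the lock object is absent or has
/// expired), and `release` models deleting the object. All names here are
/// illustrative; the clock is passed in rather than driven by lifecycle rules.
struct S3Lock {
    held_since: Option<Instant>,
    ttl: Duration,
}

impl S3Lock {
    fn new(ttl: Duration) -> Self {
        Self { held_since: None, ttl }
    }

    /// Attempt to create the lock object; fails while an unexpired lock exists.
    fn try_acquire(&mut self, now: Instant) -> bool {
        match self.held_since {
            Some(t) if now.duration_since(t) < self.ttl => false, // still held
            _ => {
                self.held_since = Some(now);
                true
            }
        }
    }

    /// Delete the lock object on clean completion of the sync.
    fn release(&mut self) {
        self.held_since = None;
    }
}
```

One caveat worth noting: S3 lifecycle expiration operates at day granularity and runs asynchronously, so the effective TTL for a leaked lock would be long; a client might also want to check the lock object's age itself and treat a sufficiently old lock as stale.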

@djmitche
Collaborator Author

djmitche commented Sep 9, 2024

That could work! The only bit that needs synchronization is latest, in a kind of compare-and-swap fashion.

```rust
/// Compare the existing object's value with `existing_value`, and replace with `new_value`
/// only if the values match. Returns true if the replacement occurred.
fn compare_and_swap(
    &mut self,
    name: &[u8],
    existing_value: Option<Vec<u8>>,
    new_value: Vec<u8>,
) -> Result<bool>;
```

The locking could occur around that operation.
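To make that concrete, here is a hypothetical sketch of `compare_and_swap` built on pessimistic locking: an in-memory map stands in for the S3 bucket, a boolean stands in for the conditional-PUT semaphore, and the error type is simplified to `String`. None of these names come from the actual taskchampion code:

```rust
use std::collections::HashMap;

/// In-memory stand-in for the S3 bucket; `lock`/`unlock` model the
/// conditional-PUT semaphore discussed above. Illustrative only.
struct Store {
    objects: HashMap<Vec<u8>, Vec<u8>>,
    locked: bool,
}

impl Store {
    fn new() -> Self {
        Self { objects: HashMap::new(), locked: false }
    }

    /// Models a conditional PUT of the lock object; fails if already held.
    fn lock(&mut self) -> bool {
        if self.locked {
            false
        } else {
            self.locked = true;
            true
        }
    }

    /// Models deleting the lock object.
    fn unlock(&mut self) {
        self.locked = false;
    }

    /// Compare-and-swap on `name`, implemented with pessimistic locking:
    /// acquire the lock, read, compare, conditionally write, release.
    fn compare_and_swap(
        &mut self,
        name: &[u8],
        existing_value: Option<Vec<u8>>,
        new_value: Vec<u8>,
    ) -> Result<bool, String> {
        if !self.lock() {
            return Err("lock contended".to_string());
        }
        let matched = self.objects.get(name) == existing_value.as_ref();
        if matched {
            self.objects.insert(name.to_vec(), new_value);
        }
        self.unlock();
        Ok(matched)
    }
}
```

While the lock is held, plain GET and PUT on the "latest" object are safe, so the whole read-compare-write sequence behaves atomically with respect to other well-behaved clients.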

@dathanb
Sponsor Contributor

dathanb commented Sep 9, 2024

Oh yeah, we can totally use pessimistic locking to implement compare-and-swap. 👍 Lemme see if I can carve out some time to work on that.
