Improve error handling on upload failures #216

Open
stmcginnis opened this issue Dec 17, 2022 · 3 comments

Comments

@stmcginnis
Contributor

Coldsnap has built-in retries with increasing backoff delays when uploading blocks. Because nothing is printed while the retries are happening, it can be hard to tell what is going on during this time.

https://github.com/awslabs/coldsnap/blob/develop/src/upload.rs#L171-L188

It might be useful to add a --verbose flag to the command to give a little more insight into what is going on, or to emit a warning by default whenever a retry happens.

The number of retries also seems a little too high. SNAPSHOT_BLOCK_ATTEMPTS is currently set to 12. It seems likely that if the upload does not succeed after 3-5 attempts, it's not going to.

It would also be good if coldsnap recognized failures that are not transient and therefore not worth retrying. Errors like AccessDeniedException, which @grosser encountered in bottlerocket-os/bottlerocket#2667, should fail immediately:

Failed to put block 1551 for snapshot 'snap-0f48e9c316f6fa504': TransientError: connection closed before message completed
Failed to put block 1552 for snapshot 'snap-0f48e9c316f6fa504': AccessDeniedException: User: arn:aws:sts::589470546123:assumed-role/compute-arf/[email protected] is not authorized to perform: ebs:PutSnapshotBlock on resource: arn:aws:ec2:us-west-2::snapshot/snap-0f48e9c316f6fa504 because no identity-based policy allows the ebs:PutSnapshotBlock action
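
To make the suggestion concrete, here is a minimal sketch of a retry loop that warns on each retry and bails out immediately on non-transient errors. It is not coldsnap's actual code: `UploadError`, `classify`, `put_block_with_retries`, and the `NON_RETRYABLE_CODES` list are hypothetical names for illustration, and it uses a blocking sleep instead of the async one in upload.rs so it stays self-contained.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type standing in for whatever PutSnapshotBlock returns.
#[derive(Debug, Clone, PartialEq)]
enum UploadError {
    /// Possibly temporary failure; worth another attempt.
    Transient(String),
    /// Failure that will not go away on retry (for example, missing permissions).
    Permanent { code: String, message: String },
}

/// Error codes treated as non-retryable in this sketch (an assumed list).
const NON_RETRYABLE_CODES: &[&str] = &["AccessDeniedException", "ResourceNotFoundException"];

/// Sort an error code into permanent or transient.
fn classify(code: &str, message: &str) -> UploadError {
    if NON_RETRYABLE_CODES.contains(&code) {
        UploadError::Permanent {
            code: code.to_string(),
            message: message.to_string(),
        }
    } else {
        UploadError::Transient(format!("{code}: {message}"))
    }
}

/// Call `put_block` up to `max_attempts` times, sleeping `retry_scale * attempt`
/// between failed attempts. Returns the attempt number that succeeded.
fn put_block_with_retries<F>(
    max_attempts: u32,
    retry_scale: Duration,
    mut put_block: F,
) -> Result<u32, UploadError>
where
    F: FnMut(u32) -> Result<(), UploadError>,
{
    let mut last_err = None;
    for attempt in 1..=max_attempts {
        match put_block(attempt) {
            Ok(()) => return Ok(attempt),
            // Permanent errors are returned right away, without any backoff.
            Err(err @ UploadError::Permanent { .. }) => return Err(err),
            Err(err) => {
                // Make the retry visible instead of waiting silently.
                eprintln!("warning: attempt {attempt}/{max_attempts} failed, retrying: {err:?}");
                last_err = Some(err);
                if attempt < max_attempts {
                    thread::sleep(retry_scale * attempt);
                }
            }
        }
    }
    Err(last_err.unwrap_or_else(|| UploadError::Transient("no attempts were made".to_string())))
}
```

With something along these lines, the AccessDeniedException in the log above would surface after the first attempt, while the "connection closed before message completed" error would still be retried with a visible warning.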
@grosser

grosser commented Dec 17, 2022

Background: Currently, a permission failure waits for >1.5h before showing any sign of which permission was missing. Shortening that interval, and possibly showing a warning early, would be nice.

@stmcginnis
Contributor Author

Also related: #167

@stmcginnis
Contributor Author

Reading up on the calls used here, the aws-sdk-rust client is also doing retries internally.

From the docs:

The AWS SDKs implement automatic retry logic for requests that return error responses. You can configure the retry settings for the AWS SDKs. For more information, refer to the documentation for the SDK that you are using.

So inside the loop where coldsnap performs retries with a backoff, the calls to PutSnapshotBlock are also retrying internally. This could explain why, even though the code has time::sleep(Duration::from_secs(attempt * SNAPSHOT_BLOCK_RETRY_SCALE)) with a retry scale of 2 and max attempts of 12, it took much longer than that to eventually time out.
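
As a rough check on that arithmetic, the sketch below adds up the sleeps coldsnap's own loop would perform for a single block. It assumes `attempt` runs from 1 through 12 and that a sleep follows every failed attempt; the exact loop bounds and the constant types in upload.rs may differ, and `total_backoff` is a hypothetical helper.

```rust
use std::time::Duration;

// Values quoted in this issue; the u64 types are an assumption.
const SNAPSHOT_BLOCK_ATTEMPTS: u64 = 12;
const SNAPSHOT_BLOCK_RETRY_SCALE: u64 = 2;

/// Total time spent sleeping if every attempt fails,
/// with a delay of `attempt * scale` seconds after each one.
fn total_backoff(attempts: u64, scale: u64) -> Duration {
    (1..=attempts)
        .map(|attempt| Duration::from_secs(attempt * scale))
        .sum()
}
```

Under those assumptions, `total_backoff(SNAPSHOT_BLOCK_ATTEMPTS, SNAPSHOT_BLOCK_RETRY_SCALE)` comes to 156 seconds, or about two and a half minutes. That is far short of the more than 1.5 hours reported above, so most of the wait has to come from somewhere other than coldsnap's own sleeps, which is consistent with the SDK retrying inside each attempt.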

In addition to recognizing non-retryable errors, coldsnap should also configure the EBS client to limit the number of internal retries the SDK performs:

https://github.com/awslabs/aws-sdk-rust/blob/6f91cf455d1d7987b3ed2283ba29a07285957ac3/examples/sdk-config/src/bin/set_retries.rs#L80
