Snapshot delete failing to delete snapshots for very large buckets #850

Open
bryanlb opened this issue Apr 9, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@bryanlb
Contributor

bryanlb commented Apr 9, 2024

Describe the bug

For what appear to be very large buckets, snapshot deletion has stopped working correctly:

{"@timestamp":"2024-04-08T20:16:29.384Z","log.level":"INFO","log.message":"Completed snapshot deletion - successfully deleted 35 snapshots, failed to delete 853 snapshots in 900172 ms","process.thread.name":"SnapshotDeletionService RUNNING","log.logger":"com.slack.astra.clusterManager.SnapshotDeletionService"}

It appears we're hitting the scheduled task timeout (15 minutes / 900,000 ms), but only deleting 35 snapshots within that time.
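To put the numbers from the log line above in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: the class and method names are made up for illustration, and because deletions appear to run on a worker pool, the figure is aggregate wall-clock throughput rather than the latency of any single call.

```java
public class DeletionThroughput {

    /** Average wall-clock milliseconds spent per successfully deleted snapshot. */
    static double avgMillisPerDeletion(long elapsedMillis, int deleted) {
        return (double) elapsedMillis / deleted;
    }

    /** Estimated hours needed to clear a backlog at the observed average rate. */
    static double hoursToClear(int backlog, long elapsedMillis, int deleted) {
        return backlog * avgMillisPerDeletion(elapsedMillis, deleted) / 3_600_000.0;
    }

    public static void main(String[] args) {
        long elapsedMillis = 900_172; // "in 900172 ms" from the log line
        int deleted = 35;             // "successfully deleted 35 snapshots"
        int failed = 853;             // "failed to delete 853 snapshots"
        int backlog = deleted + failed;

        System.out.printf("avg %.1f s per deleted snapshot%n",
                avgMillisPerDeletion(elapsedMillis, deleted) / 1000.0);
        System.out.printf("about %.1f h to clear %d snapshots at this rate%n",
                hoursToClear(backlog, elapsedMillis, deleted), backlog);
    }
}
```

That works out to roughly 25.7 seconds per deleted snapshot, or a little over six hours to work through all 888 snapshots at the same rate, far beyond the 15 minute window.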

{"@timestamp":"2024-04-08T20:16:29.319Z","log.level":"ERROR","log.message":"Exception deleting snapshot","process.thread.name":"snapshot-deletion-service-0","log.logger":"com.slack.astra.clusterManager.SnapshotDeletionService","error_type":"java.io.IOException","error_message":"java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException","error_stack_trace":"java.io.IOException: java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException
at com.slack.astra.blobfs.s3.S3CrtBlobFs.isDirectory(S3CrtBlobFs.java:604)
at com.slack.astra.blobfs.s3.S3CrtBlobFs.exists(S3CrtBlobFs.java:413)
at com.slack.astra.clusterManager.SnapshotDeletionService.lambda$deleteExpiredSnapshotsWithoutReplicas$5(SnapshotDeletionService.java:201)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
at com.slack.astra.blobfs.s3.S3CrtBlobFs.isDirectory(S3CrtBlobFs.java:598)\n\t... 8 more\nCaused by: software.amazon.awssdk.core.exception.SdkClientException
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:219)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.invoke(BaseAsyncClientHandler.java:288)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.doExecute(BaseAsyncClientHandler.java:227)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.lambda$execute$1(BaseA...","error_root_cause_class_name":"software.amazon.awssdk.core.exception.SdkInterruptedException","error_root_cause_message":null,"error_root_cause_stack_trace":"software.amazon.awssdk.core.exception.SdkInterruptedException
at software.amazon.awssdk.core.internal.http.InterruptMonitor.checkInterrupted(InterruptMonitor.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:43)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:29)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:215)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.invoke(BaseAsyncClientHandler..."}
@bryanlb bryanlb added the bug Something isn't working label Apr 9, 2024
@bryanlb bryanlb changed the title from "Snapshot delet failing to delete snapshots for very large buckets" to "Snapshot delete failing to delete snapshots for very large buckets" Apr 9, 2024
@bryanlb
Contributor Author

bryanlb commented Apr 9, 2024

Interestingly, all of the terminated calls appear to be stuck in the isDirectory logic. This may necessitate reworking that logic so that we either don't need to check whether the path is a directory first, or can make the directory check faster.

@bryanlb
Contributor Author

bryanlb commented Apr 9, 2024

This appears to be an issue with the current blobfs design, which relies heavily on the assumption that the object store behaves like a directory tree. That is a problem because S3 does not have a traditional understanding of "folders", so "directory" or "folder" level operations get extremely slow as the number of stored files grows (due to many ListDirectory calls).

Amazon does recommend using a secondary index if you are attempting to perform operations like this.

Our recommended path is probably to replace the existing blobfs design with one that doesn't rely on directory discovery and instead stores the specific file assets. This will also require changing SnapshotMetadata to store the exact list of files.
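A minimal sketch of what deleting from an explicit file list could look like. Everything here is hypothetical and for illustration only: `ObjectStore`, `InMemoryStore`, `SnapshotManifest`, and the example keys are not part of the existing Astra blobfs or SnapshotMetadata code, and the in-memory store merely stands in for a bucket so the example runs without AWS credentials.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ManifestDeleteSketch {

    /** Hypothetical minimal object-store interface; not the Astra blobfs API. */
    interface ObjectStore {
        void deleteObject(String key);
    }

    /** In-memory stand-in for a bucket with a flat key namespace. */
    static final class InMemoryStore implements ObjectStore {
        final Set<String> keys = new HashSet<>();
        int deleteCalls = 0;

        @Override
        public void deleteObject(String key) {
            deleteCalls++;
            keys.remove(key); // removing a missing key is a no-op
        }
    }

    /** Hypothetical snapshot record that carries its exact object keys. */
    static final class SnapshotManifest {
        final String snapshotId;
        final List<String> objectKeys;

        SnapshotManifest(String snapshotId, List<String> objectKeys) {
            this.snapshotId = snapshotId;
            this.objectKeys = List.copyOf(objectKeys);
        }
    }

    /**
     * Deletes exactly the keys named in the manifest. There is no exists(),
     * isDirectory(), or prefix listing step, so the work scales with the size
     * of the snapshot rather than with the size of the bucket.
     */
    static int deleteSnapshot(ObjectStore store, SnapshotManifest manifest) {
        int requested = 0;
        for (String key : manifest.objectKeys) {
            store.deleteObject(key);
            requested++;
        }
        return requested;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.keys.addAll(List.of("snap-a/segment_1", "snap-a/segment_2", "snap-b/segment_1"));

        SnapshotManifest manifest =
                new SnapshotManifest("snap-a", List.of("snap-a/segment_1", "snap-a/segment_2"));

        int requested = deleteSnapshot(store, manifest);
        System.out.println("delete requests issued: " + requested);
        System.out.println("keys remaining: " + store.keys);
    }
}
```

The trade-off in this approach is that the manifest has to be written when the snapshot is uploaded and kept in sync with what is actually stored; a real implementation would likely also batch the delete requests and handle partial failures.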
