Snapshot delete failing to delete snapshots for very large buckets #850

bryanlb · 2024-04-09T00:00:40Z

Describe the bug

On what appear to be very large buckets, snapshot deletion has stopped working correctly:

{"@timestamp":"2024-04-08T20:16:29.384Z","log.level":"INFO","log.message":"Completed snapshot deletion - successfully deleted 35 snapshots, failed to delete 853 snapshots in 900172 ms","process.thread.name":"SnapshotDeletionService RUNNING","log.logger":"com.slack.astra.clusterManager.SnapshotDeletionService"}

It appears we're hitting the scheduled task timeout (15 minutes / 900,000 ms) after deleting only 35 snapshots in that time.

{"@timestamp":"2024-04-08T20:16:29.319Z","log.level":"ERROR","log.message":"Exception deleting snapshot","process.thread.name":"snapshot-deletion-service-0","log.logger":"com.slack.astra.clusterManager.SnapshotDeletionService","error_type":"java.io.IOException","error_message":"java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException","error_stack_trace":"java.io.IOException: java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException
at com.slack.astra.blobfs.s3.S3CrtBlobFs.isDirectory(S3CrtBlobFs.java:604)
at com.slack.astra.blobfs.s3.S3CrtBlobFs.exists(S3CrtBlobFs.java:413)
at com.slack.astra.clusterManager.SnapshotDeletionService.lambda$deleteExpiredSnapshotsWithoutReplicas$5(SnapshotDeletionService.java:201)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
at com.slack.astra.blobfs.s3.S3CrtBlobFs.isDirectory(S3CrtBlobFs.java:598)\n\t... 8 more\nCaused by: software.amazon.awssdk.core.exception.SdkClientException
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:219)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.invoke(BaseAsyncClientHandler.java:288)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.doExecute(BaseAsyncClientHandler.java:227)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.lambda$execute$1(BaseA...","error_root_cause_class_name":"software.amazon.awssdk.core.exception.SdkInterruptedException","error_root_cause_message":null,"error_root_cause_stack_trace":"software.amazon.awssdk.core.exception.SdkInterruptedException
at software.amazon.awssdk.core.internal.http.InterruptMonitor.checkInterrupted(InterruptMonitor.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:43)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:29)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:215)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.invoke(BaseAsyncClientHandler..."}

bryanlb · 2024-04-09T00:01:58Z

Interestingly, all of the terminated calls appear to be stuck in the isDirectory logic. This may necessitate reworking that logic so that we either don't need to check whether the path is a directory first, or can make the directory check faster.

bryanlb · 2024-04-09T19:00:11Z

This appears to be an issue with the current blobfs design, which relies heavily on the assumption that the object store provides directory-style functionality. This is a problem because S3 does not have a traditional notion of "folders", so "directory"- or "folder"-level operations get extremely slow as the number of stored files grows (due to many ListDirectory calls).
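As a rough illustration of the mismatch, here is a minimal in-memory sketch of a flat key space. `FlatKeySpace` and its methods are hypothetical names for this example only, not the actual `S3CrtBlobFs` code. It shows that a "directory" has no entry of its own and exists only as a shared key prefix, so a directory check turns into a prefix search instead of an exact-key lookup.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical in-memory model of a flat object key space (not Astra's blobfs). */
public class FlatKeySpace {
  // Keys are kept sorted, similar to how an object store lists keys in lexicographic order.
  private final NavigableSet<String> keys = new TreeSet<>();

  public void put(String key) {
    keys.add(key);
  }

  /** Exact-key lookup: one key in, one answer out (comparable to a HEAD on a single object). */
  public boolean objectExists(String key) {
    return keys.contains(key);
  }

  /**
   * "Directory" check: no directory object is stored, so the only way to answer
   * is to search for any key that starts with the prefix (comparable to a list-by-prefix call).
   */
  public boolean isDirectory(String path) {
    String prefix = path.endsWith("/") ? path : path + "/";
    // The smallest key >= prefix is the first key under the prefix, if any exists.
    String candidate = keys.ceiling(prefix);
    return candidate != null && candidate.startsWith(prefix);
  }
}
```

With only the key `snapshots/abc/segment_1` stored, `isDirectory("snapshots/abc")` is true even though nothing named `snapshots/abc` was ever written, while `isDirectory("snapshots/abc/segment_1")` is false.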

Amazon does recommend using a secondary index if you are attempting to perform operations like this.

The recommended path is probably to replace the existing blobfs design with one that doesn't rely on directory discovery and instead stores the specific file assets. This will also require changing SnapshotMetadata to store the exact list of files.
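A minimal sketch of what deleting from an explicit file list could look like, assuming the snapshot metadata already carries the exact object keys. `ExplicitKeyDeletion` and `deleteBatches` are hypothetical names, and the actual delete call is left out. The only S3-specific detail used here is that a single DeleteObjects request accepts at most 1000 keys, so the keys are grouped into batches of that size and no listing or directory check is needed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: plans bulk deletes from a known list of object keys. */
public class ExplicitKeyDeletion {
  /** S3's DeleteObjects request accepts at most 1000 keys per call. */
  public static final int MAX_KEYS_PER_DELETE = 1000;

  /**
   * Splits the snapshot's object keys into batches, one per bulk delete request.
   * The keys are assumed to come from the snapshot metadata, not from listing the bucket.
   */
  public static List<List<String>> deleteBatches(List<String> objectKeys) {
    List<List<String>> batches = new ArrayList<>();
    for (int start = 0; start < objectKeys.size(); start += MAX_KEYS_PER_DELETE) {
      int end = Math.min(start + MAX_KEYS_PER_DELETE, objectKeys.size());
      // Copy the slice so each batch is independent of the caller's list.
      batches.add(List.copyOf(objectKeys.subList(start, end)));
    }
    return batches;
  }
}
```

Each returned batch would map to one bulk delete request, so the number of requests depends on the number of files in the snapshot and not on how many objects the bucket holds.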

bryanlb added the bug label Apr 9, 2024

bryanlb changed the title ~~Snapshot delet failing to delete snapshots for very large buckets~~ Snapshot delete failing to delete snapshots for very large buckets Apr 9, 2024
