For what appear to be very large buckets, snapshot deletion has stopped working correctly.
{"@timestamp":"2024-04-08T20:16:29.384Z","log.level":"INFO","log.message":"Completed snapshot deletion - successfully deleted 35 snapshots, failed to delete 853 snapshots in 900172 ms","process.thread.name":"SnapshotDeletionService RUNNING","log.logger":"com.slack.astra.clusterManager.SnapshotDeletionService"}
It appears we're hitting the scheduled task timeout (15 mins / 900,000 ms), with only 35 snapshots deleted within that time.
{"@timestamp":"2024-04-08T20:16:29.319Z","log.level":"ERROR","log.message":"Exception deleting snapshot","process.thread.name":"snapshot-deletion-service-0","log.logger":"com.slack.astra.clusterManager.SnapshotDeletionService","error_type":"java.io.IOException","error_message":"java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException","error_stack_trace":"java.io.IOException: java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException
at com.slack.astra.blobfs.s3.S3CrtBlobFs.isDirectory(S3CrtBlobFs.java:604)
at com.slack.astra.blobfs.s3.S3CrtBlobFs.exists(S3CrtBlobFs.java:413)
at com.slack.astra.clusterManager.SnapshotDeletionService.lambda$deleteExpiredSnapshotsWithoutReplicas$5(SnapshotDeletionService.java:201)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:131)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:75)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:82)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)\nCaused by: java.util.concurrent.ExecutionException: software.amazon.awssdk.core.exception.SdkClientException
at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:396)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2073)
at com.slack.astra.blobfs.s3.S3CrtBlobFs.isDirectory(S3CrtBlobFs.java:598)\n\t... 8 more\nCaused by: software.amazon.awssdk.core.exception.SdkClientException
at software.amazon.awssdk.core.exception.SdkClientException$BuilderImpl.build(SdkClientException.java:111)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:219)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.invoke(BaseAsyncClientHandler.java:288)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.doExecute(BaseAsyncClientHandler.java:227)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.lambda$execute$1(BaseA...","error_root_cause_class_name":"software.amazon.awssdk.core.exception.SdkInterruptedException","error_root_cause_message":null,"error_root_cause_stack_trace":"software.amazon.awssdk.core.exception.SdkInterruptedException
at software.amazon.awssdk.core.internal.http.InterruptMonitor.checkInterrupted(InterruptMonitor.java:40)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:43)
at software.amazon.awssdk.core.internal.http.pipeline.stages.ApplyTransactionIdStage.execute(ApplyTransactionIdStage.java:29)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.pipeline.RequestPipelineBuilder$ComposingRequestPipelineStage.execute(RequestPipelineBuilder.java:206)
at software.amazon.awssdk.core.internal.http.AmazonAsyncHttpClient$RequestExecutionBuilderImpl.execute(AmazonAsyncHttpClient.java:215)
at software.amazon.awssdk.core.internal.handler.BaseAsyncClientHandler.invoke(BaseAsyncClientHandler..."}
bryanlb changed the title from "Snapshot delet failing to delete snapshots for very large buckets" to "Snapshot delete failing to delete snapshots for very large buckets" on Apr 9, 2024
Interestingly, all of the terminated calls appear to be stuck in the isDirectory logic. This may necessitate reworking this logic so that we either don't need to check whether the path is a directory first, or can make the directory check faster.
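As a rough illustration of the first option (no up-front directory check), deletion can page through the keys under a snapshot's prefix and delete each page as it arrives, treating an empty first page as "nothing to delete". Everything named below (`PrefixStore`, `PrefixDeleter`, `InMemoryPrefixStore`) is hypothetical and is not the existing blobfs API; the `listPage` parameters are only loosely modelled on the prefix, start-after and max-keys options of S3's ListObjectsV2. This is a sketch under those assumptions, not a proposed patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical object-store abstraction (an assumption, not the blobfs interface). */
interface PrefixStore {
  /** Returns up to maxKeys keys that start with prefix and sort after startAfter (null = first page). */
  List<String> listPage(String prefix, String startAfter, int maxKeys);

  /** Deletes the given keys in a single request. */
  void deleteAll(List<String> keys);
}

final class PrefixDeleter {
  /** Deletes everything under prefix without a prior exists/isDirectory call; returns the number of keys deleted. */
  static int deletePrefix(PrefixStore store, String prefix, int pageSize) {
    int deleted = 0;
    String startAfter = null;
    while (true) {
      List<String> page = store.listPage(prefix, startAfter, pageSize);
      if (page.isEmpty()) {
        // An empty page doubles as the "does not exist" answer.
        return deleted;
      }
      store.deleteAll(page);
      deleted += page.size();
      startAfter = page.get(page.size() - 1);
    }
  }
}

/** In-memory stand-in used only to exercise the sketch. */
final class InMemoryPrefixStore implements PrefixStore {
  final TreeSet<String> keys = new TreeSet<>();
  int listCalls = 0;

  @Override
  public List<String> listPage(String prefix, String startAfter, int maxKeys) {
    listCalls++;
    List<String> page = new ArrayList<>();
    String from = (startAfter == null) ? prefix : startAfter;
    // Keys sharing a prefix are contiguous in sorted order, so stop at the first mismatch.
    for (String key : keys.tailSet(from, startAfter == null)) {
      if (!key.startsWith(prefix) || page.size() == maxKeys) {
        break;
      }
      page.add(key);
    }
    return page;
  }

  @Override
  public void deleteAll(List<String> batch) {
    keys.removeAll(batch);
  }
}
```

This still issues list calls, so it only removes the separate check; it does not remove the dependency on listing itself.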
This appears to be an issue with the current blobfs design, which relies heavily on the assumption that the object store offers directory-style functionality. This is a problem because S3 does not have a traditional concept of "folders", so "directory" or "folder" level operations get extremely slow as the number of stored files grows (due to many ListDirectory calls).
Amazon does recommend using a secondary index if you are attempting to perform operations like this.
Our recommended path is probably to replace the existing blobfs design with one that doesn't rely on directory discovery and instead stores the specific file assets. This will also require changing SnapshotMetadata to store the exact list of files.
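A minimal sketch of what manifest-driven deletion could look like, assuming the snapshot metadata carries the exact object keys. The types here (`ObjectStore`, `SnapshotManifest`, `ManifestSnapshotDeleter`, `InMemoryObjectStore`) are hypothetical names for illustration and do not reflect the current SnapshotMetadata or blobfs classes. The batch size of 1000 matches the per-request key limit of S3's DeleteObjects API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical object-store abstraction (an assumption, not the blobfs interface). */
interface ObjectStore {
  /** Deletes the given keys in a single request. */
  void deleteObjects(List<String> keys);
}

/** Hypothetical snapshot record that stores the exact object keys it owns. */
final class SnapshotManifest {
  final String snapshotId;
  final List<String> objectKeys;

  SnapshotManifest(String snapshotId, List<String> objectKeys) {
    this.snapshotId = snapshotId;
    this.objectKeys = new ArrayList<>(objectKeys);
  }
}

final class ManifestSnapshotDeleter {
  // S3's DeleteObjects API accepts at most 1000 keys per request.
  static final int MAX_KEYS_PER_REQUEST = 1000;

  private final ObjectStore store;

  ManifestSnapshotDeleter(ObjectStore store) {
    this.store = store;
  }

  /** Deletes every key in the manifest with no listing or directory check; returns the number of requests sent. */
  int delete(SnapshotManifest manifest) {
    List<String> keys = manifest.objectKeys;
    int requests = 0;
    for (int start = 0; start < keys.size(); start += MAX_KEYS_PER_REQUEST) {
      int end = Math.min(start + MAX_KEYS_PER_REQUEST, keys.size());
      store.deleteObjects(keys.subList(start, end));
      requests++;
    }
    return requests;
  }
}

/** In-memory stand-in used only to exercise the sketch. */
final class InMemoryObjectStore implements ObjectStore {
  final Set<String> keys = new HashSet<>();
  int deleteCalls = 0;

  @Override
  public void deleteObjects(List<String> batch) {
    deleteCalls++;
    batch.forEach(keys::remove);
  }
}
```

With this shape, the number of requests per snapshot depends only on how many files the snapshot owns, not on how many objects the bucket holds.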