Description / Background
Almost every time we run CompactionVeryLargeST, the compaction fails and we see a failure like this in the log: "sleeper_df/src/lib.rs:110 - merging error Arrow error: External error: External: Generic S3 error: HTTP error: request or response body error". There's no stack trace or any more information than that.
The system test also times out rather than failing as soon as the compaction fails. It seems the test never checks whether the tracker has updated the job status to failed; it keeps waiting for success even after the failure is reported. We could split that out as a separate problem.
Steps to reproduce
- Run CompactionVeryLargeST
- See that the test usually times out waiting for the compaction to finish
- Check the logs of the compaction task
- See the error shown under Screenshots/Logs below
Expected behaviour
The test should pass. The compaction should reliably finish successfully.
Technical Notes / Implementation Details
Looking at the source line quoted in the message, the error must be a DataFusionError. Based on the error message, it seems to be a DataFusionError wrapping an ArrowError, which in turn wraps a generic S3 error from the HTTP layer.
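One way to confirm that inference would be to walk the error's source chain looking for the underlying object_store error. A minimal sketch, assuming each layer (DataFusionError, ArrowError) preserves its cause via std::error::Error::source(); the function name is ours, not existing Sleeper code:

```rust
use std::error::Error;

/// Walk an error's source() chain looking for the underlying object_store
/// error. Illustrative only: assumes the intermediate errors populate source().
fn find_object_store_error(err: &(dyn Error + 'static)) -> Option<&object_store::Error> {
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        if let Some(store_err) = e.downcast_ref::<object_store::Error>() {
            return Some(store_err);
        }
        current = e.source();
    }
    None
}
```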
We have a couple of things to investigate:
- Does it still fail if ReadaheadStore is disabled?
- Is the output Parquet file that is apparently completed actually a valid Parquet file on S3? The automated tests tear down the instance afterwards, so we couldn't check this. (See the sketch after this list for one way to check manually.)
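If we keep an instance up after a failure, we could download the output file and try to read its footer. A minimal sketch using the parquet crate; the local path is hypothetical, assuming the file has already been downloaded (e.g. with the AWS CLI):

```rust
use std::fs::File;
use parquet::file::reader::{FileReader, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("/tmp/output.parquet")?;
    // SerializedFileReader::new validates the magic bytes and parses the footer,
    // so it fails fast if the upload left a truncated or corrupt file.
    let reader = SerializedFileReader::new(file)?;
    let meta = reader.metadata();
    println!("row groups: {}", meta.num_row_groups());
    println!("total rows: {}", meta.file_metadata().num_rows());
    Ok(())
}
```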
We wrap the object store with a ReadaheadStore here:
https://github.com/gchq/sleeper/blob/develop/rust/objectstore_ext/src/s3.rs#L185
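To test the first question above, we could make that wrapping conditional. A rough sketch, assuming a ReadaheadStore::new(inner) constructor (check the actual signature in readahead.rs) and using a hypothetical SLEEPER_DISABLE_READAHEAD environment variable that does not currently exist:

```rust
use std::sync::Arc;
use object_store::ObjectStore;
// Module path assumed; see objectstore_ext/src/readahead.rs for the real type.
use crate::readahead::ReadaheadStore;

/// Wrap the store with readahead unless a (hypothetical) env var disables it,
/// so we can rerun CompactionVeryLargeST without the readahead layer.
fn maybe_wrap_readahead(inner: Arc<dyn ObjectStore>) -> Arc<dyn ObjectStore> {
    if std::env::var("SLEEPER_DISABLE_READAHEAD").is_ok() {
        // Bypass the readahead layer: callers talk to the underlying store directly.
        inner
    } else {
        Arc::new(ReadaheadStore::new(inner))
    }
}
```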
We could also improve the logging to show exactly where the error is happening.
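For example, at the point the error is logged in sleeper_df/src/lib.rs we could print the Debug form and every layer of the cause chain, rather than just the top-level Display. A sketch only, not existing Sleeper code; it assumes DataFusionError's Error impl populates source():

```rust
use std::error::Error;
use datafusion::error::DataFusionError;
use log::error;

/// Log the merge failure with its full cause chain, to show which
/// underlying request actually failed.
fn log_merge_failure(err: &DataFusionError) {
    // Debug output often carries more structure than Display.
    error!("merging error (debug): {err:?}");
    let mut cause: Option<&(dyn Error + 'static)> = err.source();
    while let Some(c) = cause {
        error!("caused by: {c}");
        cause = c.source();
    }
}
```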
Screenshots/Logs
Logs from compaction task:
[INFO] objectstore_ext/src/store.rs:258 - DataFusion multipart PUT to s3://sleeper-<instance-id>-table-data/<table-id>/data/partition_root/<job-id>.parquet of 10,123,456 bytes
[INFO] objectstore_ext/src/store.rs:258 - DataFusion multipart PUT to s3://sleeper-<instance-id>-table-data/<table-id>/data/partition_root/<job-id>.parquet of 10,123,456 bytes
[INFO] objectstore_ext/src/store.rs:258 - DataFusion multipart PUT to s3://sleeper-<instance-id>-table-data/<table-id>/data/partition_root/<job-id>.parquet of 9,123,456 bytes
[INFO] objectstore_ext/src/store.rs:274 - multipart to s3://sleeper-<instance-id>-table-data/<table-id>/data/partition_root/<job-id>.parquet COMPLETE
[INFO] objectstore_ext/src/store.rs:104 - LoggingObjectStore "DataFusion" to "s3://sleeper-<instance-id>-table-data" made 3,123 GET requests and requested a total of 29,123,456,789 bytes (range bounded)
[INFO] objectstore_ext/src/readahead.rs:443 - ReadaheadStore made 200 GET requests to underlying location s3://sleeper-<instance-id>-table-data
[ERROR] sleeper_df/src/lib.rs:110 - merging error Arrow error: External error: External: Generic S3 error: HTTP error: request or response body error
[main] compaction.datafusion.DataFusionCompactionRunner ERROR - DataFusion compaction failed, return code: -1
[main] core.task.CompactionTask ERROR - Failed processing compaction job, putting job back on queue
java.io.IOException: DataFusion compaction failed with return code -1
at sleeper.compaction.datafusion.DataFusionCompactionRunner.invokeDataFusion(DataFusionCompactionRunner.java:192)
at sleeper.compaction.datafusion.DataFusionCompactionRunner.compact(DataFusionCompactionRunner.java:82)
at sleeper.compaction.core.task.CompactionTask.compact(CompactionTask.java:230)
at sleeper.compaction.core.task.CompactionTask.processCompactionMessage(CompactionTask.java:180)
at sleeper.compaction.core.task.CompactionTask.handleMessages(CompactionTask.java:147)
at sleeper.compaction.core.task.CompactionTask.run(CompactionTask.java:123)
at sleeper.compaction.job.execution.ECSCompactionTaskRunner.main(ECSCompactionTaskRunner.java:120)
[main] job.execution.SqsCompactionQueueHandler INFO - Compaction job <job-id>: Returning message to queue
[main] tracker.job.DynamoDBCompactionJobTracker DEBUG - Added failed for job <job-id>, capacity consumed = 1.0, took 0.123 seconds
[main] tracker.job.DynamoDBCompactionJobTracker DEBUG - Updated status for job <job-id>, capacity consumed = 1.0, took 0.123 seconds
[main] job.execution.SqsCompactionQueueHandler INFO - Compaction job <job-id>: Stopping background thread to keep SQS messages alive