[v2] Validate full object checksum on multipart downloads#10180
[v2] Validate full object checksum on multipart downloads#10180
Conversation
AndrewAsseily
left a comment
There was a problem hiding this comment.
Nice PR! I manually tested all 4 scenarios from the description with a 20MB file with multipart and single GET for each. Everything checks out. Left a few inline comments.
| ) | ||
| finalize_download_invoker.finalize() | ||
|
|
||
| def _finalize_download(self, pre_finalize_callbacks, finalize_callback): |
There was a problem hiding this comment.
Looks like when a part download fails or the transfer is cancelld, the counter still hits zero and _finalize_download runs. If some parts don't have their checksums registered with the combiner, combine_and_validate hits a KeyError on the missing part. Should we check self._transfer_coordinator.exception at the top and bail out early if the transfer already failed?
There was a problem hiding this comment.
As a secondary guard, _get_combined_bytes could validate that len(self._parts) == self._num_parts before iterating.
Current State
s3transferdoesn't perform any checksum calculation or validation. This is handled bybotocoreunder the following conditions (assuming checksum mode is enabled):botocorecalculates and validates the full object checksum.botocorecalculates and validates the part-level checksum for each part retrieved.There are 2 gaps here:
botocoreperforms no validation.botocoreperforms no validation. There's not much the client can do here. It would need to know the exact part sizes the object was originally uploaded with and then use those offsets to independently calculate the composite checksum.This PR is designed to address the first gap by having
s3transfercalculate and validate the full object checksum.Solution
Calculating and validating full object checksums is constrained to CRC-based algorithms because because CRC algorithms have a property where combining multiple part-level checksums produces the same final value as if the checksum was calculated in a single, serial stream. This is important because as the transfer manager downloads multiple parts in parallel, the part bodies won't be blocked from being released after write. If full object checksums had to be calculated in a single, serial stream, then the part bodies would be blocked waiting to update a single checksum object. We use CRT's CRC combine functions for this.
The design makes calculating part-level checksums the responsibility of
botocoreand combining part-level checksums into a single full object checksum the responsibility ofs3transfer.botocore'sStreamingChecksumBodyalready calculates the checksum as the body is read from stream, and returns it tos3transfer. However, if the object was downloaded without checksum mode enabled, then it won't be returned as aStreamingChecksumBodyobject. In this case,s3transferwill wrap the body intoStreamingChecksumBodyso the checksum is calculated. This prevents double-computation of checksums. One tradeoff here is that checking to see if the returned body has a checksum attribute creates some coupling betweenbotocoreands3transfer, but I couldn't think of a way to cleanly separate responsibilities without introducing any coupling.When
s3transferinitiates a multipart download, it decides from theHeadObjectresponse whether or not it should calculate the full object checksum. If yes, then it creates aFullObjectChecksumCombinerobject. As each part is downloaded, the part-level checksums are downloaded and stored in theFullObjectChecksumCombinerobject. Once all parts have been downloaded,FullObjectChecksumCombinercombines all the part-level checksums and validates the full object checksum against the stored value returned from the initialHeadObject.Manual Testing
Reviewer should also sanity check here.
s3transfercalculates part checksum since S3 doesn't return part-level checksums when object has full object checksum, sobotocoredoesn't do any calculation.s3transferalso combines part checksums and validates full object checksum.botocorecalculates and validates full object checksum.s3transfercalculates part checksum and validates full object checksum.botocorecalculates and validates full object checksum.botocorecalculates part checksum and validates part checksums. No full object checksum validation is performed.Current Statesection.s3transfercalculates part checksum and validates full object checksum.botocorecalculates and validates full object checksum.