merge files fails with small or large files #778
Comments
You are quite right, and this is a limitation of the AWS API. We could implement something that downloads and re-writes smaller chunks. That is basically what open(..., mode="a") does when the existing file is smaller than the 5MB limit. It might get complicated, however, for something like:
would require downloading and reuploading the whole of 1MB + 5GB. We are probably better off not supporting this.
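(For concreteness, a sketch of the kind of merge call being discussed; the paths and sizes here are hypothetical, chosen to match the 1MB + 5GB case above.)

import s3fs

s3 = s3fs.S3FileSystem()

# hypothetical inputs: a ~1MB file followed by a ~5GB file
small = "mybucket/parts/one-mb.bin"
large = "mybucket/parts/five-gb.bin"

# merge() issues one upload_part_copy per input file, so the 1MB file would
# become a non-final part below the 5MiB minimum and the call fails
s3.merge("mybucket/parts/merged.bin", [small, large])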
That example isn't overly complicated, as file 1 can be split into 2 compliant chunks. For this, it wouldn't need to fully download file 3, only 4MiB from the start. There is also a route that avoids downloading data: merge file 1 and file 2 into a temp file, then merge the temp file and file 3. I may have to do one of these for some work; I will report back if I develop something generally useful.
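To illustrate the splitting idea (a hypothetical helper, not existing s3fs code): a file larger than the 5GiB part maximum can be divided into evenly sized, compliant byte ranges that are then copied server-side with CopySourceRange.

import math

PART_MIN = 5 * 2**20   # 5 MiB: minimum size for every part except the last
PART_MAX = 5 * 2**30   # 5 GiB: maximum size for any part

def split_into_parts(size):
    # split `size` bytes into as few ranges as possible, none above PART_MAX;
    # splitting evenly keeps every range well above PART_MIN for large files
    n = max(1, math.ceil(size / PART_MAX))
    chunk = math.ceil(size / n)
    return [(start, min(start + chunk, size)) for start in range(0, size, chunk)]

# e.g. a 6 GiB file becomes two ~3 GiB ranges; inputs below PART_MIN still
# need to be downloaded or staged in a temp object, as discussed above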
I believe the whole point of the limitations in S3 merge is that the remote service will not split chunks at all. So in your example, you can read 4MB of file3, but you cannot copy the remaining (5GB - 4MB) server-side; you must copy the whole chunk or not at all.
I don't think it can; you cannot split any chunk.
If that were the case, then _copy_managed would have plenty of issues with files written with unusual part sizes, but it doesn't. A little experimentation tells me this isn't an issue as long as the parts are within the API size limits.

out_path = f"{test_bucket_name}/test/mpu_parts.tif"

# specify some files and byte ranges wanted from those files
# these are not part aligned in the source files
copy_parts = [
    (a, 5, (10 * 2**20) + 5),
    (b, 0, 100),
]

bucket, key, _ = s3.split_path(out_path)
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

mpu_parts = []
for part_number, (source_path, brange_first, brange_last) in enumerate(copy_parts, start=1):
    # brange_last - 1 to match standard python slice semantics
    part_mpu = s3.call_s3(
        "upload_part_copy",
        Bucket=mpu["Bucket"],
        Key=mpu["Key"],
        PartNumber=part_number,
        UploadId=mpu["UploadId"],
        CopySource=source_path,
        CopySourceRange=f"bytes={brange_first}-{brange_last - 1}",
    )
    mpu_parts.append({"PartNumber": part_number, "ETag": part_mpu["CopyPartResult"]["ETag"]})

# stitch the copied ranges together server-side
fin = s3.call_s3(
    "complete_multipart_upload",
    Bucket=bucket,
    Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": mpu_parts},
)

assert s3.info(out_path)["size"] == (10 * 2**20) + 100

As to AWS's intent, I suspect it is because parts are not merged internally on S3, and storing a small file for each part would be inefficient. There is supposedly some performance benefit from making requests that align to part boundaries, but I've never tested this.
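If this pattern were wrapped into a reusable helper (my own suggestion, not something the quick test above needs), it would probably also want to abort the upload on failure, since abandoned multipart uploads keep their parts around; a minimal sketch reusing the names from the snippet above:

try:
    ...  # the upload_part_copy loop from above
except Exception:
    # orphaned parts are retained (and billed) until the upload is aborted
    s3.call_s3("abort_multipart_upload", Bucket=bucket, Key=key, UploadId=mpu["UploadId"])
    raise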
The restriction on the minimum chunk size does not apply to the very last chunk of the output file, however. The following should fail:
However, I now see where you are coming from with regard to
@artttt, do you think you would like to work towards a solution?
Yes, I would like to.
ping - this got lost
@martindurant apologies for the delay. This didn't get lost; I've been unwell. I've done some exploring towards a solution, and it seems to work well. As you identified, the logic gets a little complex, but I think I've managed to handle most of the edge cases I could identify. It's not very pretty at this stage, so I've dropped it into a gist rather than a PR for the moment: https://gist.github.com/artttt/a5e9a080936615d9519104db85d03ecb
Thanks for having a go.
Do you think you will have the chance to clean up that code and contribute a PR?
Hi Martin,
Sorry if there is already an issue for this; the word "merge" is difficult to search for due to its use in git.
The merge function throws an error if any parts are outside the valid part size range for an S3 multipart upload.
Large input files can become multiple parts using the CopySourceRange parameter of upload_part_copy.
Smaller parts will either need at least 5MiB of data to be downloaded or temporary files on S3.
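As an illustration of the small-part case (hypothetical paths and helper code, not the failing tests mentioned below, and reusing the s3 object and call_s3 helper shown in the earlier snippet): the undersized input can be downloaded together with the first few MiB of the next file and uploaded as a regular part, with the remainder of the larger file copied server-side.

PART_MIN = 5 * 2**20

small = "mybucket/tmp/one-mb.bin"    # hypothetical ~1MB input
big = "mybucket/tmp/five-gb.bin"     # hypothetical ~5GB input
out = "mybucket/tmp/merged.bin"

bucket, key, _ = s3.split_path(out)
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

# download the small file plus just enough of the next file to reach 5MiB,
# and upload that as an ordinary (non-copy) part
borrow = PART_MIN - s3.info(small)["size"]
body = s3.cat(small) + s3.cat_file(big, start=0, end=borrow)
s3.call_s3("upload_part", Bucket=bucket, Key=key, PartNumber=1,
           UploadId=mpu["UploadId"], Body=body)

# the rest of `big` (from byte `borrow` onwards) can then be copied with
# upload_part_copy and CopySourceRange, split into <=5GiB ranges as needed
# (part ETags would still need to be collected for complete_multipart_upload)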
Here are some failing tests