merge files fails with small or large files #778

Open
artttt opened this issue Sep 4, 2023 · 12 comments

@artttt

artttt commented Sep 4, 2023

Sorry if there is already an issue for this; the word "merge" is difficult to search for due to its use in git.

The merge function throws an error if any part is outside the valid part size range for an S3 multipart upload.

Large input files could be split into multiple parts using the CopySourceRange parameter of upload_part_copy.
Smaller parts will need either at least 5 MiB of data to be downloaded, or temporary files on S3.

Here are some failing tests:

# a and b are test file paths, as in the s3fs test suite
def test_merge_small_part(s3):
    # a is only 4 MiB, below the 5 MiB minimum for a non-final upload_part_copy part
    with s3.open(a, "wb") as f:
        f.write(b"a" * 4 * 2**20)

    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (4 * 2**20) + (10 * 2**20)


def test_merge_large_part(s3):
    # a is 6 GiB, above the 5 GiB maximum part size for upload_part_copy
    with s3.open(a, "wb") as f:
        f.write(b"a" * 6 * 2**30)

    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (6 * 2**30) + (10 * 2**20)
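
For the oversized case, the copy can in principle be done server-side in pieces. A minimal sketch, assuming a hypothetical source object big_src larger than 5 GiB and the same s3 / test_bucket_name fixtures as the tests above (this is not the actual merge implementation):

# sketch only: split one oversized source into compliant server-side copy ranges
PART_MAX = 5 * 2**30  # 5 GiB upper bound per part

def compliant_ranges(size, part_max=PART_MAX):
    # yield inclusive (first, last) byte offsets, each span at most part_max bytes;
    # a fuller implementation would balance the split so that no non-final part
    # falls below the 5 MiB minimum
    start = 0
    while start < size:
        end = min(start + part_max, size) - 1
        yield start, end
        start = end + 1

bucket, key, _ = s3.split_path(test_bucket_name + "/joined_big")  # hypothetical output path
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)
parts = []
size = s3.info(big_src)["size"]
for part_number, (first, last) in enumerate(compliant_ranges(size), start=1):
    out = s3.call_s3(
        "upload_part_copy",
        Bucket=bucket,
        Key=key,
        PartNumber=part_number,
        UploadId=mpu["UploadId"],
        CopySource=big_src,
        CopySourceRange=f"bytes={first}-{last}",
    )
    parts.append({"PartNumber": part_number, "ETag": out["CopyPartResult"]["ETag"]})
s3.call_s3(
    "complete_multipart_upload",
    Bucket=bucket,
    Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)

The small-file case is harder, since any non-final part below 5 MiB has to be built from downloaded bytes.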
@martindurant
Member

You are quite right, and this is a limitation of the AWS API. We could implement something which downloads and re-writes smaller chunks. That is basically what open(..., mode="a") does when the existing file is smaller than the 5MB limit. It might get complicated, however, for something like:

file 1: 5GB chunk, 1MB chunk
file 2: 5GB chunk

This would require downloading and re-uploading the whole of the 1MB + 5GB. We are probably better off not supporting this.

@artttt
Author

artttt commented Sep 6, 2023

That example isn't overly complicated as file 1 can be split into 2 compliant chunks.
This one is a little trickier:

file 1: 5GB chunk
file 2: 1MB chunk
file 3: 5GB chunk

For this it wouldn't need to fully download file 3, only the 4 MiB from its start needed to pad file 2 up to the 5 MiB minimum.

There is also a route without downloading data by merging file 1 and file 2 to a temp file and then merging the temp file and file 3.

I may have to do one of these for some work. I will report back if I develop something generally useful.
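
A minimal sketch of the second, no-download route, assuming hypothetical paths f1, f2, f3 and out_path, plus a temporary key:

# sketch only: two-stage merge via a temporary object, no bytes downloaded
tmp = test_bucket_name + "/tmp_joined"  # hypothetical temporary path
s3.merge(tmp, [f1, f2])        # f2 is the final part here, so the 5 MiB minimum does not apply
s3.merge(out_path, [tmp, f3])  # tmp now exceeds the 5 GiB part maximum, so this step
                               # still needs the CopySourceRange splitting discussed above
s3.rm(tmp)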

@martindurant
Member

I believe the whole point of the limitations in S3 merge, is that the remote service will not split chunks at all. So in your example, you can read 4MB of file3, but you cannot copy the remaining (5GB - 4MB) server-side - you must copy the whole chunk or not at all.

That example isn't overly complicated as file 1 can be split into 2 compliant chunks.

I don't think it can; you cannot split any chunk.

@artttt
Author

artttt commented Sep 7, 2023

If that were the case, then _copy_managed would have plenty of issues with files written with unusual part sizes, but it doesn't.

A little experimentation tells me this isn't an issue as long as parts are within the API size limits:

out_path = f"{test_bucket_name}/test/mpu_parts.tif"

# specify some files and byte ranges wanted from those files
# these are not part aligned in the source files
copy_parts = [(a,5,(10 * 2**20)+5),
              (b,0,100),
            ]
bucket, key, _ = s3.split_path(out_path)

mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

out_path = f"{test_bucket_name}/test/mpu_parts.tif"
copy_parts = [(a,5,(10 * 2**20)+5),
              (b,0,100),
            ]


bucket, key, _ = s3.split_path(out_path)

mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

mpu_parts = []
for part_number,(source_path,brange_first, brange_last) in enumerate(copy_parts,start=1):
    # brange_last-1 to match standard python slice semantics
    part_mpu = s3.call_s3(
                    "upload_part_copy",
                    Bucket=mpu["Bucket"],
                    Key=mpu["Key"],
                    PartNumber=part_number,
                    UploadId=mpu["UploadId"],
                    CopySource=source_path,
                    CopySourceRange=f"bytes={brange_first}-{brange_last-1}",
                )
    mpu_parts.append({"PartNumber": part_number, "ETag": part_mpu["CopyPartResult"]["ETag"]})
        
    
fin = s3.call_s3(
            "complete_multipart_upload",
            Bucket=bucket,
            Key=key,
            UploadId=mpu["UploadId"],
            MultipartUpload={"Parts": mpu_parts},
        )
assert s3.info(out_path)["size"] == (10 * 2**20) + 100

As to the intent by AWS, I suspect it is because parts are not merged internally on S3, and storing a small file for each part is inefficient. There is supposedly some performance benefit from making requests that align to part boundaries, but I've never tested this.

@martindurant
Member

x-amz-copy-source-range
The range of bytes to copy from the source object. The range value must use the form bytes=first-last, where the first and last are the zero-based byte offsets to copy. For example, bytes=0-9 indicates that you want to copy the first 10 bytes of the source. You can copy a range only if the source object is greater than 5 MB.

The restriction on the minimum chunk size does not apply to the very last chunk of the output file, however. The following should fail:

copy_parts = [
    (a, 5, (10 * 2**20) + 5),
    (b, 0, 100),
    (a, 5, (10 * 2**20) + 5),
]

However, I now see where you are coming from with regard to CopySourceRange, and it should be possible to handle chunks below the 5MB minimum by download and everything else by remote copy, but the logic to do this seems a little complex.
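
A minimal sketch of that hybrid for the earlier file 1 / file 2 / file 3 example, assuming hypothetical paths f1 (5 GB), f2 (1 MB), f3 (5 GB) and an output out_path; only the ~4 MiB needed to pad f2 up to the minimum part size is downloaded:

# sketch only: f1 (5 GB), f2 (1 MB), f3 (5 GB) merged with one small download
MIN_PART = 5 * 2**20  # 5 MiB minimum for every part except the last

bucket, key, _ = s3.split_path(out_path)
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

# part 1: whole-object server-side copy of f1 (within the 5 MiB - 5 GiB limits)
p1 = s3.call_s3(
    "upload_part_copy", Bucket=bucket, Key=key, PartNumber=1,
    UploadId=mpu["UploadId"], CopySource=f1,
)

# part 2: f2 is too small on its own, so pad it with the first few MiB of f3,
# downloaded and re-uploaded as ordinary bytes
pad = MIN_PART - s3.info(f2)["size"]
data = s3.cat_file(f2) + s3.cat_file(f3, start=0, end=pad)
p2 = s3.call_s3(
    "upload_part", Bucket=bucket, Key=key, PartNumber=2,
    UploadId=mpu["UploadId"], Body=data,
)

# part 3: server-side ranged copy of the rest of f3
f3_size = s3.info(f3)["size"]
p3 = s3.call_s3(
    "upload_part_copy", Bucket=bucket, Key=key, PartNumber=3,
    UploadId=mpu["UploadId"], CopySource=f3,
    CopySourceRange=f"bytes={pad}-{f3_size - 1}",
)

s3.call_s3(
    "complete_multipart_upload", Bucket=bucket, Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": [
        {"PartNumber": 1, "ETag": p1["CopyPartResult"]["ETag"]},
        {"PartNumber": 2, "ETag": p2["ETag"]},
        {"PartNumber": 3, "ETag": p3["CopyPartResult"]["ETag"]},
    ]},
)

Generalising this (deciding which byte ranges to download and which to copy server-side for an arbitrary list of inputs) is where the logic gets complex.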

@martindurant
Member

@artttt , do you think you would like to work towards a solution?

@artttt
Author

artttt commented Oct 5, 2023

Yes, I would like to.
I'm about to be away from the computer for a while so I will pencil it into my schedule to have a look in late Oct.

@martindurant
Member

ping - this got lost

@artttt
Author

artttt commented Dec 5, 2023

@martindurant apologies for the delay. This didn't get lost; I've been unwell.

I've done some exploring towards a solution, and it seems to work well. As you identified, the logic gets a little complex, but I think I've managed to handle most of the edge cases I could identify.

It's not very pretty at this stage, so I've dropped it into a gist rather than a PR for the moment.

https://gist.github.com/artttt/a5e9a080936615d9519104db85d03ecb

@martindurant
Member

Thanks for having a go.
That does look rather long!
Please ping me when PyData is over and I should have time to look.

@martindurant
Member

Do you think you will have the chance to clean up that code and contribute a PR?

@artttt
Author

artttt commented Apr 24, 2024

Hi Martin,
Unfortunately I'm unwell again and not able to do any work on this.
