merge files fails with small or large files #778

Open
artttt opened this issue Sep 4, 2023 · 12 comments

@artttt

artttt commented Sep 4, 2023

Sorry if there is already an issue for this; the word "merge" is difficult to search for due to its use in git.

The merge function throws an error if any part is outside the valid part size range for an S3 multipart upload.

Large input files could be split into multiple parts using the CopySourceRange parameter of upload_part_copy.
Smaller parts will need either at least 5 MiB of data to be downloaded, or temporary files on S3.

Here are some failing tests:

# a and b are test file paths, as in the s3fs test suite
def test_merge_small_part(s3):
    # a is only 4 MiB, below the 5 MiB minimum for a non-final upload_part_copy part
    with s3.open(a, "wb") as f:
        f.write(b"a" * 4 * 2**20)

    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (4 * 2**20) + (10 * 2**20)


def test_merge_large_part(s3):
    # a is 6 GiB, above the 5 GiB maximum part size for upload_part_copy
    with s3.open(a, "wb") as f:
        f.write(b"a" * 6 * 2**30)

    with s3.open(b, "wb") as f:
        f.write(b"a" * 10 * 2**20)
    s3.merge(test_bucket_name + "/joined", [a, b])
    assert s3.info(test_bucket_name + "/joined")["size"] == (6 * 2**30) + (10 * 2**20)
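
For the oversized case, the copy can in principle be done server-side in pieces. A minimal sketch, assuming a hypothetical source object big_src larger than 5 GiB and the same s3 / test_bucket_name fixtures as the tests above (this is not the actual merge implementation):

# sketch only: split one oversized source into compliant server-side copy ranges
PART_MAX = 5 * 2**30  # 5 GiB upper bound per part

def compliant_ranges(size, part_max=PART_MAX):
    # yield inclusive (first, last) byte offsets, each span at most part_max bytes;
    # a fuller implementation would balance the split so that no non-final part
    # falls below the 5 MiB minimum
    start = 0
    while start < size:
        end = min(start + part_max, size) - 1
        yield start, end
        start = end + 1

bucket, key, _ = s3.split_path(test_bucket_name + "/joined_big")  # hypothetical output path
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)
parts = []
size = s3.info(big_src)["size"]
for part_number, (first, last) in enumerate(compliant_ranges(size), start=1):
    out = s3.call_s3(
        "upload_part_copy",
        Bucket=bucket,
        Key=key,
        PartNumber=part_number,
        UploadId=mpu["UploadId"],
        CopySource=big_src,
        CopySourceRange=f"bytes={first}-{last}",
    )
    parts.append({"PartNumber": part_number, "ETag": out["CopyPartResult"]["ETag"]})
s3.call_s3(
    "complete_multipart_upload",
    Bucket=bucket,
    Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": parts},
)

The small-file case is harder, since any non-final part below 5 MiB has to be built from downloaded bytes.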
@martindurant
Member

You are quite right, and this is a limitation of the AWS API. We could implement something which downloads and re-writes smaller chunks. That is basically what open(..., mode="a") does when the existing file is smaller than the 5MB limit. It might get complicated, however, for something like:

file 1: 5GB chunk, 1MB chunk
file 2: 5GB chunk

This would require downloading and re-uploading the whole of the 1MB + 5GB. We are probably better off not supporting this.

@artttt
Author

artttt commented Sep 6, 2023

That example isn't overly complicated as file 1 can be split into 2 compliant chunks.
This one is a little trickier:

file 1: 5GB chunk
file 2: 1MB chunk
file 3: 5GB chunk

For this it wouldn't need to fully download file 3, only the 4 MiB from its start needed to pad file 2 up to the 5 MiB minimum.

There is also a route without downloading data by merging file 1 and file 2 to a temp file and then merging the temp file and file 3.

I may have to do one of these for some work. I will report back if I develop something generally useful.
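
A minimal sketch of the second, no-download route, assuming hypothetical paths f1, f2, f3 and out_path, plus a temporary key:

# sketch only: two-stage merge via a temporary object, no bytes downloaded
tmp = test_bucket_name + "/tmp_joined"  # hypothetical temporary path
s3.merge(tmp, [f1, f2])        # f2 is the final part here, so the 5 MiB minimum does not apply
s3.merge(out_path, [tmp, f3])  # tmp now exceeds the 5 GiB part maximum, so this step
                               # still needs the CopySourceRange splitting discussed above
s3.rm(tmp)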

@martindurant
Member

I believe the whole point of the limitations in S3 merge, is that the remote service will not split chunks at all. So in your example, you can read 4MB of file3, but you cannot copy the remaining (5GB - 4MB) server-side - you must copy the whole chunk or not at all.

That example isn't overly complicated as file 1 can be split into 2 compliant chunks.

I don't think it can; you cannot split any chunk.

@artttt
Author

artttt commented Sep 7, 2023

If that were the case, then _copy_managed would have plenty of issues with files written with unusual part sizes, but it doesn't.

A little experimentation tells me this isn't an issue as long as parts are within the API size limits:

out_path = f"{test_bucket_name}/test/mpu_parts.tif"

# specify some files and byte ranges wanted from those files
# these are not part aligned in the source files
copy_parts = [(a,5,(10 * 2**20)+5),
              (b,0,100),
            ]
bucket, key, _ = s3.split_path(out_path)

mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

out_path = f"{test_bucket_name}/test/mpu_parts.tif"
copy_parts = [(a,5,(10 * 2**20)+5),
              (b,0,100),
            ]


bucket, key, _ = s3.split_path(out_path)

mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

mpu_parts = []
for part_number,(source_path,brange_first, brange_last) in enumerate(copy_parts,start=1):
    # brange_last-1 to match standard python slice semantics
    part_mpu = s3.call_s3(
                    "upload_part_copy",
                    Bucket=mpu["Bucket"],
                    Key=mpu["Key"],
                    PartNumber=part_number,
                    UploadId=mpu["UploadId"],
                    CopySource=source_path,
                    CopySourceRange=f"bytes={brange_first}-{brange_last-1}",
                )
    mpu_parts.append({"PartNumber": part_number, "ETag": part_mpu["CopyPartResult"]["ETag"]})
        
    
fin = s3.call_s3(
            "complete_multipart_upload",
            Bucket=bucket,
            Key=key,
            UploadId=mpu["UploadId"],
            MultipartUpload={"Parts": mpu_parts},
        )
assert s3.info(out_path)["size"] == (10 * 2**20) + 100

As to the intent by AWS, I suspect it is because parts are not merged internally on S3, and storing a small file for each part is inefficient. There is supposedly some performance benefit from making requests that align to part boundaries, but I've never tested this.

@martindurant
Member

x-amz-copy-source-range
The range of bytes to copy from the source object. The range value must use the form bytes=first-last, where the first and last are the zero-based byte offsets to copy. For example, bytes=0-9 indicates that you want to copy the first 10 bytes of the source. You can copy a range only if the source object is greater than 5 MB.

The restriction on the minimum chunk size does not apply to the very last chunk of the output file, however. The following should fail:

copy_parts = [
    (a, 5, (10 * 2**20) + 5),
    (b, 0, 100),
    (a, 5, (10 * 2**20) + 5),
]

However, I now see where you are coming from with regard to CopySourceRange, and it should be possible to handle chunks below the 5MB minimum by download and everything else by remote copy, but the logic to do this seems a little complex.
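
A minimal sketch of that hybrid for the earlier file 1 / file 2 / file 3 example, assuming hypothetical paths f1 (5 GB), f2 (1 MB), f3 (5 GB) and an output out_path; only the ~4 MiB needed to pad f2 up to the minimum part size is downloaded:

# sketch only: f1 (5 GB), f2 (1 MB), f3 (5 GB) merged with one small download
MIN_PART = 5 * 2**20  # 5 MiB minimum for every part except the last

bucket, key, _ = s3.split_path(out_path)
mpu = s3.call_s3("create_multipart_upload", Bucket=bucket, Key=key)

# part 1: whole-object server-side copy of f1 (within the 5 MiB - 5 GiB limits)
p1 = s3.call_s3(
    "upload_part_copy", Bucket=bucket, Key=key, PartNumber=1,
    UploadId=mpu["UploadId"], CopySource=f1,
)

# part 2: f2 is too small on its own, so pad it with the first few MiB of f3,
# downloaded and re-uploaded as ordinary bytes
pad = MIN_PART - s3.info(f2)["size"]
data = s3.cat_file(f2) + s3.cat_file(f3, start=0, end=pad)
p2 = s3.call_s3(
    "upload_part", Bucket=bucket, Key=key, PartNumber=2,
    UploadId=mpu["UploadId"], Body=data,
)

# part 3: server-side ranged copy of the rest of f3
f3_size = s3.info(f3)["size"]
p3 = s3.call_s3(
    "upload_part_copy", Bucket=bucket, Key=key, PartNumber=3,
    UploadId=mpu["UploadId"], CopySource=f3,
    CopySourceRange=f"bytes={pad}-{f3_size - 1}",
)

s3.call_s3(
    "complete_multipart_upload", Bucket=bucket, Key=key,
    UploadId=mpu["UploadId"],
    MultipartUpload={"Parts": [
        {"PartNumber": 1, "ETag": p1["CopyPartResult"]["ETag"]},
        {"PartNumber": 2, "ETag": p2["ETag"]},
        {"PartNumber": 3, "ETag": p3["CopyPartResult"]["ETag"]},
    ]},
)

Generalising this (deciding which byte ranges to download and which to copy server-side for an arbitrary list of inputs) is where the logic gets complex.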

@martindurant
Member

@artttt , do you think you would like to work towards a solution?

@artttt
Author

artttt commented Oct 5, 2023

Yes, I would like to.
I'm about to be away from the computer for a while so I will pencil it into my schedule to have a look in late Oct.

@martindurant
Member

ping - this got lost

@artttt
Author

artttt commented Dec 5, 2023

@martindurant apologies for the delay. This didn't get lost; I've been unwell.

I've done some exploring towards a solution, and it seems to work well. As you identified, the logic gets a little complex, but I think I've managed to handle most of the edge cases I could identify.

It's not very pretty at this stage, so I've dropped it into a gist rather than a PR for the moment.

https://gist.github.com/artttt/a5e9a080936615d9519104db85d03ecb

@martindurant
Member

Thanks for having a go.
That does look rather long!
Please ping me when PyData is over and I should have time to look.

@martindurant
Member

Do you think you will have the chance to clean up that code and contribute a PR?

@artttt
Author

artttt commented Apr 24, 2024

Hi Martin,
Unfortunately I'm unwell again and not able to do any work on this.
