

support for fsspec.generic writes #748

Open
mukhery opened this issue Jun 11, 2023 · 1 comment

mukhery commented Jun 11, 2023

@martindurant we recently addressed reading from s3, so you can now use _cp_file to copy files from s3 to the local fs (or to another fs that supports writes), but we can't write to s3 using this function.

I believe the issue is that S3AsyncStreamedFile doesn't implement the write function, so it falls back to the parent class's write, which then fails because self.closed is not handled properly.

Do you know why write isn't implemented? Could we just use _pipe_file, something like this:

    async def write(self, data):
        if self.mode not in {"wb"}:
            raise ValueError("File not in write mode")
        await self.fs._pipe_file(self.path, data)
        if self.size is None:
            self.size = len(data)
        else:
            self.size += len(data)
        self.loc += len(data)
        return len(data)

Maybe check that put_object returns a 200, add and manage a chunksize setting, and either buffer the data and do a single _pipe_file or turn everything into a multipart upload? What do you think? I can put a PR together if it would help. I'm interested in getting the fsspec.generic cp and rsync functionality fully supported for s3.
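
For the buffering route, here is a rough sketch of the kind of thing I mean, covering only the "buffer and do a single _pipe_file on close" variant (the class and attribute names are placeholders, not existing s3fs code):

    import io

    class BufferedStreamedWriter:
        """Sketch only: accumulate writes in memory and send everything
        with a single _pipe_file call on close."""

        def __init__(self, fs, path):
            self.fs = fs  # an async S3FileSystem instance
            self.path = path
            self.buffer = io.BytesIO()
            self.closed = False

        async def write(self, data):
            if self.closed:
                raise ValueError("I/O operation on closed file")
            self.buffer.write(data)
            return len(data)

        async def close(self):
            if not self.closed:
                # single-shot upload; a multipart version would instead flush
                # a part whenever the buffer passes the chunksize threshold
                await self.fs._pipe_file(self.path, self.buffer.getvalue())
                self.closed = True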

@martindurant (Member) commented

When starting this, I had hoped that actual streaming uploads were possible. However, aiohttp and its derivatives use a pull model on upload rather than push: the caller must supply a file-like object which supports async reading (as opposed to exposing an async write method we can call). It may be possible to work with this model, but it would be painful.
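
To illustrate the pull model: something like the adapter below would be needed to bridge push-style writes to a file-like object that the HTTP layer can read from. This is a sketch only, with made-up names; none of it exists in s3fs today.

    import asyncio

    class QueueReader:
        """Bridge push-style writes to a pull-style async read():
        the producer awaits write()/close(), the uploader awaits read()."""

        def __init__(self):
            self.queue = asyncio.Queue()
            self._pending = b""
            self._eof = False

        async def write(self, data):
            # push side: called by whoever produces the data
            await self.queue.put(data)
            return len(data)

        async def close(self):
            # signal end-of-stream to the reading side
            await self.queue.put(None)

        async def read(self, n=-1):
            # pull side: an aiohttp-style consumer calls this repeatedly
            while not self._eof and (n < 0 or len(self._pending) < n):
                chunk = await self.queue.get()
                if chunk is None:
                    self._eof = True
                else:
                    self._pending += chunk
            if n < 0:
                out, self._pending = self._pending, b""
            else:
                out, self._pending = self._pending[:n], self._pending[n:]
            return out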

The other option is essentially to replicate the logic in the blocking code and expose the async calls there (in initiate_upload, fetch_range and upload_chunk), which would limit the sizes of the individual calls as allowed by S3 rules.
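
A rough sketch of that second option, using the standard S3 multipart-upload calls directly through an aiobotocore client; the class layout is illustrative only, not how the existing blocking file is structured:

    class MultipartWriter:
        """Sketch: buffer incoming data and flush it as multipart parts."""

        def __init__(self, s3, bucket, key, chunksize=5 * 2**20):
            self.s3, self.bucket, self.key = s3, bucket, key
            self.chunksize = chunksize  # S3 parts must be >= 5 MiB (except the last)
            self.buffer = b""
            self.parts = []
            self.upload_id = None

        async def write(self, data):
            self.buffer += data
            while len(self.buffer) >= self.chunksize:
                await self._upload_part(self.buffer[: self.chunksize])
                self.buffer = self.buffer[self.chunksize :]
            return len(data)

        async def _upload_part(self, chunk):
            if self.upload_id is None:
                resp = await self.s3.create_multipart_upload(
                    Bucket=self.bucket, Key=self.key)
                self.upload_id = resp["UploadId"]
            part_number = len(self.parts) + 1
            resp = await self.s3.upload_part(
                Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
                PartNumber=part_number, Body=chunk)
            self.parts.append({"PartNumber": part_number, "ETag": resp["ETag"]})

        async def close(self):
            if self.upload_id is None:
                # small file: nothing ever hit the chunk threshold, so a
                # single put_object is enough
                await self.s3.put_object(
                    Bucket=self.bucket, Key=self.key, Body=self.buffer)
            else:
                if self.buffer:
                    await self._upload_part(self.buffer)  # final short part
                await self.s3.complete_multipart_upload(
                    Bucket=self.bucket, Key=self.key, UploadId=self.upload_id,
                    MultipartUpload={"Parts": self.parts})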

pipe_file would not be useful on its own, since it is a call to send a whole file in one go. It may be async, but there is no way to pause mid-way to read more data, nor a way to repeatedly call it for the same file (each call would overwrite the previous content).
