Allow simple text as input for upload_data_to_dataset #78

Open
mhaas opened this issue Sep 18, 2020 · 1 comment

mhaas commented Sep 18, 2020

Right now, upload_data_to_dataset only accepts binary input, which means callers have to do extra work compared to simply opening a file in text mode or passing a string.

mhaas commented Sep 22, 2020

This is actually not so easy to implement. The requests library strongly prefers that a binary stream (or data) is passed: https://requests.readthedocs.io/en/latest/user/advanced/#streaming-uploads

The naive solution is to read all of the data into memory and encode it there. However, this requires a lot of memory for, e.g., a 5 GiB file, so I would rather not do that.

If we allow file handles in text (non-binary) mode, we have to create a wrapper that encodes the characters to UTF-8 bytes on the fly while correctly handling multi-byte characters. This SO post provides some insight: https://stackoverflow.com/questions/55889474/convert-io-stringio-to-io-bytesio
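To make the wrapper idea concrete, here is a minimal sketch. The class name `TextToBytesReader` and the `chunk_chars` parameter are hypothetical and not part of the SDK. It assumes the text stream only needs to be read forward once. Because whole characters are read from the text stream before encoding, a multi-byte character is never split at encoding time.

```python
import codecs
import io


class TextToBytesReader(io.RawIOBase):
    """Hypothetical read-only wrapper exposing a text stream as bytes."""

    def __init__(self, text_stream, encoding="utf-8", chunk_chars=8192):
        super().__init__()
        self._text_stream = text_stream
        # An incremental encoder keeps state across chunks.
        self._encoder = codecs.getincrementalencoder(encoding)()
        self._chunk_chars = chunk_chars
        self._buffer = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the byte buffer from the text stream when it is empty.
        while not self._buffer:
            chunk = self._text_stream.read(self._chunk_chars)
            if not chunk:
                return 0  # end of stream
            self._buffer = self._encoder.encode(chunk)
        # Hand out as many bytes as the caller's buffer can take.
        n = min(len(b), len(self._buffer))
        b[:n] = self._buffer[:n]
        self._buffer = self._buffer[n:]
        return n


reader = TextToBytesReader(io.StringIO("héllo wörld €"), chunk_chars=4)
payload = reader.read()
```

Note that this wrapper does not know its total length up front, so the size problem discussed in this thread still applies.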

We can implement this ourselves, but it will not be straightforward to determine the total size of the encoded byte stream without processing the entire text. This may even be OK, as the effort is linear in the input size. If we do not have the size of the stream, then requests will switch to chunked transfer encoding, and I am not sure if the Data Attribute Recommendation service supports this.
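The linear pass for computing the size could look roughly like the sketch below. The function name `encoded_size` is hypothetical, and the sketch assumes the text stream is seekable so that the position can be restored afterwards. The on-disk file size cannot simply be used instead, since the source file may use a different encoding or newline convention than the bytes that are eventually sent.

```python
import io


def encoded_size(text_stream, encoding="utf-8", chunk_chars=65536):
    """Return the number of bytes the remaining text would occupy once encoded.

    Reads the stream chunk by chunk, so memory use stays bounded, and
    restores the original position afterwards (requires a seekable stream).
    """
    start = text_stream.tell()
    total = 0
    while True:
        chunk = text_stream.read(chunk_chars)
        if not chunk:
            break
        total += len(chunk.encode(encoding))
    text_stream.seek(start)
    return total


size = encoded_size(io.StringIO("a€"))  # "a" is 1 byte, "€" is 3 bytes in UTF-8
```

The trade-off is that the data is read twice: once to measure it and once to upload it.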

Another solution is to use the codecs.iterencode function (the text-to-bytes counterpart of codecs.iterdecode). This returns an iterable, which will again cause requests to use chunked transfer encoding.
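For going from text to bytes, the relevant standard-library function is `codecs.iterencode`, which lazily encodes an iterator of strings. A minimal sketch follows; the helper name `iter_utf8_chunks` and the `chunk_chars` parameter are hypothetical.

```python
import codecs
import io


def iter_utf8_chunks(text_stream, chunk_chars=8192):
    """Lazily yield UTF-8 encoded chunks read from a text-mode stream."""

    def text_chunks():
        # Read a bounded number of characters at a time.
        while True:
            chunk = text_stream.read(chunk_chars)
            if not chunk:
                return
            yield chunk

    return codecs.iterencode(text_chunks(), "utf-8")


body = b"".join(iter_utf8_chunks(io.StringIO("héllo €"), chunk_chars=2))
```

Passing such an iterator as `data=` to requests results in a chunked upload, as described in the requests documentation, so this only helps if the service accepts chunked transfer encoding.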
