Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for bytes formatted images to submit_image and submit_gallery #1891

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

redowul
Copy link

@redowul redowul commented Jul 26, 2022

Feature Summary and Justification

This feature enables a user to upload an image or images each represented as a Bytes Object to Reddit through the submit_image and submit_gallery functions.

For context, my own use-case involves backblaze.com, particularly this function of their API.
To summarize, a successful request to the backblaze API returns images as bytes objects. After looking into PRAW documentation I discovered that I could only submit an image through it by first downloading and saving the file before providing each of the modified functions the filepath(s) of any images I wanted to submit. Given my use case, this would mean I would need my machine to download, save, upload, and then delete any number of images on my system each time I made use of those functions.

These changes mean an object can be pulled in from a foreign API and passed directly to PRAW for uploading. It skips the awkward saving and deleting steps which should save processing time when scaled.


The "image_path" input parameter for both functions has been changed to "image". It still accepts image paths--it's only the name of the input parameter that has changed so as to more clearly indicate that Bytes Objects now work too. In the case of submit_gallery, the input can be mixed. So there could be any number of image paths and Bytes Objects submitted to the same gallery.

References

redowul and others added 3 commits July 26, 2022 10:36
Files modified:
- submit_image
- submit_gallery
- _validate_gallery
- _upload_media
- _read_and_post_media

Added new import:
- from io import BytesIO

Previously, to submit an image to Reddit a filepath to an image stored on the local machine needed to be provided to PRAW to pass through to the Reddit API. This fork changes this behavior by implementing support for bytes objects through the io library. What this means in practical terms: in the original code, the filepath is referenced to retrieve the image file and convert it to bytes format anyway. Now the bytes of an image can be directly passed through while still retaining the original filepath functionality. This means an image retrieved from a server does not need to be downloaded before it can be uploaded: it's String implementation can be passed along to Reddit instead.
@redowul
Copy link
Author

redowul commented Jul 26, 2022

Majority of the failures appear to be caused by:

from PIL import Image
ModuleNotFoundError: No module named 'PIL'

Is the import of new modules permitted? Or does the addition of this module spell death for this PR?

@LilSpazJoekp
Copy link
Member

LilSpazJoekp commented Jul 26, 2022

Let's not add a new dependency as BytesIO would be sufficient.

I'll have to take a deeper look at this be just skimming though it, it seems that there is going to be breaking changes. These will need to have a deprecation period. You can tell that you have breaking changes if the existing tests fail without any changes.

ETA: this is a very welcome change and I'd love to see this added.

@redowul
Copy link
Author

redowul commented Jul 26, 2022

Alright, sounds great. Would you like me to make changes and resubmit? Or are you going to make the corrections yourself? I haven't contributed to anything open-source before so I'm unsure of guidelines surrounding this sort of thing.

@LilSpazJoekp
Copy link
Member

You got a great start! There is a few options here:

  • replace the existing image path parameter (like you've done)
    • this requires a deprecation
  • add a new parameter
  • make the existing parameter accept a file like object as well as a path (this isn't the most preferable option as it can cause confusion with path in the parameter name.

The preferable option is to deprecate the existing parameter, rename it, and have it accept both. This shouldn't be too difficult. Take a look at the existing deprecations for some examples of how it's been done.

This should get you started. Let me know if you need any help!

@bboe
Copy link
Member

bboe commented Jul 26, 2022

The preferable option is to deprecate the existing parameter, rename it, and have it accept both. This shouldn't be too difficult. Take a look at the existing deprecations for some examples of how it's been done.

I think that is common for many libraries, but have a distaste for the variability of the type. I'd prefer to have mutually exclusive keyword arguments media_path, and media_object and use a runtime exception (or whatever we commonly use for that) if both or neither are provided. Thoughts?

@LilSpazJoekp
Copy link
Member

but have a distaste for the variability of the type

That is a good point.

I do like the idea of adding a new parameter to the media methods (including inline media) for file like objects. My only suggestion is to do _fp instead of _object. _object feels odd and makes me think there is a PRAW type that would need to be used.

@redowul
Copy link
Author

redowul commented Jul 26, 2022

I'd prefer to have mutually exclusive keyword arguments media_path, and media_object and use a runtime exception (or whatever we commonly use for that) if both or neither are provided. Thoughts?

The original version I wrote actually supported that, though not exactly as you've described. I decided to merge both parameters into one because I wasn't sure what the preferred way of handling the issue of both types being submitted was.

So, right now the checklist looks like:

  • Use BytesIO as the exclusive dependency to handle Bytes Objects
  • Implement mutually exclusive keyword arguments for media_path and media_fp (not media_object?)
  • Add a runtime exception if neither are provided for submit_gallery or if both are provided for submit_image (which means they are now optional parameters for both functions)

Do I have all that right?

@LilSpazJoekp
Copy link
Member

For the most part yes.

Use BytesIO as the exclusive dependency to handle Bytes Objects

I think BinaryIO would be more appropriate since this is for media objects.

Implement mutually exclusive keyword arguments for media_path and media_fp (not media_object?)

This is still up for discussion.

Add a runtime exception if neither are provided for submit_gallery or if both are provided for submit_image (which means they are now optional parameters for both functions)

We need to consider the implications of adding new arguments with the methods that have positional arguments deprecated. i.e., if they become optional they need to be made a keyword-only argument and that might mess with another deprecation. I will help with this point as the code for _deprecate_args decorator is a bit complex.

Do I have all that right?

For the most part, yes. I think it would be best to also include the inline media, subreddit stylesheet images, and widget images as well (basically anything you can upload an image for).

@bboe
Copy link
Member

bboe commented Jul 26, 2022

media_fp works for me, as long as we're consistent (I'm not sure if file objects are passed anywhere else in praw).

@redowul
Copy link
Author

redowul commented Jul 27, 2022

Made the changes discussed here, though so far only to submit_image and submit_gallery. Both of those work, but because the changes were only made there several other functions now fail because they lack the new parameter (media_fp.) Will need to go through and adjust any functions lacking the parameter so they include it.

I removed the PIL import but couldn't figure out how to make use of BinaryIO (or even how to use it, Google was unhelpful.) Instead I decided to make use of the already included dependency of BytesIO, which proved useful enough.

With that out of the way, the core of the new implementation works as follows:

            magic_number = [int (aByte) for aByte in media_fp[:8]] # gets the format indicator

            file_headers = { 
                tuple([int (aByte) for aByte in bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])]): 'png',
                tuple([int (aByte) for aByte in bytes([0x6D, 0x6F, 0x6F, 0x76])]): 'mov',
                tuple([int (aByte) for aByte in bytes([0x66, 0x74, 0x79, 0x70, 0x69, 0x73, 0x6F, 0x6D])]): 'mp4',
                tuple([int (aByte) for aByte in bytes([0xFF, 0xD8, 0xFF, 0xE0])]): 'jpg',  
                tuple([int (aByte) for aByte in bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10, 0x4A, 0x46])]): 'jpeg', 
                tuple([int (aByte) for aByte in bytes([0x47, 0x49, 0x46, 0x38, 0x37, 0x61])]): 'gif',   
            }
            for size in range(4, 10, 2): # size will equal 4, 6, 8
                file_extension = file_headers.get(tuple(magic_number[:size])) 
                if file_extension is not None: 
                    mime_type = mime_type.get(file_extension, "image/jpeg") # default to JPEG
                    file_name = mime_type.split("/")[0] + "." + mime_type.split("/")[1]
                    break 

Every file format has a unique number of bytes (called a magic number since I guess the 1960s? I just learned about this while putting this together) in its header to indicate what format the file is in. This is the same every time for a format of a given type. PNG files for example contain the following decimal values (converted from hexidecimal): 137 80 78 71 13 10 26 10. Incoming byte values are converted to a list of integers and used as a key for the file_headers dictionary to pull out the file_extension value.

Otherwise, media_path has been reverted to its original functionality. Overall I think I'm going to reach a point where I plateau due to not understanding the deeper workings of PRAW--that's probably when I'll need to hand it off to Joel for final touchups--since you offered.

Thoughts?

@bboe
Copy link
Member

bboe commented Jul 27, 2022

I don't understand the point of the conversion data inside of PRAW. Anything needed to convert from the PIL image format to BytesIO objects should happen on the caller-side. It looks like you can "save" to a BytesIO file in this way: https://stackoverflow.com/questions/646286/how-to-write-png-image-to-string-with-the-pil

@redowul
Copy link
Author

redowul commented Jul 27, 2022

I don't understand the point of the conversion data inside of PRAW. Anything needed to convert from the PIL image format to BytesIO objects should happen on the caller-side. It looks like you can "save" to a BytesIO file in this way:

The idea was to avoid converting as much as possible in this case. For example, the backblaze api returns a regular encoded (hexadecimal) String. I reasoned that this would be ubiquitous among similar services and therefore the simplest type a web browser could return, which I think means it has the best chance of being the format an end-user would make use of when passing to PRAW. Otherwise the file is probably stored on their local machine, so why wouldn't they just supply the filepath at that point?

( Changing it to require an image converted from PIL (or any other library) would not be difficult; it's just a matter of deciding which input format is preferable. )

@bboe
Copy link
Member

bboe commented Jul 28, 2022

Changing it to require an image converted from PIL (or any other library) would not be difficult; it's just a matter of deciding which input format is preferable.

For the media_fp it should be an object that has a read method just like File objects and when read it returns the bytes just as a file opened in rb mode. Any conversion to get into that format should happen on the user-side code not inside of PRAW.

There's an example here of saving a PIL file to a BytesIO object: https://stackoverflow.com/a/646297

You'll likely need to seek the BytesIO object back to position 0 after saving: output.seek(0).

@redowul
Copy link
Author

redowul commented Jul 30, 2022

So upon further reviewing your last post, I'm a little confused @bboe. The link you provided talks about converting an image to a bytes string, which is already the format the current code accepts--there's no conversion going on inside of PRAW in my latest update to this PR. The current flow would be for someone to utilize the code from your Stack Overflow link, reproduced here:

import io

with io.BytesIO() as output:
    image.save(output, format="GIF")
    contents = output.getvalue()

outside of PRAW, and then they would submit the contents variable (a string) to submit_image or submit_gallery using the new media_fp parameter, like so:

reddit.subreddit("test").submit_image("title", image_fp=contents)

(I have already verified that this works using the code in its present state. I've tested it and it posts to Reddit as expected.)
It almost sounds like it already does what you're talking about--do you mind elaborating a bit?

Thanks you!

@redowul
Copy link
Author

redowul commented Aug 1, 2022

How do I run tests on PRAW without pushing new code? This is a little too slow and tedious. I'd rather just do all the testing internally before pushing, since it would be both faster and look better in the commit history...

@LilSpazJoekp
Copy link
Member

Should be able to do so by running pytest locally in the repository root.

@redowul
Copy link
Author

redowul commented Aug 3, 2022

Okay, after some effort I've managed to ensure all the unit tests are passing. A brief summary of what that means:

  • media_path and media_fp are both now considered optional parameters in the _upload_media() function.
    Prior to my PR when media_path was the only input parameter, this code handled null inputs by assigning a default value to media_path. I just reused that to handle when both media_path and media_fp are null. (This means functions like submit_video which also make use of the _upload_media() function can remain unchanged while still retaining their funtionality.)

  • I added a test called "test_submit_gallery__invalid_image_fp", which can be viewed here:

     def test_submit_gallery__invalid_image_fp(self):
        subreddit = Subreddit(self.reddit, display_name="name")
    
        message = (
            "'image_fp' dictionary value at index 0 contains an invalid bytes object."
        )
        with pytest.raises(TypeError) as excinfo:
            subreddit.submit_gallery("Cool title", [{"image_fp": "invalid_image"}])
        assert str(excinfo.value) == message
    
        message = "media_fp does not represent an accepted file format (png, mov, mp4, jpg, jpeg, gif.)"
        encoded_string = "invalid_image".encode()
        with pytest.raises(TypeError) as excinfo:
            subreddit.submit_gallery("Cool title", [{"image_fp": encoded_string}])
        assert str(excinfo.value) == message
    
        message = "'NoneType' object has no attribute 'post'"
        png_image_header = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A]
        invalid_png_image = bytes(bytearray(png_image_header) + encoded_string)
    
        with pytest.raises(AttributeError) as excinfo:
            subreddit.submit_gallery("Cool title", [{"image_fp": invalid_png_image}])
        assert str(excinfo.value) == message
    

This handles invalid objects, invalid file headers, and corrupted images being passed to the media_fp parameter. The code I've added should catch all of these and output (hopefully) descriptive error messages.

Let me know what you think. Hopefully it's getting there.

@bboe
Copy link
Member

bboe commented Aug 10, 2022

I'm taking a look at this currently.

@bboe
Copy link
Member

bboe commented Aug 10, 2022

Okay so there are a few things that are odd with our code that I think we should change prior to introducing your change.

  1. _read_and_post_media should take a file pointer rather than try to open the file. This change will enable passing file-like objects to work more transparently.

  2. _upload_media should not do any expected mime type checking. Instead it should only take the following arguments:

    • file_name (the name of the file, not the path)
    • fp (the File-like object)
    • mime_type (the mime type that will be sent)
    • upload_type (same as the current argument)
  3. The expected mime type checking, and opening of the file should happen in another function like _prepare_media_upload which will only be used when a filename is provided. When a fileobject is provided, then it is necessary that the caller provides the mime-type.

Thoughts?

@redowul
Copy link
Author

redowul commented Aug 11, 2022

  1. I believe that's already what it's doing, assuming this is the slice of code you're referencing (which explicitly includes the keyword "open"):
def _read_and_post_media(self, media_path, media_fp, upload_url, upload_data):
        response = None
        if media_path is not None and media_fp is None:
            with open(media_path, "rb") as media:
                response = self._reddit._core._requestor._http.post(
                    upload_url, data=upload_data, files={"file": media}
                )
        elif media_path is None and media_fp is not None:
            response = self._reddit._core._requestor._http.post(
                upload_url, data=upload_data, files={"file": BytesIO(media_fp)}
            )
        return response

Here it is media_path that is being opened before being sent along in the POST request, not media_fp. This functionality was not introduced by me however. It was already present. I may be misunderstanding what you mean though.

  1. That's reasonable, it would just put the impetus on the end-user to provide accurate data prior to their request. I believe sending the wrong mime-type will trigger a RequestException which is currently being passed over:
try:
            upload_response = self._reddit.post(url, data=img_data)
            upload_lease = upload_response["args"]
            upload_url = f"https:{upload_lease['action']}"
            upload_data = {
                item["name"]: item["value"] for item in upload_lease["fields"]
            }
        except RequestException:
            pass

A new exception will likely be triggered in the following _read_and_post_media function. Catching it and providing an error message to the end user should be enough.

Finally, I think that since mime_type as an input parameter is only relevant as a supplement to media_fp (since the mime type of media_path is derived from the path itself) it makes more sense to convert media_fp into a dictionary like so: media_fp: Optional[Dict[bytes, str]] where a bytes key refers to a file pointer and a mime_type key refers to the mime type. This could be made clear to an end user by updating the documentation at the beginning of the submit_image and submit_gallery functions to provide clear examples of how the parameter should be provided.

  1. To be clear, this would be for the mime type checking of media_path, correct? Because the preceding mime type check functionality is so small that it might not warrant an additional function, and that's all that would remain after the stripping out of the type checking for bytes objects. I'm not sure if this part is necessary on account of that.

@LilSpazJoekp LilSpazJoekp changed the title Added support for bytes formatted images to submit_image and submit_gallery Add support for bytes formatted images to submit_image and submit_gallery Aug 11, 2022
@redowul
Copy link
Author

redowul commented Aug 11, 2022

So after reviewing the code I ultimately decided to throw out my response to point 2 and actually just go with what you recommended @bboe, it ended up being the best way forward in terms of readability and code complexity (or lack thereof!) I've removed the mime checking and replaced it with a user-supplied mime_type string.

I've added the following to try and ensure parity with media_path:

if isinstance(media_fp, bytes):
                mime_type = mime_types.get(
                    mime_type.partition("/")[1], "image/jpeg"
                )  # default to JPEG

The idea being to ensure a mime_type is always supplied which is a vestige of the old code.

Currently I'm adding little scraps of documentation inside of submit_image, submit_gallery, and upload_media so that mime_type has a documented reference for anyone who looks into the code in the future.

@LilSpazJoekp
Copy link
Member

  1. I believe that's already what it's doing, assuming this is the slice of code you're referencing (which explicitly includes the keyword "open"):
def _read_and_post_media(self, media_path, media_fp, upload_url, upload_data):
        response = None
        if media_path is not None and media_fp is None:
            with open(media_path, "rb") as media:
                response = self._reddit._core._requestor._http.post(
                    upload_url, data=upload_data, files={"file": media}
                )
        elif media_path is None and media_fp is not None:
            response = self._reddit._core._requestor._http.post(
                upload_url, data=upload_data, files={"file": BytesIO(media_fp)}
            )
        return response

Here it is media_path that is being opened before being sent along in the POST request, not media_fp. This functionality was not introduced by me however. It was already present. I may be misunderstanding what you mean though.

What @bboe is meaning is to change the existing functionality in PRAW before your changes:

Okay so there are a few things that are odd with our code that I think we should change prior to introducing your change.

To comment on what was said here, I agree with the first point. The second point, I think _upload_media could have the following signature:

  • file_name (the name of the file, not the path)
  • fp (the File-like object)
  • upload_type (same as the current argument)

I think the mime type could be removed here because it could be derived from the file name (which is what PRAW currently does). For the third point, I do think there should be a separate function like what @bboe suggested, however, it should require file_name if a BytesIO like object is passed. If a file pointer (i.e., BufferedReader) is passed, fp should have a name attribute containing the file name.

@LilSpazJoekp
Copy link
Member

I also wanted to add that this PR should also include the methods for uploading images for inline media, emojis, widgets, and the stylesheet.

@redowul
Copy link
Author

redowul commented Aug 12, 2022

  • file_name (the name of the file, not the path)
  • fp (the File-like object)
  • upload_type (same as the current argument)

So would media_path still exist as a parameter? E.g., would the final function look like this:

def _upload_media(
        self,
        *,
        expected_mime_prefix: Optional[str] = None,
        path: Optional[str] = None,
        fp: Optional[bytes] = None,
        file_name: Optional[str] = None,
        upload_type: str = "link",
    ):

And to comment on the inline media/emoji/widget/stylesheet part, I'd like to nail down exactly how this part should work first. It will make copying the functionality over easier once the existing code in the PR has been finalized (submit_image, submit_gallery, _upload_media.)

@redowul
Copy link
Author

redowul commented Aug 12, 2022

To comment on this part:

it should require file_name if a BytesIO like object is passed. If a file pointer (i.e., BufferedReader) is passed, fp should have a name attribute containing the file name.

Could you provide an example of how that would differ from a BytesIO like object? Right now fp would just be an encoded String.

@LilSpazJoekp
Copy link
Member

A BufferedReader is what is returned from with open("file_name", "r") as fp: and fp will have a name attribute with the value of file_name.

@LilSpazJoekp LilSpazJoekp added the Feature New feature or request label Aug 28, 2022
@github-actions
Copy link

This PR is stale because it has been open for 20 days with no activity. Remove the Stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale Issue or pull request has been inactive for 20 days label Sep 18, 2022
@redowul
Copy link
Author

redowul commented Sep 26, 2022

Going to continue working on this soon. Been otherwise occupied but I haven't abandoned this.

@bboe bboe removed the Stale Issue or pull request has been inactive for 20 days label Sep 26, 2022
@github-actions
Copy link

This PR is stale because it has been open for 20 days with no activity. Remove the Stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale Issue or pull request has been inactive for 20 days label Oct 16, 2022
@LilSpazJoekp LilSpazJoekp removed the Stale Issue or pull request has been inactive for 20 days label Oct 23, 2022
@LilSpazJoekp
Copy link
Member

Just checking if you're still wanting this to be merged in? Is there anything we could help with?

@redowul
Copy link
Author

redowul commented Oct 23, 2022 via email

@github-actions
Copy link

This PR is stale because it has been open for 30 days with no activity. Remove the Stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the Stale Issue or pull request has been inactive for 20 days label Nov 22, 2022
@LilSpazJoekp LilSpazJoekp removed the Stale Issue or pull request has been inactive for 20 days label Nov 24, 2022
@LilSpazJoekp LilSpazJoekp added the Discussion Open for discussion label Dec 22, 2022
@gergovari
Copy link

Using a BufferedReader for upload would be a welcome change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion Open for discussion Feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants