Paligemma support for multi-image #33447
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for adding!
One q and a handful of tiny nits
```diff
@@ -308,7 +308,7 @@ def test_save_load_low_cpu_mem_usage_no_safetensors(self):

     @slow
     @require_torch
-    @require_read_token
+    # @require_read_token
```
Should remove if not required
```diff
@@ -340,6 +340,32 @@ def test_small_model_integration_test(self):
         EXPECTED_DECODED_TEXT,
     )

+    @slow
+    # @require_read_token
```
same here
Yes, I wanted to leave it at the general class level and remove it from each test; it seems redundant otherwise.
```python
inputs = processor(text=prompt, images=[[snow_image, stop_sign_image]], return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=20)
EXPECTED_DECODED_TEXT = "answer en Which of the two pictures shows a snowman, first or second?\nFirst"
```
it gets it wrong :(
Does it get it wrong in your local env, or am I misunderstanding something? In general it was hard to ask specific enough questions; it seems like multi-image can't answer open-ended questions.
Aren't these types of input/model combinations meh at enumeration/counting? Considering the short snippet from the paper, asking a combined question would have a better chance of being consistent/successful across different envs, maybe "what is the main difference between these two images?"
OK, we can change the test prompt, let me try some variations. For my last open question, "What do these images have in common?", I got a garbage answer.
What do you think of the one I added just now? I followed the NLVR2 format, where the dataset contains only true/false questions, and added two prompts to make sure we don't get the right answer by chance.
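For illustration, a minimal sketch of what such NLVR2-style true/false prompts could look like under the nested-images convention; the prompt wording and image variables here are hypothetical, not the exact ones in the test:

```python
# Hypothetical true/false prompts over the same image pair. Using two
# prompts with opposite expected answers guards against the model
# producing the right string by chance.
prompts = [
    "answer en Both images show a snowman. True or False?",
    "answer en Neither image shows a stop sign. True or False?",
]
images = [[snow_image, stop_sign_image], [snow_image, stop_sign_image]]

inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=5)
```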
Sounds better to me!
```python
elif isinstance(images, list) and is_valid_image(images[0]):
    # A flat list of images is treated as a batch of single-image samples.
    images = [[image] for image in images]
```
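For context, a self-contained sketch of the kind of input normalization this branch belongs to; this is a simplified stand-in, not the exact transformers implementation:

```python
from transformers.image_utils import is_valid_image

def make_nested_images(images):
    """Normalize `images` to a list of lists: one inner list per sample.

    Simplified sketch; the real processor also handles tensors/arrays
    and performs stricter validation.
    """
    if is_valid_image(images):
        # A single image becomes a batch with one single-image sample.
        return [[images]]
    if isinstance(images, list) and images and is_valid_image(images[0]):
        # A flat list is treated as a batch of single-image samples.
        return [[image] for image in images]
    if isinstance(images, list) and images and isinstance(images[0], list):
        # Already nested: one inner list of images per sample.
        return images
    raise ValueError("Invalid images input; expected an image, a list, or a nested list.")
```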
For the processing: does this mean that `[image_0, image_1]` is interpreted as two images for a single sample in the minibatch, or as a batch size of two?
Yes, in that case it will be treated as a batch of two images, which would need two prompts. This is quite inconsistent with the LLaVAs, but PaliGemma doesn't use an image token in the input text, so we have no way of knowing how many images a user wants per prompt. Maybe I should add an example to the model doc page.
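To make the two interpretations concrete, a small sketch; the prompt and image variables are placeholders:

```python
# Flat list: under the behavior described above, this is a batch of two
# single-image samples, so it must be paired with two prompts.
inputs = processor(
    text=[prompt_0, prompt_1],
    images=[image_0, image_1],
    padding=True,
    return_tensors="pt",
)

# Nested list: one sample containing two images, paired with a single prompt.
inputs = processor(
    text=[prompt_0],
    images=[[image_0, image_1]],
    return_tensors="pt",
)
```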
Is there no way of making this so `[image_0, image_1]` would be equivalent to two images for a single prompt?

> but Paligemma doesn't use image token in the input text, so we have no way of knowing how many images a user wants to use per prompt

There are two reasons I think we should aim for this:

- This would provide structure such that the user can explicitly express how many images they want to use per sample.
- This would make the input structure consistent with other models, e.g. llava and idefics2. This is important if we want to be able to cross-load checkpoints in pipelines.
Yes, that is good for consistency and readability in general, but it would be a bit painful in terms of supporting the old behavior. Probably we'd need to support the old behavior for a long time, like until v5.0.

Let me see how it goes in code and whether it doesn't get too messy.
> Probably we'd need to support the old behavior for a long time, like until v5.0

Why? PaliGemma was only very recently added, so it's not a long-standing model behaviour we need to preserve. We can do this in a deprecation cycle, imo.

> Let me see how it goes in code and whether it doesn't get too messy

OK!
Oh, btw, I just discovered that Idefics and similar models in transformers enforce batched and nested images as input, and otherwise throw an error. Is that something we should worry about?
@amyeroberts what do you say about this one? Pushed some updates.

One thing that I didn't like is the bos token, which should be added before all text but after the image tokens. I thought that if we ask users to add the special image tokens, we can ask them to add the bos as well, which is what we have now.
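For reference, a sketch of the layout being described: image placeholder tokens first, then the bos token, then the text. The token string and count below are illustrative of the usual PaliGemma convention, not taken from this PR:

```python
IMAGE_TOKEN = "<image>"
num_image_tokens = 256  # illustrative; depends on the checkpoint's resolution
num_images = 2

prompt = "answer en Which of the two pictures shows a snowman?"
# Image tokens come before everything, then <bos>, then the prompt text,
# terminated by a newline.
text_input = IMAGE_TOKEN * (num_image_tokens * num_images) + "<bos>" + prompt + "\n"
```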
Looks good 👌
lgtm, just a couple of remarks on variable naming, especially in the tests :)
Co-authored-by: Pablo Montalvo <[email protected]>
Thanks for adding this support!
Co-authored-by: amyeroberts <[email protected]>
* upadte
* Update src/transformers/models/paligemma/processing_paligemma.py
  Co-authored-by: amyeroberts <[email protected]>
* update docs
* better example in tests
* support image tokens
* read token
* Update tests/models/paligemma/test_processing_paligemma.py
  Co-authored-by: Pablo Montalvo <[email protected]>
* nit: naming
* Update docs/source/en/model_doc/paligemma.md
  Co-authored-by: amyeroberts <[email protected]>
* conflicts after rebasing

---------

Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
What does this PR do?
Fixes #33113. Certain PaliGemma checkpoints are expected to support multiple images per prompt, according to the arxiv paper. This PR adds support for that in our code. We expect users to pass images nested in a batch when a prompt uses more than one image. For example:

```python
processor(text=[text1, text2], images=[[im1, im2], [im3]], padding=True, return_tensors="pt")
```
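A fuller end-to-end sketch of the intended usage; the checkpoint name, file paths, and prompts below are illustrative, not prescriptive:

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # illustrative checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Illustrative local image paths.
im1, im2, im3 = (Image.open(p) for p in ("a.jpg", "b.jpg", "c.jpg"))
text1 = "answer en What do these two images have in common?"
text2 = "answer en What is shown in this image?"

# The first sample gets two images, the second sample gets one.
inputs = processor(text=[text1, text2], images=[[im1, im2], [im3]], padding=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(output, skip_special_tokens=True))
```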