
Paligemma support for multi-image #33447

Merged
11 commits merged into huggingface:main on Sep 27, 2024

Conversation

@zucchini-nlp (Member) commented Sep 12, 2024

What does this PR do?

We also transfer PaliGemma to tasks which take multiple images as input. NLVR2 is one such task, which asks one question about two images, and requires looking at both to give the correct answer. Other such tasks are standard short-video understanding tasks subsampled to 16 frames. In all these cases, we follow PaLI-3 and encode each image separately, then concatenate the image tokens without any special separator or embedding tokens. Thus, 16 frames at 224px resolution result in 16 × 256 = 4096 image tokens.
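
To illustrate that scheme, here is a rough sketch (not the implementation in this PR; `vision_tower` and `encode_frames` are hypothetical names, and the 256-tokens-per-image figure assumes the 224px SigLIP encoder with patch size 14):

```python
import torch

def encode_frames(vision_tower, frames):
    """Encode each frame separately and concatenate the image tokens.

    frames: list of (3, 224, 224) image tensors. Each 224px image yields
    256 tokens (16 x 16 patches), so 16 frames give 16 * 256 = 4096 image
    tokens, with no separator tokens in between.
    """
    per_image = [vision_tower(frame.unsqueeze(0)) for frame in frames]  # each (1, 256, hidden)
    return torch.cat(per_image, dim=1)  # (1, num_frames * 256, hidden)
```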

Fixes #33113. Certain PaliGemma checkpoints are expected to support multiple images, according to the arXiv paper. This PR adds support for that in our code. We expect users to pass images as a nested batch when there is more than one image per prompt. For example:

```python
processor(text=[text1, text2], images=[[im1, im2], [im3]], padding=True, return_tensors="pt")
```
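
A fuller usage sketch under the same API (the checkpoint id is just one PaliGemma model; the images and prompts here are dummy placeholders):

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Dummy stand-ins; in practice these would be real images.
im1, im2, im3 = (Image.new("RGB", (224, 224)) for _ in range(3))
text1 = "answer en Which of the two pictures shows a snowman?"
text2 = "caption en"

# The first prompt gets two images, the second prompt gets one.
inputs = processor(text=[text1, text2], images=[[im1, im2], [im3]], padding=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(output, skip_special_tokens=True))
```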

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment

Thanks for adding!

One question and a handful of tiny nits.

```diff
@@ -308,7 +308,7 @@ def test_save_load_low_cpu_mem_usage_no_safetensors(self):

 @slow
 @require_torch
-@require_read_token
+# @require_read_token
```
@amyeroberts (Collaborator)

Should remove if not required

```diff
@@ -340,6 +340,32 @@ def test_small_model_integration_test(self):
     EXPECTED_DECODED_TEXT,
 )

 @slow
 # @require_read_token
```
@amyeroberts (Collaborator)

same here

@zucchini-nlp (Member, Author)

Yes, I wanted to leave it at the general class level and remove it from each test; it seems redundant otherwise.

```python
inputs = processor(text=prompt, images=[[snow_image, stop_sign_image]], return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=20)
EXPECTED_DECODED_TEXT = "answer en Which of the two pictures shows a snowman, first or second?\nFirst"
```
@amyeroberts (Collaborator)

it gets it wrong :(

@zucchini-nlp (Member, Author)

Gets it wrong in your local env, or am I misunderstanding something? In general it was hard to ask specific enough questions; it seems multi-image can't answer open-ended questions.

@molbap (Contributor)

Aren't these types of input/model combinations meh at enumeration/counting? Considering the short snippet from the paper, asking a combined question would have a better chance of being consistent/successful across different envs, maybe "what is the main difference between these two images?"

@zucchini-nlp (Member, Author)

Okay, we can change the test prompt; let me try some variations. For my last open-ended question, "What do these images have in common?", I got a garbage answer.

@zucchini-nlp (Member, Author)

What do you think of the one I added just now? I followed the NLVR2 format, where the dataset contains only true/false-style questions, and added two prompts to make sure we don't get the right answer by chance.
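
Roughly, the check looks like this (a sketch; the exact prompt strings in the test may differ, and `snow_image`/`stop_sign_image` come from the surrounding test setup):

```python
# Two prompts over the same image pair; requiring both answers to be
# correct guards against the model getting one right by chance.
prompts = [
    "answer en Which of the two pictures shows a snowman, first or second?",
    "answer en Which of the two pictures shows a stop sign, first or second?",
]
inputs = processor(
    text=prompts,
    images=[[snow_image, stop_sign_image], [snow_image, stop_sign_image]],
    padding=True,
    return_tensors="pt",
)
```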

@molbap (Contributor)

Sounds better to me!

src/transformers/models/paligemma/processing_paligemma.py (resolved)
Comment on lines 268 to 269

```python
elif isinstance(images, list) and is_valid_image(images[0]):
    # Flat list of images: wrap each image so that every sample in the
    # batch carries its own list of images.
    images = [[image] for image in images]
```
@amyeroberts (Collaborator)

For the processing: does this mean that `[image_0, image_1]` is interpreted as two images for a single sample in the minibatch, or as a batch size of two?

@zucchini-nlp (Member, Author)

Yes, in that case it will be treated as a batch of two images, and this would need two prompts. This is quite inconsistent with the LLaVA models, but PaliGemma doesn't use an image token in the input text, so we have no way of knowing how many images a user wants per prompt. Maybe I should add an example to the model doc page.
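
To make the two readings concrete (a sketch; variable names are illustrative):

```python
# A flat list is treated as a batch of two samples, one image each,
# so it needs two prompts:
batch_inputs = processor(text=[prompt_a, prompt_b], images=[image_0, image_1], padding=True, return_tensors="pt")

# A nested list is one sample that contains both images, with one prompt:
multi_image_inputs = processor(text=prompt_a, images=[[image_0, image_1]], return_tensors="pt")
```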

@amyeroberts (Collaborator)

Is there no way of making this so `[image_0, image_1]` would be equivalent to two images for a single prompt?

There are two reasons I think we should aim for this:

> but PaliGemma doesn't use an image token in the input text, so we have no way of knowing how many images a user wants per prompt

  1. This would provide structure such that the user can explicitly express how many images they want to use per sample.

  2. This would make the input structure consistent with other models, e.g. LLaVA and Idefics2. This is important if we want to be able to cross-load checkpoints in pipelines (one way to support both layouts is sketched below).
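
For concreteness, one way to accept both layouts during a deprecation period is to normalize flat inputs into the nested form, along these lines (a sketch, not the PR's exact code; `make_nested` is a hypothetical helper):

```python
from transformers.image_utils import is_valid_image

def make_nested(images):
    """Normalize `images` into a list of per-sample image lists."""
    if is_valid_image(images):
        return [[images]]  # a single image: one sample with one image
    if isinstance(images, list) and is_valid_image(images[0]):
        # Flat list: ambiguous without image tokens in the text, so keep
        # the old behavior of one image per sample (deprecating it over time).
        return [[image] for image in images]
    return images  # already nested: one inner list per sample
```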

@zucchini-nlp (Member, Author)

Yes, that is good for consistency and readability in general, but it would be a bit painful in terms of supporting the old behavior. Probably we'd need to support the old behavior for a long time, like until v5.0.

Let me see how it goes in code and whether it doesn't mess things up much.

@amyeroberts (Collaborator)

> Probably we'd need to support the old behavior for a long time, like until v5.0.

Why? PaliGemma was only very recently added, so it's not a long-standing model behaviour we need to preserve. We can do this in a deprecation cycle, imo.

> Let me see how it goes in code and whether it doesn't mess things up much.

OK!

@zucchini-nlp (Member, Author)

Oh, btw, I just discovered that Idefics and similar models in transformers enforce batched and nested images as input, and otherwise throw an error. Is that something we should worry about?

@zucchini-nlp (Member, Author)

@amyeroberts what do you say about this one? Pushed some updates.

One thing I didn't like is the BOS token, which should be added before all the text but after the image tokens. I figured that if we ask users to add the special image tokens, we can ask them to add the BOS token as well, which is what we have now.
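
For context, the layout under discussion is roughly the following (a sketch; `<image>` and the 256-token count match the 224px checkpoints, but `build_prompt` is an illustrative helper, not the processor's actual method):

```python
IMAGE_TOKEN = "<image>"
NUM_IMAGE_TOKENS = 256  # per image at 224px resolution

def build_prompt(text: str, num_images: int = 1) -> str:
    # Image placeholder tokens come first, then BOS, then the text,
    # then a trailing newline before generation starts.
    return IMAGE_TOKEN * (NUM_IMAGE_TOKENS * num_images) + "<bos>" + text + "\n"
```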

@amyeroberts (Collaborator)

Looks good 👌

@molbap (Contributor) left a comment

lgtm, just a couple of remarks on variable naming, especially in the tests :)

@amyeroberts (Collaborator) left a comment

Thanks for adding this support!

docs/source/en/model_doc/paligemma.md (three review threads, resolved)
@zucchini-nlp zucchini-nlp merged commit 3e039d3 into huggingface:main Sep 27, 2024
17 checks passed
BenjaminBossan pushed a commit to BenjaminBossan/transformers that referenced this pull request Sep 30, 2024
* upadte

* Update src/transformers/models/paligemma/processing_paligemma.py

Co-authored-by: amyeroberts <[email protected]>

* update docs

* better example in tests

* support image tokens

* read token

* Update tests/models/paligemma/test_processing_paligemma.py

Co-authored-by: Pablo Montalvo <[email protected]>

* nit: naming

* Update docs/source/en/model_doc/paligemma.md

Co-authored-by: amyeroberts <[email protected]>

* conflicts after rebasing

---------

Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
amyeroberts added a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
Successfully merging this pull request may close these issues:

Add multi image prompts to multimodal LLMs that support it (PaliGemma) (#33113)

4 participants