
Paligemma support for multi-image #33447

Merged
11 commits merged into huggingface:main on Sep 27, 2024

Conversation

@zucchini-nlp (Member) commented Sep 12, 2024

What does this PR do?

We also transfer PaliGemma to tasks which take multiple images as input. NLVR2 is one such task, which asks one question about two images, and requires looking at both to give the correct answer. Other such tasks are standard short-video understanding tasks subsampled to 16 frames. In all these cases, we follow PaLI-3 and encode each image separately, then concatenate the image tokens without any special separator or embedding tokens. Thus, 16 frames at 224px resolution result in 16 × 256 = 4096 image tokens.
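
To illustrate that scheme, here is a rough sketch (not the implementation in this PR; `vision_tower` and `encode_frames` are hypothetical names, and the 256-tokens-per-image figure assumes the 224px SigLIP encoder with patch size 14):

```python
import torch

def encode_frames(vision_tower, frames):
    """Encode each frame separately and concatenate the image tokens.

    frames: list of (3, 224, 224) image tensors. Each 224px image yields
    256 tokens (16 x 16 patches), so 16 frames give 16 * 256 = 4096 image
    tokens, with no separator tokens in between.
    """
    per_image = [vision_tower(frame.unsqueeze(0)) for frame in frames]  # each (1, 256, hidden)
    return torch.cat(per_image, dim=1)  # (1, num_frames * 256, hidden)
```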

Fixes #33113. Certain PaliGemma checkpoints are expected to support multiple images, according to the arXiv paper. This PR adds support for that in our code. We expect users to pass images as a nested batch when there is more than one image per prompt. For example:

```python
processor(text=[text1, text2], images=[[im1, im2], [im3]], padding=True, return_tensors="pt")
```
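
A fuller usage sketch under the same API (the checkpoint id is just one PaliGemma model; the images and prompts here are dummy placeholders):

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

# Dummy stand-ins; in practice these would be real images.
im1, im2, im3 = (Image.new("RGB", (224, 224)) for _ in range(3))
text1 = "answer en Which of the two pictures shows a snowman?"
text2 = "caption en"

# The first prompt gets two images, the second prompt gets one.
inputs = processor(text=[text1, text2], images=[[im1, im2], [im3]], padding=True, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(output, skip_special_tokens=True))
```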

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment

Thanks for adding!

One question and a handful of tiny nits.

```diff
@@ -308,7 +308,7 @@ def test_save_load_low_cpu_mem_usage_no_safetensors(self):

 @slow
 @require_torch
-@require_read_token
+# @require_read_token
```
@amyeroberts (Collaborator)

Should remove if not required

```diff
@@ -340,6 +340,32 @@ def test_small_model_integration_test(self):
     EXPECTED_DECODED_TEXT,
 )

 @slow
 # @require_read_token
```
@amyeroberts (Collaborator)

same here

@zucchini-nlp (Member, Author)

Yes, I wanted to leave it at the general class level and remove it from each test; it seems redundant otherwise.

```python
inputs = processor(text=prompt, images=[[snow_image, stop_sign_image]], return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=20)
EXPECTED_DECODED_TEXT = "answer en Which of the two pictures shows a snowman, first or second?\nFirst"
```
@amyeroberts (Collaborator)

it gets it wrong :(

@zucchini-nlp (Member, Author)

Gets it wrong in your local env, or am I misunderstanding something? In general it was hard to ask specific enough questions; it seems multi-image can't answer open-ended questions.

@molbap (Contributor)

Aren't these types of input/model combinations meh at enumeration/counting? Considering the short snippet from the paper, asking a combined question would have a better chance of being consistent/successful across different envs, maybe "what is the main difference between these two images?"

@zucchini-nlp (Member, Author)

Okay, we can change the test prompt; let me try some variations. For my last open-ended question, "What do these images have in common?", I got a garbage answer.

@zucchini-nlp (Member, Author)

What do you think of the one I added just now? I followed the NLVR2 format, where the dataset contains only true/false-style questions, and added two prompts to make sure we don't get the right answer by chance.
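
Roughly, the check looks like this (a sketch; the exact prompt strings in the test may differ, and `snow_image`/`stop_sign_image` come from the surrounding test setup):

```python
# Two prompts over the same image pair; requiring both answers to be
# correct guards against the model getting one right by chance.
prompts = [
    "answer en Which of the two pictures shows a snowman, first or second?",
    "answer en Which of the two pictures shows a stop sign, first or second?",
]
inputs = processor(
    text=prompts,
    images=[[snow_image, stop_sign_image], [snow_image, stop_sign_image]],
    padding=True,
    return_tensors="pt",
)
```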

@molbap (Contributor)

Sounds better to me!

src/transformers/models/paligemma/processing_paligemma.py (resolved)
Comment on lines 268 to 269

```python
elif isinstance(images, list) and is_valid_image(images[0]):
    # Flat list of images: wrap each image so that every sample in the
    # batch carries its own list of images.
    images = [[image] for image in images]
```
@amyeroberts (Collaborator)

For the processing: does this mean that `[image_0, image_1]` is interpreted as two images for a single sample in the minibatch, or as a batch size of two?

@zucchini-nlp (Member, Author)

Yes, in that case it will be treated as a batch of two images, and this would need two prompts. This is quite inconsistent with the LLaVA models, but PaliGemma doesn't use an image token in the input text, so we have no way of knowing how many images a user wants per prompt. Maybe I should add an example to the model doc page.
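
To make the two readings concrete (a sketch; variable names are illustrative):

```python
# A flat list is treated as a batch of two samples, one image each,
# so it needs two prompts:
batch_inputs = processor(text=[prompt_a, prompt_b], images=[image_0, image_1], padding=True, return_tensors="pt")

# A nested list is one sample that contains both images, with one prompt:
multi_image_inputs = processor(text=prompt_a, images=[[image_0, image_1]], return_tensors="pt")
```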

@amyeroberts (Collaborator)

Is there no way of making this so `[image_0, image_1]` would be equivalent to two images for a single prompt?

There are two reasons I think we should aim for this:

> but PaliGemma doesn't use an image token in the input text, so we have no way of knowing how many images a user wants per prompt

  1. This would provide structure such that the user can explicitly express how many images they want to use per sample.

  2. This would make the input structure consistent with other models, e.g. LLaVA and Idefics2. This is important if we want to be able to cross-load checkpoints in pipelines (one way to support both layouts is sketched below).
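
For concreteness, one way to accept both layouts during a deprecation period is to normalize flat inputs into the nested form, along these lines (a sketch, not the PR's exact code; `make_nested` is a hypothetical helper):

```python
from transformers.image_utils import is_valid_image

def make_nested(images):
    """Normalize `images` into a list of per-sample image lists."""
    if is_valid_image(images):
        return [[images]]  # a single image: one sample with one image
    if isinstance(images, list) and is_valid_image(images[0]):
        # Flat list: ambiguous without image tokens in the text, so keep
        # the old behavior of one image per sample (deprecating it over time).
        return [[image] for image in images]
    return images  # already nested: one inner list per sample
```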

@zucchini-nlp (Member, Author)

Yes, that is good for consistency and readability in general, but it would be a bit painful in terms of supporting the old behavior. Probably we'd need to support the old behavior for a long time, like until v5.0.

Let me see how it goes in code and whether it doesn't mess things up much.

@amyeroberts (Collaborator)

> Probably we'd need to support the old behavior for a long time, like until v5.0.

Why? PaliGemma was only very recently added, so it's not a long-standing model behaviour we need to preserve. We can do this in a deprecation cycle, imo.

> Let me see how it goes in code and whether it doesn't mess things up much.

OK!

@zucchini-nlp (Member, Author)

Oh, btw, I just discovered that Idefics and similar models in transformers enforce batched and nested images as input, and otherwise throw an error. Is that something we should worry about?

@zucchini-nlp (Member, Author)

@amyeroberts what do you say about this one? Pushed some updates.

One thing I didn't like is the BOS token, which should be added before all the text but after the image tokens. I figured that if we ask users to add the special image tokens, we can ask them to add the BOS token as well, which is what we have now.
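
For context, the layout under discussion is roughly the following (a sketch; `<image>` and the 256-token count match the 224px checkpoints, but `build_prompt` is an illustrative helper, not the processor's actual method):

```python
IMAGE_TOKEN = "<image>"
NUM_IMAGE_TOKENS = 256  # per image at 224px resolution

def build_prompt(text: str, num_images: int = 1) -> str:
    # Image placeholder tokens come first, then BOS, then the text,
    # then a trailing newline before generation starts.
    return IMAGE_TOKEN * (NUM_IMAGE_TOKENS * num_images) + "<bos>" + text + "\n"
```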

@amyeroberts (Collaborator)

Looks good 👌

@molbap (Contributor) left a comment

lgtm, just a couple of remarks on variable naming, especially in the tests :)

@amyeroberts (Collaborator) left a comment

Thanks for adding this support!

docs/source/en/model_doc/paligemma.md (three review threads, resolved)
@zucchini-nlp zucchini-nlp merged commit 3e039d3 into huggingface:main Sep 27, 2024
17 checks passed
BenjaminBossan pushed a commit to BenjaminBossan/transformers that referenced this pull request Sep 30, 2024
* upadte

* Update src/transformers/models/paligemma/processing_paligemma.py

Co-authored-by: amyeroberts <[email protected]>

* update docs

* better example in tests

* support image tokens

* read token

* Update tests/models/paligemma/test_processing_paligemma.py

Co-authored-by: Pablo Montalvo <[email protected]>

* nit: naming

* Update docs/source/en/model_doc/paligemma.md

Co-authored-by: amyeroberts <[email protected]>

* conflicts after rebasing

---------

Co-authored-by: amyeroberts <[email protected]>
Co-authored-by: Pablo Montalvo <[email protected]>
amyeroberts added a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
Successfully merging this pull request may close these issues:

Add multi image prompts to multimodal LLMs that support it (PaliGemma) (#33113)

4 participants