
Uniformize kwargs for LLaVa processor and update docs #32858

Merged

Conversation

yonigozlan (Member)

What does this PR do?

Adds uniformized processor kwargs following #31911 for LLaVa.
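
For context, here is a minimal sketch of the uniformized-kwargs pattern from #31911 as it applies to LLaVa: the processor declares a typed kwargs class with per-modality defaults and merges user kwargs against it in `__call__`. The defaults and method body below are illustrative, not the exact merged code.

```python
from typing_extensions import Unpack
from transformers.processing_utils import ProcessingKwargs, ProcessorMixin

class LlavaProcessorKwargs(ProcessingKwargs, total=False):
    # per-modality defaults; the exact values here are illustrative
    _defaults = {
        "text_kwargs": {"padding": False},
        "images_kwargs": {},
    }

class LlavaProcessor(ProcessorMixin):
    def __call__(self, images=None, text=None, **kwargs: Unpack[LlavaProcessorKwargs]):
        # merge user kwargs against the typed defaults, split per modality
        output_kwargs = self._merge_kwargs(
            LlavaProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )
        # images -> self.image_processor(images, **output_kwargs["images_kwargs"])
        # text   -> self.tokenizer(text, **output_kwargs["text_kwargs"])
```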

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@molbap @zucchini-nlp

@yonigozlan yonigozlan mentioned this pull request Aug 16, 2024
@yonigozlan yonigozlan marked this pull request as ready for review August 16, 2024 20:44
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment:

Thanks for working on this, LGTM!

Left one nit on the docstring, and we'd also need to swap the processor args order in test_modeling_llava.py.

# check if images and text inputs are reversed for BC
if (
    text is not None
    and not isinstance(text[0], str)
Contributor:

text could also just be a str

Member Author:

The check should still work in that case, except if we have an empty string. I will try to think of something cleaner.

@yonigozlan (Member Author) left a comment:

Hey @zucchini-nlp @leloykun! Here is a revised version of the backward-compatibility handling for reversed images and text inputs. The main addition is the `_check_reversed_images_text` function in `processing_utils.py`, which should better detect whether inputs need to be reversed, and can be used by all processors where the `images` and `text` inputs have been swapped.
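
For illustration, a minimal sketch of the idea behind such a check (`_is_valid_text` below is a simplified hypothetical helper, not the merged implementation):

```python
import warnings

def _is_valid_text(t):
    # simplified: a string, or a non-empty (possibly nested) list/tuple of strings
    if isinstance(t, str):
        return True
    if isinstance(t, (list, tuple)) and len(t) > 0:
        return all(_is_valid_text(el) for el in t)
    return False

def _check_reversed_images_text(images, text):
    # If `images` looks like text and `text` doesn't, the caller most likely
    # used the legacy (text, images) order; swap back and warn for BC.
    if _is_valid_text(images) and not _is_valid_text(text):
        warnings.warn(
            "You may have used the wrong order for inputs. `images` should be "
            "passed before `text`. The `images` and `text` inputs will be swapped."
        )
        return text, images
    return images, text
```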

@@ -120,7 +118,6 @@ class TextKwargs(TypedDict, total=False):
     return_offsets_mapping: Optional[bool]
     return_length: Optional[bool]
     verbose: Optional[bool]
-    padding_side: Optional[str]
Member Author:

I removed padding_side in this PR, as this should be one of the first uniformization PRs to get merged. This shouldn't break anything (see #32544 (comment)).

Collaborator:

I believe it will break things in the sense that people who were previously passing padding_side to the processor (even if it had no effect) would now experience an error?

Member Author:

In the case of llava at least, even with padding_side in the base TextKwargs, passing padding_side to a processor call will raise an error, as the tokenizer inherits from PreTrainedTokenizerFast, whose _batch_encode_plus accepts neither padding_side nor **kwargs:

def _batch_encode_plus(
    self,
    batch_text_or_text_pairs: Union[
        List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]
    ],
    add_special_tokens: bool = True,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: Optional[str] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    split_special_tokens: bool = False,
) -> BatchEncoding:

On that point, if we don't accept kwargs in this function, shouldn't we restrict the kwargs passed to the processor's modality-processor classes? I guess this would require a bit of refactoring of the base processor class, but right now, if a user passes a kwarg not supported by any ModalityKwarg class, it is forwarded as a "common" kwarg to all of the modality processors. If the call functions of the modality processors accept kwargs, the kwarg is simply ignored, but in the case of this _batch_encode_plus function (there might be others) it raises an error.
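
As an illustration of that restriction idea, kwargs could be filtered against the keys declared in each modality's TypedDict before being forwarded, so strict callables like `_batch_encode_plus` never see unknown keys. A hypothetical sketch, not current behavior (the TextKwargs below is a cut-down stand-in):

```python
from typing import Optional, TypedDict

class TextKwargs(TypedDict, total=False):
    padding: Optional[bool]
    max_length: Optional[int]
    # ... other declared text kwargs

def filter_modality_kwargs(modality_kwargs_cls, **kwargs):
    # keep only kwargs declared on the modality's TypedDict
    allowed = modality_kwargs_cls.__annotations__.keys()
    return {k: v for k, v in kwargs.items() if k in allowed}

text_kwargs = filter_modality_kwargs(TextKwargs, padding=True, bogus_kwarg=1)
assert text_kwargs == {"padding": True}  # unknown kwarg silently dropped
```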

Collaborator:

OK, I see. More generally, we might want to think about being able to control the padding side when calling the tokenizer, as this is something we'd like to be able to handle for e.g. llava and more generally, but that's for a wider discussion and future PRs cc @zucchini-nlp

In this case, if padding_side can't be accepted by fast tokenizers, then yes, I think we should remove it here. I'd do this in a separate PR, as it might affect other processors which use TextKwargs, and so it would be good to introduce it as an atomic change we can easily test, isolate, and revert if needed.

Member:

Seems like we are open to accepting padding_side as a kwarg in tokenizers (#30447), but I couldn't find a PR for that. I totally agree that it's a nice feature to have, and it looks easy to implement. I can work on it next week so that we don't have to drop the kwarg in processors. The only thing is that it will be kinda BC-breaking, because earlier we ignored the kwarg and now we'll use it to pad on the correct side. WDYT? @amyeroberts

Collaborator:

@zucchini-nlp Sounds good to me! Even if it's breaking, I think it's breaking in the right way: correcting a surprising behaviour

Member Author:

Added padding_side back to this PR. Waiting on padding_side being accepted as a kwarg in tokenizers before merging this PR, then. Could you ping this PR or me when you open a PR, @zucchini-nlp? Thanks!

Member:

Here it is: #33385
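
Once that lands, the kwarg can flow through a processor call like any other TextKwargs entry. A hedged usage sketch, assuming #33385 is merged (the checkpoint and prompt are just examples):

```python
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
inputs = processor(
    images=Image.new("RGB", (336, 336)),
    text="USER: <image>\nWhat is shown? ASSISTANT:",
    padding=True,
    padding_side="left",  # assumed per-call override enabled by #33385
    return_tensors="pt",
)
```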

@@ -830,6 +827,60 @@ class MyProcessingKwargs(ProcessingKwargs, CommonKwargs, TextKwargs, ImagesKwarg
            output_kwargs[modality].update(output_kwargs["common_kwargs"])
        return output_kwargs

def _check_reversed_images_text(self, images, text):
Member Author:

This function might be too problem-specific to be included in the base processor class, but since several VLMs are affected by this issue and it's only a temporary solution pending deprecation, I thought it might make sense to include it here. It seems more practical than having the function copied across multiple processors, and it should also simplify the deprecation process. Of course, if you think there's a better location for this or another approach that would be more suitable, I'm open to suggestions.

Member:

Hmm, to me it seems like it's better placed in each processor file with "Copied from" statements. I guess there are at most 5 or so models that need swapping, if I'm not wrong. For changes to the general processor file, @ amyeroberts can say more.

Member Author:

I just spoke with Amy, and a good middle-ground solution would be to keep this function in processing_utils but outside of any class, so that models which need it explicitly import it from processing_utils. This approach avoids cluttering the base Processor class with a very problem-specific function and limits the diffs when adding and deprecating this behavior.
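
Roughly, the agreed layout looks like this (a sketch; the helper name follows the commit list at the end of this thread, where it was ultimately called `_validate_images_text_input_order`):

```python
# processing_utils.py -- module level, outside ProcessorMixin
def _validate_images_text_input_order(images, text):
    # swap-and-warn logic as sketched earlier in this thread
    return images, text

# models/llava/processing_llava.py -- explicit import, e.g.:
# from ...processing_utils import _validate_images_text_input_order

class LlavaProcessor:
    def __call__(self, images=None, text=None, **kwargs):
        images, text = _validate_images_text_input_order(images, text)
        # ... proceed with the image processor and tokenizer
```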

Comment on lines 877 to 878
"You may have used the wrong order for inputs. `images` should be passed before `text`. "
"The `images` and `text` inputs will be swapped."
Member Author:

Should we include info about deprecating this behavior in a future version here?

Member:

Yes please, 2-3 major versions from current should be enough
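
For example, the warning could carry an explicit horizon (the version number below is illustrative only):

```python
import warnings

warnings.warn(
    "You may have used the wrong order for inputs. `images` should be passed "
    "before `text`. The `images` and `text` inputs will be swapped. "
    "This behavior will be deprecated in a future release (e.g. v4.47)."
)
```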


Comment on lines 850 to 853
elif isinstance(t[0], (list, tuple)):
    # ... list with an empty list or with a list of strings
    return len(t[0]) == 0 or isinstance(t[0][0], str)
else:
Member:

I don't think we work with nested lists of texts in processors; that's usually for tokenizers with text pairs. Also, is the empty list a valid input type?

What can be a valid type is an encoded text, which is missing here.

@yonigozlan (Member Author), Aug 20, 2024:

I took an existing _is_valid_text_input function that is defined in several tokenizers, but you are right, it should probably be adapted a bit for processors.
Some VLMs for object detection do use nested lists of text, such as Owlv2 or OmDet-Turbo, so it's probably better to keep this.
Encoded texts are lists of ints, right? I will add them. Although it seems we usually don't advertise EncodedInput as an acceptable input type for text in processors, so I don't know whether that is on purpose or not.
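
A hedged sketch of the adapted check under discussion: accept a plain string, an encoded text (list of token ids), a list of strings, or a nested list of strings (as used by e.g. Owlv2 or OmDet-Turbo):

```python
def _is_valid_text_input(t):
    if isinstance(t, str):
        # a single string
        return True
    elif isinstance(t, (list, tuple)):
        if len(t) == 0:
            # an empty list
            return True
        elif isinstance(t[0], (str, int)):
            # a list of strings, or an encoded text (list of token ids)
            return True
        elif isinstance(t[0], (list, tuple)):
            # a nested list: empty inner list or a list of strings
            return len(t[0]) == 0 or isinstance(t[0][0], str)
    return False
```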

Contributor:

We can also reuse the is_valid_image util.

@yonigozlan (Member Author), Aug 20, 2024:

Yep, I'm using valid_images, which recursively uses is_valid_image to check for nested images.
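
`is_valid_image` and `valid_images` live in `transformers.image_utils`; a rough sketch of how the reversal check can lean on them (the wrapper below is hypothetical):

```python
from transformers.image_utils import valid_images

def looks_like_images(inputs):
    # hypothetical guard: strings are never images; otherwise defer to
    # valid_images, which applies is_valid_image across nested lists
    if inputs is None or isinstance(inputs, str):
        return False
    return valid_images(inputs)
```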


@amyeroberts (Collaborator) left a comment:

Thanks for adding - looking good!

Main comment is to make sure there are tests for the images/text inversion logic.

src/transformers/models/llava/processing_llava.py (outdated, resolved)
src/transformers/processing_utils.py (outdated, resolved)
tests/models/llava/test_processor_llava.py (outdated, resolved)
tests/models/llava/test_processor_llava.py (outdated, resolved)

src/transformers/processing_utils.py (outdated, resolved)
src/transformers/processing_utils.py (outdated, resolved)
@@ -993,6 +990,59 @@ def apply_chat_template(
)


def _check_reversed_images_text_for_vlms(images, text):
Collaborator:

This behaviour should be tested

Member Author:

@amyeroberts since this function is in processing_utils, where would the best place be to test it? Or should we create model-specific tests for all models using it?

Collaborator:

If there isn't a module for testing utils in processing_utils then we should add one!

@amyeroberts (Collaborator) left a comment:

Looking good! Changes to llava all look OK to me. I have one question about the tests for valid text inputs.

Main comment is that we should split up the changes to llava and the addition of the verification of the input order for the processors.

Collaborator:

❤️

tests/utils/test_processing_utils.py (outdated, resolved)
Comment on lines 145 to 148
self.assertIsInstance(valid_images, torch.Tensor)
self.assertEqual(valid_text, text)
Collaborator:

This is a bit funny - why check the value of one and the instance type of another?

@yonigozlan (Member Author), Sep 3, 2024:

I wanted to use assert functions from unittest.TestCase, and assertEqual doesn't work with tensors or numpy arrays. Also, I thought that since only image inputs can be tensors/np arrays, if they are indeed tensors/np arrays, it means the switch (or not) went OK. But I agree it's not very consistent with the rest, and it doesn't check whether the function modified the inputs during the switch.
I could use self.assertTrue(torch.equal(...)) and self.assertTrue(np.array_equal(...)) instead?
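
A quick sketch of those value-based assertions (the inputs here are placeholders, not the actual test fixtures):

```python
import unittest
import numpy as np
import torch

class ProcessingUtilTester(unittest.TestCase):
    def test_inputs_compared_by_value(self):
        images = torch.rand(1, 3, 224, 224)
        images_np = np.random.rand(1, 3, 224, 224)
        text = ["lower newer"]
        # value comparisons that work for tensors and arrays
        self.assertTrue(torch.equal(images, images.clone()))
        self.assertTrue(np.array_equal(images_np, images_np.copy()))
        self.assertEqual(text, list(text))
```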

tests/utils/test_processing_utils.py (outdated, resolved)
src/transformers/processing_utils.py (outdated, resolved)

@yonigozlan force-pushed the uniformize-processors-kwargs-llava branch from 2a7016d to 4dc7ada on September 13, 2024.
@yonigozlan (Member Author):

Now that #33385 has been merged, this should be ready for review!

@amyeroberts (Collaborator) left a comment:

Looks great - thanks for making our processors nice and uniform!

@molbap (Contributor) left a comment:

Catching up on reviews - just saw @amyeroberts already approved, but I can add my opinion and approve as well 😁. Thanks a lot for working on this. I'll get back to the other pending PRs to help the effort as well.

@yonigozlan yonigozlan merged commit 2f62146 into huggingface:main Sep 16, 2024
17 checks passed
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
Uniformize kwargs for LLaVa processor and update docs (#32858)

* Uniformize kwargs for LlaVa and update docs

* Change order of processor inputs in docstring

* Improve BC support for reversed images and text inputs

* cleanup llava processor call docstring

* Add encoded inputs as valid text inputs in reverse input check, add deprecation version in warning

* Put function check reversed images text outside base processor class

* Refactor _validate_images_text_input_order

* Add ProcessingUtilTester

* fix processing and test_processing