Uniformize kwargs for image-text-to-text processors #32544
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
very nice! Jumping back to my own processors merging after that, let's go
Great work, thanks! Looks good overall, mainly concerned about not breaking BC for users. Left a few comments
```python
# Temporary fix for "padding_side" in init_kwargs
_ = self.tokenizer.init_kwargs.pop("padding_side", None)
```
Not very clear why we need this hack
It's related to the `AutoTokenizer` mapping, @yonigozlan can say a bit more :)
@zucchini-nlp From what I've seen, some tokenizers accept `padding_side` as a call argument, while others don't. But when you save weights and configs using a tokenizer loaded with `AutoTokenizer` and then reload them later, all possible init kwargs (including `padding_side`) get added to the tokenizer's `init_kwargs`, even if they weren't explicitly specified in the first place. So when merging the tokenizer's `init_kwargs` with the `output_kwargs`, if the tokenizer doesn't support `padding_side` in its call function, it will cause an error.
Hopefully that makes sense; it's still a bit unclear to me too, to be honest. :)
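To make the failure mode concrete, here is a minimal self-contained sketch of it; the function and dicts are hypothetical stand-ins mirroring the discussion, not the actual transformers internals:

```python
def tokenizer_call(text, padding=False, truncation=False):
    """Stand-in for a tokenizer whose call signature, like
    _batch_encode_plus, does not accept padding_side."""
    return {"input_ids": [101, 102], "padding": padding, "truncation": truncation}

# After a save/reload round-trip via AutoTokenizer, init_kwargs can contain
# every possible init-time option, including padding_side, even if the user
# never set it explicitly.
init_kwargs = {"padding_side": "right", "truncation": False}
call_kwargs = {"padding": True}

merged = {**init_kwargs, **call_kwargs}
try:
    tokenizer_call("hello", **merged)
except TypeError as err:
    print(f"merge fails: {err}")

# The temporary fix from the PR: drop padding_side before merging.
merged.pop("padding_side", None)
print(tokenizer_call("hello", **merged)["padding"])  # True
```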
Not sure I understand correctly. The basic `TextKwargs` have `padding_side`, so it seems like it should not cause errors and should assign a kwarg to be used later, when calling the tokenizer. If users don't pass anything, it will be the default kwarg from init time.
I guess the main problem is that `padding_side` is included in the basic `TextKwargs` while some tokenizer encode functions don't accept it as an argument, such as `_batch_encode_plus` for `PreTrainedTokenizerFast`:

transformers/src/transformers/tokenization_utils_fast.py, lines 489 to 510 in d6751d9:

```python
def _batch_encode_plus(
    self,
    batch_text_or_text_pairs: Union[
        List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]
    ],
    add_special_tokens: bool = True,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: Optional[str] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    split_special_tokens: bool = False,
) -> BatchEncoding:
```
So maybe it shouldn't be in TextKwargs at all? Do we have an example of a tokenizer that needs to set padding_side at call time rather than at init time? What do you think @molbap ?
That is correct, no `padding_side` is used at call time it seems; it might have been an oversight on my end to include it in the first place. We can check and make sure it is indeed not used, and if that's the case, removing it should be doable without breaking BC.
I didn't find any instances of `padding_side` being set at call time in Transformers, so I don't think removing it will break anything :).
I used this regex to search the library: `(?=.*\bpadding_side\b)(?=.*\bprocessor\b)\s*(.*\S.*)`, which looks for lines where "padding_side" and "processor" are both used (and ignores leading whitespace to avoid duplicate results).
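For reference, the lookahead regex can be exercised directly with Python's `re`; the sample lines below are hypothetical, just to show what it does and doesn't match:

```python
import re

# Lookahead regex from the comment: matches lines mentioning both
# "padding_side" and "processor", ignoring leading whitespace.
pattern = re.compile(r"(?=.*\bpadding_side\b)(?=.*\bprocessor\b)\s*(.*\S.*)")

lines = [
    "    processor(text, padding_side='left')",  # both words -> match
    "tokenizer.padding_side = 'left'",           # no 'processor' -> no match
    "processor(images=img)",                     # no 'padding_side' -> no match
]
hits = [line for line in lines if pattern.search(line)]
print(hits)  # only the first sample line survives
```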
Yes, afaik padding side currently can be set only as `tokenizer.padding_side = "left"`. Not sure if it will be used any time in the future as a call-time argument, so I am for removing it.
```diff
@@ -57,28 +90,10 @@ def __call__(
     self,
     images: Optional[ImageInput] = None,
     text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
-    text_pair: Optional[Union[PreTokenizedInput, List[PreTokenizedInput]]] = None,
```
IMO kwargs like `text_pair` do not belong in `TextKwargs`. Firstly, because it's not a kwarg used to change the way text is tokenized. Secondly, it might break BC because most users wouldn't explicitly pass `text_pair="My text"` (e.g. our example code https://huggingface.co/docs/transformers/en/model_doc/udop#transformers.UdopForConditionalGeneration).
The same might apply to `text_target` and `text_pair_target`. Maybe we should leave it as is? Also cc @molbap, would like to hear your opinion :)
Agree they don't change tokenization, however I think they belong there, just because they were present before and are part of the tokenizer signature. From the UDOP docs we say:

> Additionally, it also supports passing `text_target` and `text_pair_target` to the tokenizer, which can be used to prepare labels for language modeling tasks.

So even if it does not explicitly change the text, it's still a tokenizer option, so in terms of separation of concerns, for me it belongs here! (Mostly because it was there before.)
Yes, I can see how that would be a problem for backward compatibility. Maybe we should deprecate the use of `text_pair`, `text_target`, etc. as args and not kwargs? Especially since they are optional and other kwargs can be used without them (e.g. `inputs = processor(image, words, boxes=boxes, return_tensors="pt")` in the `UdopModel` doc). However I'm not sure how we could catch the use of too many args to provide a deprecation warning.
I'm 💯 for deprecating the usage here and not leaving these args here; as we really want a unified API, I don't want to create exceptions. Even if users might use it that way / use the previous version for a while, the end goal is that other libs can also use processors with a single API, just having to inspect types to understand what a processor does.
One way to catch the deprecated usage could be to simply check whether these args are present in the `kwargs` (and different from their default) instead of relying on length. You could also use `inspect` directly for that.
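A rough sketch of what that `inspect`-based check could look like; the helper names and the toy `processor_call` are hypothetical, not transformers code. The idea is to warn only when a deprecated kwarg differs from its default in the signature:

```python
import inspect
import warnings

DEPRECATED_ARGS = ("text_pair", "text_target", "text_pair_target")

def warn_deprecated(func, received):
    """Warn for deprecated kwargs whose received value differs from the signature default."""
    sig = inspect.signature(func)
    for name in DEPRECATED_ARGS:
        if name in sig.parameters and received.get(name) != sig.parameters[name].default:
            warnings.warn(f"Passing `{name}` to the processor is deprecated.", FutureWarning)

def processor_call(images=None, text=None, text_pair=None):
    warn_deprecated(processor_call, {"text_pair": text_pair})
    return {"text": text, "text_pair": text_pair}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    processor_call(text="hello", text_pair="world")
print(len(caught))  # one FutureWarning, since text_pair differs from its default
```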
I don't feel strongly about it; we can keep it in `TextKwargs` as long as we don't break BC.
I've done this, which is a bit hacky but should preserve BC:

transformers/src/transformers/models/udop/processing_udop.py, lines 117 to 123 in 6d7e086:

```python
if "text_pair" not in output_kwargs["text_kwargs"]:
    warnings.warn(
        "No `text_pair` kwarg was detected. The use of `text_pair` as an argument without specifying it "
        "explicitly as `text_pair=` will be deprecated in future versions."
    )
# for BC
if audio is not None:
    output_kwargs["text_kwargs"]["text_pair"] = audio
```
Cool, let's do it with `logger.warning_once` and move the warning below, so that users see it only if they pass `text_pair` without indicating `text_pair=my_text`.
Honestly not a big fan of this - we shouldn't be using kwargs for a hack for which they are not advertised. Will take a look this afternoon and try to suggest something
Well, I'm kind of out of ideas; I'll trust you on finding something clean. At worst we can do as you say, but add a deprecation cycle for a few versions later. The closest I could find that does modify the signature is simply capturing all extra args, like so:

```python
def __call__(
    self,
    images: Optional[ImageInput] = None,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    *args,
    audio=None,
    videos=None,
    **kwargs: Unpack[UdopProcessorKwargs],
) -> BatchFeature:
    """
    This method first forwards the `images` argument to [`~UdopImageProcessor.__call__`]. In case
    [`UdopImageProcessor`] was initialized with `apply_ocr` set to `True`, it passes the obtained words and
    bounding boxes along with the additional arguments to [`~UdopTokenizer.__call__`] and returns the output,
    together with the prepared `pixel_values`. In case [`UdopImageProcessor`] was initialized with `apply_ocr`
    set to `False`, it passes the words (`text`/`text_pair`) and `boxes` specified by the user along with the
    additional arguments to [`~UdopTokenizer.__call__`] and returns the output, together with the prepared
    `pixel_values`.

    Alternatively, one can pass `text_target` and `text_pair_target` to prepare the targets of UDOP.

    Please refer to the docstring of the above two methods for more information.
    """
    # verify input
    output_kwargs = self._merge_kwargs(
        UdopProcessorKwargs,
        tokenizer_init_kwargs=self.tokenizer.init_kwargs,
        **kwargs,
    )

    # for BC, handle unexpected positional arguments
    if len(args) > 0:
        logger.warning_once(
            "Received unexpected positional arguments. These will be mapped to `text_pair`."
        )
        if len(args) == 1:
            # if there's one extra positional argument, assume it's `text_pair` for backward compatibility
            output_kwargs["text_kwargs"]["text_pair"] = args[0]
```

which feels a bit less hacky, if it indeed works. It would also allow us to not add extra placeholder args when we have more than one extra arg to take care of, as in this cool work: #32180
However I will take my annual holidays soon so I won't be able to decide more on that, I'll trust you to move on with something incredible anyways as you've done amazing work already @yonigozlan @zucchini-nlp 💜
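Stripped down to its essence, the `*args` capture idea runs like this; `processor_call` is a stand-in function, not the real `UdopProcessor.__call__`:

```python
import warnings

def processor_call(images=None, text=None, *args, **kwargs):
    text_kwargs = dict(kwargs)
    # for BC: a bare third positional argument used to be text_pair
    if len(args) == 1:
        warnings.warn(
            "Passing `text_pair` positionally is deprecated; use `text_pair=...`.",
            FutureWarning,
        )
        text_kwargs["text_pair"] = args[0]
    return text_kwargs

# Old positional call style still works (with a warning)...
print(processor_call(None, ["word"], ["pair"])["text_pair"])
# ...and the explicit keyword style is the forward-compatible one.
print(processor_call(None, ["word"], text_pair=["pair"])["text_pair"])
```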
Btw, I just realized we are swapping the input args order: it was text-first and now it will be image-first. AFAIK most people are used to passing text and then image in LLaVA models, without indicating the arg name. UPDATE: our slow tests for llava (not sure about others) also don't follow the new order, so we should update them.
So this PR is getting bigger than I anticipated :). I think it's close to ready to be merged so I re-requested reviews, but maybe I should break it up into smaller PRs first? cc @zucchini-nlp @molbap
Great work, LGTM!
```python
PreTrainedTokenizerBase,
PreTrainedTokenizerFast,
UdopProcessor,
```
`UdopProcessor` is imported below if pytesseract is available, so IMO we don't need to add it here.
I removed it from the pytesseract check instead, as there is a strange bug where the line below the class definition (`processor_class = UdopProcessor`) will still be executed even if the "requires" are not satisfied, which makes the CI break.
Does that mean UDOP has no dependency on pytesseract to run the processor test and will run successfully?
Only the image_processor (`LayoutLMv3ImageProcessor`) depends on pytesseract, but since the import check is already done at the level of `LayoutLMv3ImageProcessor`, it doesn't seem to me that it should also be done when importing the processor. Though I'm not sure how these nested requirement checks should be dealt with.
I see, that's weird because the Tester class has a `require_pytesseract` dependency, which afaik is the same as `is_pytesseract_available()`.
Actually, only importing the processor has no dependencies; from what I see, it doesn't use pytesseract directly. So it should be ok to import it as is, and the tests should be skipped by `require_pytesseract` if the package is not installed. I'm just curious why that broke, if you have bandwidth to explore it. I couldn't reproduce it by removing `UdopProcessor` from the general imports.
```diff
 def model_input_names(self):
-    return ["input_ids", "bbox", "attention_mask", "pixel_values"]
+    return ["pixel_values", "input_ids", "attention_mask", "bbox"]
```
just noted: why change the order here?
I changed the returned object of `UdopProcessor` from a `BatchEncoding` to a `BatchFeature` by updating the encoded images with the encoded text, and not the other way around, which changed the order of the output keys.
Removed Udop from this PR as it has some specific args to handle, so waiting on #33479 to be merged before opening another PR for it.
Really nice work ❤️
Great to see the combined efforts to make a clean processor interface being propagated to clean up the codebase 🧹
Just a few small comments; the main ones are about the commented-out code.
```python
# @require_vision
# @require_torch
# def test_tokenizer_defaults_preserved_by_kwargs(self):
```
To uncomment?
Yes, they can even be removed; forgot to do it, thanks.
```python
# @require_vision
# @require_torch
# def test_kwargs_overrides_default_tokenizer_kwargs(self):
```
Same here?
```diff
@@ -179,261 +179,3 @@ def test_model_input_names(self):
         list(inputs.keys()),
         ["input_ids", "attention_mask", "qformer_input_ids", "qformer_attention_mask", "pixel_values"],
     )
```
So much code deletion 🤩
```python
# Temporary fix for "padding_side" in init_kwargs
_ = output_kwargs["text_kwargs"].pop("padding_side", None)
```
Is this still needed? I can't remember the state of the solution for this
No it shouldn't be needed anymore! Thanks for catching that :)
```diff
-    return_tensors=return_tensors if images is None else None,
-    **kwargs,
+    output_kwargs["text_kwargs"]["add_special_tokens"] = (
+        output_kwargs["text_kwargs"]["add_special_tokens"] and add_eos_token
```
I know this is matching the logic above but this seems like it would produce some very surprising behaviour 👀 (not a comment saying you should change things, just noting)
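The surprising part is that the merge ANDs the user's kwarg with `add_eos_token`, so an explicit `add_special_tokens=True` can be silently downgraded. A minimal sketch of just that boolean logic (the helper name is made up for illustration):

```python
def resolve_add_special_tokens(user_value: bool, add_eos_token: bool) -> bool:
    # Mirrors: output_kwargs["text_kwargs"]["add_special_tokens"] =
    #          output_kwargs["text_kwargs"]["add_special_tokens"] and add_eos_token
    return user_value and add_eos_token

print(resolve_add_special_tokens(True, True))   # True
print(resolve_add_special_tokens(True, False))  # False: user's explicit True is overridden
print(resolve_add_special_tokens(False, True))  # False
```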
```python
    images=[lowres_img, cats_image], text=[self.prompt, self.prompt], return_tensors="pt", padding=True
).to(torch_device)

model.train()
```
Why set to training mode here? Is there an assertion on right padding because of this?
Not sure what happened here as I don't think I've made those changes 😅, maybe the rebase went wrong at some point. I will remove all that.
```python
    images=[lowres_img, cats_image], text=[self.prompt, self.prompt], return_tensors="pt", padding=True
).to(torch_device)

model.train()
```
same q here about forcing into training mode
same as above
What does this PR do?
Adds uniformized processor kwargs following #31911 for the following image-text-to-text models:
I will open a separate PR for Idefics/2 as their processors are quite different from the others.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@molbap @zucchini-nlp @amyeroberts