
Uniform kwargs for processors #31911

Open · 36 of 40 tasks
zucchini-nlp opened this issue Jul 11, 2024 · 44 comments
Labels: contributions-welcome, Good Second Issue

Comments

@zucchini-nlp (Member) commented Jul 11, 2024

Feature request

We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted.
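As a rough illustration of the idea (not the exact classes merged in #31198; the field names below are illustrative examples only), the per-modality TypedDicts look something like this:

```python
# A minimal sketch of the per-modality TypedDict design. The real classes
# live in transformers' processing utilities and accept many more keys;
# the names follow the merged design, but the fields shown are illustrative.
from typing import Optional, Union

from typing_extensions import TypedDict, Unpack


class TextKwargs(TypedDict, total=False):
    padding: Union[bool, str]
    truncation: Union[bool, str]
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    do_resize: Optional[bool]
    do_rescale: Optional[bool]


class ProcessingKwargs(TypedDict, total=False):
    # Kwargs are grouped per modality so each sub-component (tokenizer,
    # image processor, ...) only receives the keys it understands.
    text_kwargs: TextKwargs
    images_kwargs: ImagesKwargs


class MyProcessor:
    def __call__(self, images=None, text=None, **kwargs: Unpack[ProcessingKwargs]):
        # Unpack documents which kwargs the processor accepts, so tooling
        # can flag unknown kwargs at the call site.
        ...
```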

The initial design has been merged, and an example model was modified to follow the new uniform processor kwargs in #31198. #31197 has two more examples with the standardized API.

This design has to be shipped to all processors in Transformers, and we appreciate contributions.
Below is an incomplete list of models that need standardization; feel free to add a model if it's missing:

Note: for now we'll start with image and image+text models; #31368 is an ongoing PR that also covers audio processor standardization.


zucchini-nlp added the Good Second Issue and contributions-welcome labels on Jul 11, 2024
@zucchini-nlp (Member, Author)

cc @molbap @NielsRogge. I added only the models I see most often to the list, plus all VLMs, to unblock the pipeline.

@davidgxue (Contributor)

I can take CLIP and LLaVa

@zucchini-nlp (Member, Author)

@davidgxue okay, feel free to open a PR when it's ready.

@OmarManzoor (Contributor)

I would like to work on BLIP-2. Just to clarify: do we only need to change BLIP-2, not BLIP? I ask because there is a comment that mentions:

```python
# Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__
```

@zucchini-nlp (Member, Author)

@OmarManzoor my bad, I forgot to add BLIP to the list. You can work on BLIP, and all changes from BLIP will be ported to BLIP-2 automatically :)

I'll add BLIP to the list and assign it to you then.

@molbap (Contributor) commented Jul 15, 2024

@OmarManzoor @zucchini-nlp I missed this; I had already started work on a few models. Please check the original PRs, #31198 and #31368; BLIP, BLIP-2, Donut, and a couple more are already handled.

@OmarManzoor (Contributor)

> Please check the original PRs, #31198 and #31368; BLIP, BLIP-2, Donut, and a couple more are already handled.

Thank you for clarifying.

@zucchini-nlp (Member, Author)

@leloykun this is one of the trackers we have to start with. There's another PR for standardizing VLMs from the generation perspective. Unfortunately, other tasks will be blocked by these.

If you want to work on this task, or maybe on making a wrapper for VLMTokenizer, let me know!

@leloykun (Contributor)

thanks @zucchini-nlp!

I can take LayoutLM (1, 2, 3) & Chameleon

@bhuvanmdev (Contributor) commented Jul 28, 2024

I can take OWLv2 and ViT. But for OWLv2 there are multiple helper functions and classes copied from OWL-ViT, so does that mean I need to work on OWL-ViT?

```python
# Copied from transformers.models.owlvit.processing_owlvit.OwlViTProcessor.__call__ with OWLViT->OWLv2
```

@MnCSSJ4x commented Jul 28, 2024

Can I take DINO and PaliGemma if no one's working on them?

@zucchini-nlp (Member, Author)

@bhuvanmdev yes, if the OWLv2 processing code is identical to OWL-ViT's, it will simply be copied from it

@MnCSSJ4x sure

@MnCSSJ4x commented Jul 30, 2024

@zucchini-nlp I started working on PaliGemma and tried to follow the PRs mentioned here. PaliGemma has no test file for the processor. Do I need to add those tests (if so, please point me to how I can do that), and can they be used directly to check that the changes are non-breaking? I can raise a temporary PR so that we can discuss it there.

@zucchini-nlp (Member, Author)

@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file with ProcessorTesterMixin so that the new changes are all tested.
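For reference, a minimal sketch of what such a test file could look like, modeled on existing processor tests; the checkpoint name and setup details below are illustrative, not a prescribed implementation:

```python
# A minimal sketch of a processor test file using ProcessorTesterMixin,
# modeled on existing tests such as tests/models/llava/test_processor_llava.py.
# This file would live under tests/models/paligemma/; the checkpoint name
# below is illustrative.
import tempfile
import unittest

from transformers import PaliGemmaProcessor

from ...test_processing_common import ProcessorTesterMixin


class PaliGemmaProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    processor_class = PaliGemmaProcessor

    def setUp(self):
        # The mixin's generic tests reload the processor's components
        # (tokenizer, image processor, ...) from this directory.
        self.tmpdirname = tempfile.mkdtemp()
        processor = self.processor_class.from_pretrained("google/paligemma-3b-pt-224")
        processor.save_pretrained(self.tmpdirname)
```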

@MnCSSJ4x commented Aug 1, 2024

> @MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file with ProcessorTesterMixin so that the new changes are all tested.

Thanks, I have created PR #32377 and tagged you there. Please let me know how I can get started on testing it.

leloykun mentioned this issue on Aug 6, 2024
@yonigozlan (Member)

I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT), as I have already been working on their processors for #32471.

@zucchini-nlp (Member, Author)

@yonigozlan Thanks, that would be great!

@leloykun (Contributor) commented Aug 16, 2024

Since nobody has claimed them yet, dibs on Nougat and SigLip.

They should now be ready for review, and they already have backwards-compatibility support.

@leloykun (Contributor)

Btw, DepthAnything, DINO, MaskFormer, & ViT don't have processors.

@leloykun (Contributor)

@zucchini-nlp here's the PR for the rest of the processors: #32845

@zucchini-nlp (Member, Author)

@leloykun WOW, thanks a lot! I can review those on Monday; today will be a bit busy.

@leloykun (Contributor)

Thanks too!

For now, the PR for Chameleon, #32181, is the safest to merge, as (1) the processor doesn't expect special args (e.g. text_pair and such) and (2) the PR already has tests.

The PRs for Nougat (#32841) and the LayoutLM models (#32180) need more thought, as their processors expect special args (cc @yonigozlan; I think @molbap is right that our current implementation is kinda wonky).

The other PRs don't have tests yet.

@yonigozlan (Member)

No problem @davidgxue! I will get started on LLaVa then

@leloykun (Contributor) commented Aug 16, 2024

Summary of my progress:

| Model | Status | Has Tests? | Special Args | PR |
|---|---|---|---|---|
| Chameleon | Ready | Yes | - | #32181 |
| AltCLIP | Ready | Yes | - | #32845 |
| Flava | Ready | Yes | - | #32845 |
| Git | Ready | Yes | - | #32845 |
| InstructBlipVideo | Ready | Yes | - | #32845 |
| LLaVa-NeXT-Video | Ready | Yes | - | #32845 |
| MGP | Ready | Yes | - | #32845 |
| Siglip | Ready | Yes | - | #32845 |
| TVP | Ready | Yes | - | #32845 |
| VideoLLaVa | Ready | Yes | - | #32845 |
| VILT | Ready | Yes | - | #32845 |
| X-CLIP | Ready | Yes | - | #32845 |
| LayoutLMv2 | Ready | Yes | text_pair, boxes, word_labels | #32180 |
| LayoutLMv3 | Ready | Yes | text_pair, boxes, word_labels | #32180 |
| LayoutXLM | Ready | Yes | text_pair, boxes, word_labels | #32180 |
| ClipSeg | Ready | Yes | visual_prompt | #32841 |
| Nougat | Ready | Yes | text_pair, text_target, text_pair_target (apparently these are in the tokenizer base class) | #32841 |
| OwlV2 | Ready | Yes | query_images | #32841 |
| OwlVIT | Ready | Yes | query_images | #32841 |
| Clap | Not Ready | No | ?? | #32906 |
| CLVP | Not Ready | No | ?? | #32906 |
| MusicGen Melody | Not Ready | No | ?? | #32906 |
| PopPiano | Not Ready | No | ?? | #32906 |
| Qwen2 Audio | Not Ready | No | ?? | #32906 |
| Seamless M4T | Not Ready | No | ?? | #32906 |
| SpeechT5 | Ready | Yes | - | #32906 |
| Wav2Vec2 Bert | Ready | Yes | - | #32906 |

#32181 & #32845 are now ready for review. I could also decouple InstructBlipVideo, LLaVa-NeXT-Video, Siglip, and VideoLLaVa to a separate PR just to get them out--just lemme know if there's a need to rush.

The rest have special args that we still have to figure out how to handle. cc @yonigozlan

Update: all of the PRs are now ready for review

@leloykun (Contributor) commented Aug 16, 2024

Processors with weird output keys:

| Model | weird key -> expected key |
|---|---|
| LayoutLMv2 | image -> pixel_values |
| LayoutXLM | image -> pixel_values |
| MGP | labels -> input_ids |
| Nougat | labels -> input_ids |
| TVP | pixel_values -> pixel_values_videos |
| VideoLlava | pixel_values_images -> pixel_values |
| X-Clip | pixel_values -> pixel_values_videos |

@molbap @zucchini-nlp what do you think of standardizing/uniformizing these too?

@leloykun (Contributor) commented Aug 16, 2024

Models missing from this list:

- AltClip
- Flava
- GIT
- MGP
- OneFormer (special args: task_inputs & segmentation_maps; has some weird logic with task_inputs)
- SAM (special args: segmentation_maps, input_points, input_labels, & input_boxes)
- TrOCR (has some weird _in_target_context_manager logic; is this deprecated?)
- TVP
- VisionTextDualEncoder
- ViLT
- X-CLIP

The rest either don't have an image processor or only have a feature_extractor_class.

@leloykun (Contributor)

@molbap @yonigozlan I want to raise this here so our implementations will be more consistent:

I've implemented a saner way to handle special processor call args for backwards compatibility that doesn't involve re-using unrelated args (e.g. audio & video) and doesn't need extra arguments like backwards_compatibility_placeholder_arg.

Tl;dr: I followed @molbap's advice to capture them using *args instead, and added prepare_and_validate_optional_call_args to auto-convert them to kwargs, which we can then pass to _merge_kwargs along with the other arguments.

See my implementations in #32841 and #32180.


For everyone else: the "special arguments" here are arguments that carry data from user input but aren't named text, images, audio, or videos, and aren't configs for the tokenizer, image processor, etc. For example, the bounding boxes for the LayoutLM* models, the visual prompt for ClipSeg, etc.

The problem with these args is that some users pass them as positional arguments to the processors. So if we want to restrict the processor call arguments to only those four plus kwargs, we're going to have problems handling these special arguments.


Now, we only need to:

1. Add the ModelProcessorKwargs class (same as before)
2. Add optional_call_args = [...] as an attribute to the processor class
3. Add *args to the call signature where the special arguments were before (e.g. for the LayoutLM* models, right after text and images)
4. Add **self.prepare_and_validate_optional_call_args(*args) as an argument to self._merge_kwargs, e.g.:

```python
output_kwargs = self._merge_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    **kwargs,
    **self.prepare_and_validate_optional_call_args(*args),
)
```
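For reference, a rough sketch of what prepare_and_validate_optional_call_args could do under this scheme; the exact warning text and error handling in #32841/#32180 may differ:

```python
# Sketch only: maps leftover positional args onto the names declared in
# optional_call_args and warns that positional use is deprecated.
import warnings


class ProcessorSketch:
    # e.g. ["visual_prompt"] for ClipSeg (illustrative)
    optional_call_args: list = []

    def prepare_and_validate_optional_call_args(self, *args):
        if args:
            warnings.warn(
                "Passing the optional processor arguments positionally is "
                "deprecated; please pass them as keyword arguments instead.",
                FutureWarning,
            )
        if len(args) > len(self.optional_call_args):
            raise ValueError(
                f"Expected at most {len(self.optional_call_args)} optional "
                f"positional arguments, got {len(args)}."
            )
        # zip() stops at the shorter sequence, so omitted trailing args
        # simply don't appear in the returned kwargs dict.
        return dict(zip(self.optional_call_args, args))
```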

Alternatively, I could move prepare_and_validate_optional_call_args to _merge_kwargs and perhaps rename it to _merge_args_and_kwargs. This way, the interface would be something like this instead:

```python
output_kwargs = self._merge_args_and_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    *args,
    **kwargs,
)
```

Lemme know what you think

@yonigozlan (Member)

That looks great to me @leloykun, thanks a lot for working on that! I personally like it as you've done it in #32841 and #32180, with self.prepare_and_validate_optional_call_args(*args), as it makes it easy to remove once this behavior is fully deprecated. For that reason I don't think we should replace _merge_kwargs with a _merge_args_and_kwargs.
If others agree, I will add this to Udop and the other models I have been working on, if they need it.

@zucchini-nlp (Member, Author)

For me it looks like a good workaround, and it should handle BC correctly. I left comments in the corresponding PRs, and I agree about renaming _merge_kwargs.

I'll leave some general questions/comments here, as there are so many PRs currently that I might miss some of them. So, as part of the standardization we have to:

- use the image, text order in processors. I see confusion around why we're swapping these. Since I wasn't working very closely on the standardization, my guess is that it's the common pattern in most processors and might help with pipelines. @yonigozlan, chime in if you have any ideas, as you've been working closely with @molbap
- IMO we have to return BatchFeature and not BatchEncoding in the processor's call, as was done in some PRs already (see the sketch after this list). I don't see any reason to prefer BatchEncoding, so this shouldn't cause issues
- use the above proposed way to deprecate extra positional args, and pick a major version after which positional args won't be supported; that can be v4.47
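On the BatchFeature point, a minimal sketch of the intended return, assuming text_inputs and image_inputs are the dicts produced by the tokenizer and image processor (the helper name is hypothetical):

```python
# Sketch: merge per-modality outputs into a single BatchFeature, which
# behaves like a dict with tensor conveniences (e.g. .to(device)) and,
# unlike BatchEncoding, isn't tied to tokenizer-only semantics.
from transformers import BatchFeature


def merge_outputs(text_inputs: dict, image_inputs: dict) -> BatchFeature:
    return BatchFeature(data={**text_inputs, **image_inputs})
```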

@leloykun (Contributor)

@yonigozlan @molbap I don't see any good reason why we should swap the args. It just adds complications with backwards compatibility. And besides, we can just pass them as keyword arguments in the pipelines.

@yonigozlan (Member)

To me, swapping the args is part of the effort to standardize the processors. Right now, we have some VLMs where the processor takes images first and others where it takes text first. Even if we discourage using positional arguments for processor calls, I imagine most users will still use them for images and text. Having to guess which comes first depending on the model doesn't seem ideal.

That said, since the argument swapping isn't a blocker for the image-text-to-text pipeline, I suggest we remove it for now to get this merged as soon as possible. The backward compatibility is indeed tricky to manage, and the current checks feel a bit too hacky for a merge. We can open a separate PR later to focus on argument swapping if we decide it's necessary.

How does that sound, @zucchini-nlp, @leloykun?

@leloykun (Contributor) commented Aug 19, 2024

@yonigozlan yup, that sounds good.

btw, @yonigozlan @zucchini-nlp @molbap what do you think of forcing the use of keyword arguments in future versions? I.e., having this signature:

```python
def __call__(
    self,
    *,
    text: ...,
    images: ...,
    audio: ...,
    videos: ...,
    **kwargs: Unpack[...],
) -> BatchFeature:
    ...
```

@amyeroberts (Collaborator)

@leloykun This is something we could consider for e.g. v5 of transformers but I wouldn't enforce it at the moment as we're going through minor versions: this would break a lot of code for a lot of people.

@zucchini-nlp (Member, Author)

Afaik the order is swapped only in some VLMs, and we want to follow the image, text order in the end. Indeed there is a lot of BC handling happening, but swapping positions doesn't seem very hard to handle. Also, it won't flood the console with warnings, because processors with swapped order usually have no other cases to handle.

So, I am for changing the order of args as part of standardization, and addressing any comments to make the checks more reliable. If you are more comfortable with a separate PR, I'm okay with separating out the VLMs from the current PR and opening a new one.

What I suggest for faster iteration is to first review and merge one model; for example, LLaVa has a PR of its own now. After everyone is happy with it, we can merge it and copy the changes to the other models. Same for handling the "special positional args"; I guess @leloykun had a PR with one model only.

@yonigozlan (Member)

@zucchini-nlp Sounds good! I will probably do a separate PR for each model that needs the argument switch, just because it adds a lot of noisy changes in tests, docstrings etc. I will also work on finding a better way to support BC and implement it first in the LLaVa PR. Will ping you when that's done :)

@leloykun (Contributor)

I've also started uniformizing the kwargs of the audio-text model processors here: #32906

It's still a draft on top of #32845, but SpeechT5 & Wav2Vec2 Bert should be done now.

@leloykun (Contributor)

hmmm... now that I think about it... why is it audio instead of audios?

@molbap (Contributor) commented Sep 18, 2024

@leloykun I chose audio instead of audios because the latter, despite being a valid English word, is barely used. videos is valid because it is a shortcut for videotapes; so videos is the countable form, but there isn't much sense in having a countable form for audio, so the uncountable form audio is kept.
