Uniformize kwargs for image-text-to-text processors #32544
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
very nice! Jumping back to my own processors merging after that, let's go
Great work, thanks! Looks good overall, mainly concerned about not breaking BC for users. Left a few comments
```python
# Temporary fix for "padding_side" in init_kwargs
_ = self.tokenizer.init_kwargs.pop("padding_side", None)
```
Not very clear why we need this hack
It's related to the `AutoTokenizer` mapping, @yonigozlan can say a bit more :)
@zucchini-nlp From what I've seen, some tokenizers accept `padding_side` as a call argument, while others don't. But when you save weights and configs using a tokenizer loaded with `AutoTokenizer` and then reload them later, all possible init kwargs (including `padding_side`) get added to the tokenizer's `init_kwargs`, even if they weren't explicitly specified in the first place. So when merging the tokenizer's `init_kwargs` with the `output_kwargs`, if the tokenizer doesn't support `padding_side` in its call function, it will cause an error.
Hopefully that makes sense; it's still a bit unclear to me too, to be honest. :)
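To make the failure mode concrete, here is a minimal self-contained sketch of it; the function and dicts are hypothetical stand-ins mirroring the discussion, not the actual transformers internals:

```python
def tokenizer_call(text, padding=False, truncation=False):
    """Stand-in for a tokenizer whose call signature, like
    _batch_encode_plus, does not accept padding_side."""
    return {"input_ids": [101, 102], "padding": padding, "truncation": truncation}

# After a save/reload round-trip via AutoTokenizer, init_kwargs can contain
# every possible init-time option, including padding_side, even if the user
# never set it explicitly.
init_kwargs = {"padding_side": "right", "truncation": False}
call_kwargs = {"padding": True}

merged = {**init_kwargs, **call_kwargs}
try:
    tokenizer_call("hello", **merged)
except TypeError as err:
    print(f"merge fails: {err}")

# The temporary fix from the PR: drop padding_side before merging.
merged.pop("padding_side", None)
print(tokenizer_call("hello", **merged)["padding"])  # True
```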
Not sure I understand correctly. The basic `TextKwargs` have `padding_side`, so it seems like it should not cause errors and should assign a kwarg to be used later, when calling the tokenizer. If users don't pass anything, it will be the default kwarg from init time.
I guess the main problem is that `padding_side` is included in the basic `TextKwargs` while some tokenizer encode functions don't accept it as an argument, such as `_batch_encode_plus` for `PreTrainedTokenizerFast`:

transformers/src/transformers/tokenization_utils_fast.py, lines 489 to 510 in d6751d9:

```python
def _batch_encode_plus(
    self,
    batch_text_or_text_pairs: Union[
        List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]
    ],
    add_special_tokens: bool = True,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: Optional[str] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    split_special_tokens: bool = False,
) -> BatchEncoding:
```
So maybe it shouldn't be in TextKwargs at all? Do we have an example of a tokenizer that needs to set padding_side at call time rather than at init time? What do you think @molbap ?
That is correct, no `padding_side` is used at call time it seems; it might have been an oversight on my end to include it in the first place. We can check and make sure it is indeed not used, and if that's the case, removing it should be doable without breaking BC.
I didn't find any instances of `padding_side` being set at call time in Transformers, so I don't think removing it will break anything :).
I used this regex to search the library: `(?=.*\bpadding_side\b)(?=.*\bprocessor\b)\s*(.*\S.*)`, which looks for lines where "padding_side" and "processor" are both used (and ignores leading whitespace to avoid duplicate results).
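For reference, the lookahead regex can be exercised directly with Python's `re`; the sample lines below are hypothetical, just to show what it does and doesn't match:

```python
import re

# Lookahead regex from the comment: matches lines mentioning both
# "padding_side" and "processor", ignoring leading whitespace.
pattern = re.compile(r"(?=.*\bpadding_side\b)(?=.*\bprocessor\b)\s*(.*\S.*)")

lines = [
    "    processor(text, padding_side='left')",  # both words -> match
    "tokenizer.padding_side = 'left'",           # no 'processor' -> no match
    "processor(images=img)",                     # no 'padding_side' -> no match
]
hits = [line for line in lines if pattern.search(line)]
print(hits)  # only the first sample line survives
```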
Yes, afaik padding side currently can be set only as `tokenizer.padding_side = "left"`. Not sure if it will be used any time in the future as a call-time argument, so I am for removing it.
```diff
@@ -57,28 +90,10 @@ def __call__(
     self,
     images: Optional[ImageInput] = None,
     text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
-    text_pair: Optional[Union[PreTokenizedInput, List[PreTokenizedInput]]] = None,
```
IMO kwargs like `text_pair` do not belong in `TextKwargs`. Firstly, because it's not a kwarg used to change the way text is tokenized. Secondly, it might break BC because most users wouldn't explicitly pass `text_pair="My text"` (e.g. our example code https://huggingface.co/docs/transformers/en/model_doc/udop#transformers.UdopForConditionalGeneration).
The same might apply to `text_target` and `text_pair_target`. Maybe we should leave it as is? Also cc @molbap, would like to hear your opinion :)
Agree they don't change tokenization, however I think they belong there, just because they were present before and are part of the tokenizer signature. From the UDOP docs we say:

> Additionally, it also supports passing `text_target` and `text_pair_target` to the tokenizer, which can be used to prepare labels for language modeling tasks.

So even if it does not explicitly change the text, it's still a tokenizer option, so in terms of separation of concerns, for me it belongs here! (Mostly because it was there before.)
Yes, I can see how that would be a problem for backward compatibility. Maybe we should deprecate the use of `text_pair`, `text_target`, etc. as args and not kwargs? Especially since they are optional and other kwargs can be used without them (e.g. `inputs = processor(image, words, boxes=boxes, return_tensors="pt")` in the `UdopModel` doc). However I'm not sure how we could catch the use of too many args to provide a deprecation warning.
I'm 💯 for deprecating the usage here and not leaving these args here; as we really want a unified API, I don't want to create exceptions. Even if users might use it that way / use the previous version for a while, the end goal is that other libs can also use processors with a single API, just having to inspect types to understand what a processor does.
One way to catch the deprecated usage could be to simply check whether these args are present in the `kwargs` (and different from their default) instead of relying on length. You could also use `inspect` directly for that.
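A rough sketch of what that `inspect`-based check could look like; the helper names and the toy `processor_call` are hypothetical, not transformers code. The idea is to warn only when a deprecated kwarg differs from its default in the signature:

```python
import inspect
import warnings

DEPRECATED_ARGS = ("text_pair", "text_target", "text_pair_target")

def warn_deprecated(func, received):
    """Warn for deprecated kwargs whose received value differs from the signature default."""
    sig = inspect.signature(func)
    for name in DEPRECATED_ARGS:
        if name in sig.parameters and received.get(name) != sig.parameters[name].default:
            warnings.warn(f"Passing `{name}` to the processor is deprecated.", FutureWarning)

def processor_call(images=None, text=None, text_pair=None):
    warn_deprecated(processor_call, {"text_pair": text_pair})
    return {"text": text, "text_pair": text_pair}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    processor_call(text="hello", text_pair="world")
print(len(caught))  # one FutureWarning, since text_pair differs from its default
```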
I don't feel strongly about it; we can keep it in `TextKwargs` as long as we don't break BC.
I've done this, which is a bit hacky but should preserve BC:

transformers/src/transformers/models/udop/processing_udop.py, lines 117 to 123 in 6d7e086:

```python
if "text_pair" not in output_kwargs["text_kwargs"]:
    warnings.warn(
        "No `text_pair` kwarg was detected. The use of `text_pair` as an argument without specifying it "
        "explicitly as `text_pair=` will be deprecated in future versions."
    )
# for BC
if audio is not None:
    output_kwargs["text_kwargs"]["text_pair"] = audio
```
Cool, let's do it with `logger.warning_once` and move the warning below, so that users see it only if they pass `text_pair` without indicating `text_pair=my_text`.
Honestly not a big fan of this - we shouldn't be using kwargs for a hack for which they are not advertised. Will take a look this afternoon and try to suggest something
Well, I'm kind of out of ideas; I'll trust you on finding something clean. At worst we can do as you say, but add a deprecation cycle for a few versions later. The closest I could find that does modify the signature is simply capturing all extra args, like so:

```python
def __call__(
    self,
    images: Optional[ImageInput] = None,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    *args,
    audio=None,
    videos=None,
    **kwargs: Unpack[UdopProcessorKwargs],
) -> BatchFeature:
    """
    This method first forwards the `images` argument to [`~UdopImageProcessor.__call__`]. In case
    [`UdopImageProcessor`] was initialized with `apply_ocr` set to `True`, it passes the obtained words and
    bounding boxes along with the additional arguments to [`~UdopTokenizer.__call__`] and returns the output,
    together with the prepared `pixel_values`. In case [`UdopImageProcessor`] was initialized with `apply_ocr`
    set to `False`, it passes the words (`text`/`text_pair`) and `boxes` specified by the user along with the
    additional arguments to [`~UdopTokenizer.__call__`] and returns the output, together with the prepared
    `pixel_values`.

    Alternatively, one can pass `text_target` and `text_pair_target` to prepare the targets of UDOP.

    Please refer to the docstring of the above two methods for more information.
    """
    # verify input
    output_kwargs = self._merge_kwargs(
        UdopProcessorKwargs,
        tokenizer_init_kwargs=self.tokenizer.init_kwargs,
        **kwargs,
    )

    # for BC, handle unexpected positional arguments
    if len(args) > 0:
        logger.warning_once(
            "Received unexpected positional arguments. These will be mapped to `text_pair`."
        )
        if len(args) == 1:
            # if there's one extra positional argument, assume it's `text_pair` for backward compatibility
            output_kwargs["text_kwargs"]["text_pair"] = args[0]
```

which feels a bit less hacky, if it indeed works. It would also allow us to not add extra placeholder args when we have more than one extra arg to take care of, as in this cool work: #32180
However I will take my annual holidays soon so I won't be able to decide more on that, I'll trust you to move on with something incredible anyways as you've done amazing work already @yonigozlan @zucchini-nlp 💜
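Stripped down to its essence, the `*args` capture idea runs like this; `processor_call` is a stand-in function, not the real `UdopProcessor.__call__`:

```python
import warnings

def processor_call(images=None, text=None, *args, **kwargs):
    text_kwargs = dict(kwargs)
    # for BC: a bare third positional argument used to be text_pair
    if len(args) == 1:
        warnings.warn(
            "Passing `text_pair` positionally is deprecated; use `text_pair=...`.",
            FutureWarning,
        )
        text_kwargs["text_pair"] = args[0]
    return text_kwargs

# Old positional call style still works (with a warning)...
print(processor_call(None, ["word"], ["pair"])["text_pair"])
# ...and the explicit keyword style is the forward-compatible one.
print(processor_call(None, ["word"], text_pair=["pair"])["text_pair"])
```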
Btw, I just realized we are swapping the input args order: it was text-first and now it will be image-first. AFAIK most people are used to passing text and then image in LLaVA models, without indicating the arg name. UPDATE: our slow tests for llava (not sure about others) also don't follow the new order, so we should update them.
So this PR is getting bigger than I anticipated :). I think it's close to ready to be merged so I re-requested reviews, but maybe I should break it up into smaller PRs first? cc @zucchini-nlp @molbap
Great work, LGTM!
```python
PreTrainedTokenizerBase,
PreTrainedTokenizerFast,
UdopProcessor,
```
`UdopProcessor` is imported below if pytesseract is available, so IMO we don't need to add it here.
I removed it from the pytesseract check instead, as there is a strange bug where the line below the class definition (`processor_class = UdopProcessor`) will still be executed even if the "requires" are not satisfied, which makes the CI break.
Does that mean UDOP has no dependency on pytesseract to run the processor test and will run successfully?
Only the image_processor (`LayoutLMv3ImageProcessor`) depends on pytesseract, but since the import check is already done at the level of `LayoutLMv3ImageProcessor`, it doesn't seem to me that it should also be done when importing the processor. Though I'm not sure how these nested requirement checks should be dealt with.
I see, that's weird because the Tester class has a `require_pytesseract` dependency, which afaik is the same as `is_pytesseract_available()`.
Actually, only importing the processor has no dependencies; from what I see, it doesn't use pytesseract directly. So it should be ok to import it as is, and the tests should be skipped by `require_pytesseract` if the package is not installed. I'm just curious why that broke, if you have bandwidth to explore it. I couldn't reproduce it by removing `UdopProcessor` from the general imports.
```diff
 def model_input_names(self):
-    return ["input_ids", "bbox", "attention_mask", "pixel_values"]
+    return ["pixel_values", "input_ids", "attention_mask", "bbox"]
```
just noted: why change the order here?
I changed the returned object of `UdopProcessor` from a `BatchEncoding` to a `BatchFeature` by updating the encoded images with the encoded text, and not the other way around, which changed the order of the output keys.
Removed Udop from this PR as it has some specific args to handle, so waiting on #33479 to be merged before opening another PR for it.
Really nice work ❤️
Great to see the combined efforts to make a clean processor interface being propagated to clean up the codebase 🧹
Just a few small comments; the main ones are about the commented-out code.
```python
# @require_vision
# @require_torch
# def test_tokenizer_defaults_preserved_by_kwargs(self):
```
To uncomment?
Yes, they can even be removed; forgot to do it, thanks.
```python
# @require_vision
# @require_torch
# def test_kwargs_overrides_default_tokenizer_kwargs(self):
```
Same here?
```diff
@@ -179,261 +179,3 @@ def test_model_input_names(self):
         list(inputs.keys()),
         ["input_ids", "attention_mask", "qformer_input_ids", "qformer_attention_mask", "pixel_values"],
     )
```
So much code deletion 🤩
```python
# Temporary fix for "padding_side" in init_kwargs
_ = output_kwargs["text_kwargs"].pop("padding_side", None)
```
Is this still needed? I can't remember the state of the solution for this
No it shouldn't be needed anymore! Thanks for catching that :)
```diff
-    return_tensors=return_tensors if images is None else None,
-    **kwargs,
+    output_kwargs["text_kwargs"]["add_special_tokens"] = (
+        output_kwargs["text_kwargs"]["add_special_tokens"] and add_eos_token
```
I know this is matching the logic above but this seems like it would produce some very surprising behaviour 👀 (not a comment saying you should change things, just noting)
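The surprising part is that the merge ANDs the user's kwarg with `add_eos_token`, so an explicit `add_special_tokens=True` can be silently downgraded. A minimal sketch of just that boolean logic (the helper name is made up for illustration):

```python
def resolve_add_special_tokens(user_value: bool, add_eos_token: bool) -> bool:
    # Mirrors: output_kwargs["text_kwargs"]["add_special_tokens"] =
    #          output_kwargs["text_kwargs"]["add_special_tokens"] and add_eos_token
    return user_value and add_eos_token

print(resolve_add_special_tokens(True, True))   # True
print(resolve_add_special_tokens(True, False))  # False: user's explicit True is overridden
print(resolve_add_special_tokens(False, True))  # False
```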
```python
    images=[lowres_img, cats_image], text=[self.prompt, self.prompt], return_tensors="pt", padding=True
).to(torch_device)

model.train()
```
Why set to training mode here? Is there an assertion on right padding because of this?
Not sure what happened here as I don't think I've made those changes 😅, maybe the rebase went wrong at some point. I will remove all that.
```python
    images=[lowres_img, cats_image], text=[self.prompt, self.prompt], return_tensors="pt", padding=True
).to(torch_device)

model.train()
```
same q here about forcing into training mode
same as above
What does this PR do?
Adds uniformized processor kwargs following #31911 for the following image-text-to-text models:
I will open a separate PR for Idefics/2 as their processors are quite different from the others.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@molbap @zucchini-nlp @amyeroberts