Uniformize model processors (models w/o special arg names) #32845
Conversation
if output_kwargs["text_kwargs"].get("visual_prompt") is not None and audio is not None:
    raise ValueError(
        "You cannot provide `visual_prompt` as a positional argument and as a keyword argument at the same time. "
        "Please provide it only as a keyword argument (i.e. `visual_prompt=...`)."
    )
if "visual_prompt" not in output_kwargs["text_kwargs"]:
    warnings.warn(
        "No `visual_prompt` kwarg was detected. The use of `visual_prompt` as an argument without specifying it explicitly as `visual_prompt=` will be deprecated in future versions."
    )
# For backwards compatibility, we reuse `audio` as `visual_prompt` in case
# downstream users passed it as a positional argument
if audio is not None:
    output_kwargs["text_kwargs"]["visual_prompt"] = audio

visual_prompt = output_kwargs["text_kwargs"].pop("visual_prompt", None)
same remark as in #32841: we should not be using kwargs for a purpose other than the one advertised!
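For illustration, a minimal sketch of the alternative being suggested, i.e. giving `visual_prompt` its own first-class parameter instead of routing it through `audio`. The helper name and signature are hypothetical, not the actual transformers API:

import warnings

def resolve_visual_prompt(audio=None, visual_prompt=None):
    # Hypothetical helper: prefer the explicit `visual_prompt` kwarg and only
    # fall back to the legacy `audio` slot, with a deprecation warning.
    if visual_prompt is not None:
        return visual_prompt
    if audio is not None:
        warnings.warn(
            "Passing the visual prompt through `audio` is deprecated; use `visual_prompt=...`.",
            FutureWarning,
        )
        return audio
    return None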
Thanks for this awesome work! I left some general comments, mostly nits about cleaning up docstrings and clarification questions for my own understanding.
Additionally, we have to return `BatchFeature` and not `BatchEncoding` in the processor's `__call__`, and Chameleon needs to be removed from this PR, as it now has its own PR.
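For reference, a minimal sketch of the requested return type, assuming `text_features` and `image_features` are the raw dict-like outputs of the tokenizer and image processor (the helper name is hypothetical):

from transformers import BatchFeature

def merge_outputs(text_features, image_features, return_tensors=None):
    # BatchFeature, not BatchEncoding, is the expected container for processor
    # outputs; tensor_type converts the merged lists/arrays on request.
    return BatchFeature(data={**text_features, **image_features}, tensor_type=return_tensors)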
images: Optional[VideoInput] = None,
text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
audio=None,
videos=None,
Hmm, this is quite interesting, since we accept `videos` as an argument but we still leave it as `images = VideoInput`.
Okay, let's leave it for BC. I need to spend some time on making new video models more standard; currently some work with `images` and some with `videos`. But that will not be soon, I guess.
@zucchini-nlp I've made these changes (a sketch of the resulting behaviour follows this list):
- we now allow the user to use `videos` instead of `images`
- using `images` now results in a future warning
- using both `videos` and `images` now results in an error
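A minimal sketch of that behaviour, not the exact PR code (the helper name is hypothetical):

import warnings

def resolve_videos(images=None, videos=None):
    if images is not None and videos is not None:
        raise ValueError("You cannot pass both `images` and `videos`; use `videos` only.")
    if images is not None:
        warnings.warn(
            "Passing videos through `images` is deprecated; use `videos=` instead.",
            FutureWarning,
        )
        videos = images
    return videos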
if isinstance(component_class_name, tuple):
    if "_fast" in component_class_name[0]:
        component_class_name = component_class_name[0]
    else:
        component_class_name = component_class_name[1]
Not very clear why we had to overwrite this. Is it because we need to work only with fast tokenizers?
If that's the case: usually there should be no difference in how text is encoded with fast vs. slow tokenizers, so we should rather dig into why there's a difference here.
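One quick way to check whether the two tokenizer classes really encode text differently (the checkpoint id is a placeholder):

from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("some/checkpoint", use_fast=False)
fast = AutoTokenizer.from_pretrained("some/checkpoint", use_fast=True)

sample = "a sample sentence"
assert slow(sample)["input_ids"] == fast(sample)["input_ids"], "fast and slow encodings diverge"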
I'll make the Chameleon-related changes in this PR: #32181
After that gets merged, I'll rebase this to main
@zucchini-nlp I just added 4 more models:
Perfect, thanks for adding them! I left some general questions, but I think we will first focus on and merge a smaller PR, to decide on how to handle the edge cases we've been discussing.
-         return_token_type_ids=False,
-         **kwargs,
-     )
+     textual_input = self.tokenizer(text, **output_kwargs["text_kwargs"])
This actually doesn't match the removed lines. Previously we had some kwargs hardcoded, like truncation or padding, and now users can set those to any value.
TBH I don't know why these were hardcoded, but I'd leave them as is to not break anything. In that case we don't need the `pad_to_max_length` kwarg, as the text is going to be padded to max length no matter what.
I moved them to the default values in `TvpProcessorKwargs`. I think that'd suffice, since this part would've errored out anyway if downstream users passed them as kwargs too (duplicate-args error).
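Presumably something along these lines, with the formerly hardcoded tokenizer kwargs moved into the `_defaults` of the kwargs class (the exact default values shown are illustrative):

from transformers.processing_utils import ProcessingKwargs

class TvpProcessorKwargs(ProcessingKwargs, total=False):
    _defaults = {
        "text_kwargs": {
            "padding": "max_length",
            "truncation": True,
            "return_token_type_ids": False,
        },
    }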
+ return_tensors = output_kwargs["common_kwargs"].get("return_tensors")
  if text is not None and videos is not None:
-     encoding["pixel_values"] = image_features.pixel_values
-     return encoding
+     return BatchFeature(data=dict(**encoding, **image_features), tensor_type=return_tensors)
  elif text is not None:
-     return encoding
+     return BatchFeature(data=dict(**encoding), tensor_type=return_tensors)
  else:
-     return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
+     return BatchFeature(data=dict(**image_features), tensor_type=return_tensors)
We can simplify this and other processors by only doing:

encoding = image_features = {}
# process inputs if present here
return BatchFeature(data=dict(**encoding, **image_features), tensor_type=return_tensors)
@zucchini-nlp I've implemented it slightly differently, but it's now a lot cleaner than before
@yonigozlan @molbap you guys might wanna take a look too
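For context, a minimal standalone sketch of the single-return pattern being discussed (hypothetical helper; separate empty dicts are used so the two names never alias the same object):

from transformers import BatchFeature

def process(tokenizer, image_processor, text=None, videos=None, output_kwargs=None):
    output_kwargs = output_kwargs or {}
    encoding, image_features = {}, {}
    if text is not None:
        encoding = tokenizer(text, **output_kwargs.get("text_kwargs", {}))
    if videos is not None:
        image_features = image_processor(videos, **output_kwargs.get("images_kwargs", {}))
    return_tensors = output_kwargs.get("common_kwargs", {}).get("return_tensors")
    # Single exit point: whatever was produced gets merged into one BatchFeature.
    return BatchFeature(data={**encoding, **image_features}, tensor_type=return_tensors)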
"videos" not in inspect.signature(self.processor_class.__call__).parameters | ||
or inspect.signature(self.processor_class.__call__).parameters["videos"].annotation == inspect._empty |
Is this because of InstructBlipVideo? If it's only that, we'd better override this test for InstructBlipVideo and leave the check as `"video_processor" in classes`.
When overriding for TVP, there's no need to check, as we know whether the model has a video processor or not. Actually, it seems like it's going to be skipped, because TVP has only an `image_processor`.
This is for:
- InstructBlipVideo
- TVP
- Video Llava, and
- X-Clip

Only Llava-Next-Video actually has a `video_processor_class`.
I'm leaning toward leaving this here, since it covers more models than the original check and it significantly reduces the LoC count.
class TvpProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    from_pretrained_id = "Jiqing/tiny-random-tvp"
    processor_class = TvpProcessor
    videos_data_arg_name = "pixel_values"
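A hypothetical illustration of how the shared mixin can use that attribute instead of hardcoding the output key (`prepare_video_inputs` stands in for whatever video fixture the test suite actually provides):

def test_video_key_present(self):
    processor = self.processor_class.from_pretrained(self.from_pretrained_id)
    outputs = processor(text="lower newer", videos=self.prepare_video_inputs())
    # The model-specific key ("pixel_values" for TVP) is read from the class attribute.
    self.assertIn(self.videos_data_arg_name, outputs)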
@zucchini-nlp just a heads-up, since I remember you mentioning that you plan on reworking the text-video models:
the output keys for video data are inconsistent; some models use `pixel_values_videos` while some just use `pixel_values`. You'd most likely need to uniformize/standardize them too.
Yes, it's me who started using `pixel_values_videos`, hehe. Thanks for letting me know :)
What does this PR do?
TODO:
Fixes # (issue)
Who can review?
@zucchini-nlp @molbap @NielsRogge