
Uniformize kwargs for LLaVa processor and update docs #32858

Merged

Conversation

yonigozlan (Member)

What does this PR do?

Adds uniformized processor kwargs following #31911 for LLaVa.
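
For context, here is a minimal sketch of the uniformized-kwargs pattern from #31911 as it applies to LLaVa: the processor declares a typed kwargs class with per-modality defaults and merges user kwargs against it in `__call__`. The defaults and method body below are illustrative, not the exact merged code.

```python
from typing_extensions import Unpack
from transformers.processing_utils import ProcessingKwargs, ProcessorMixin

class LlavaProcessorKwargs(ProcessingKwargs, total=False):
    # per-modality defaults; the exact values here are illustrative
    _defaults = {
        "text_kwargs": {"padding": False},
        "images_kwargs": {},
    }

class LlavaProcessor(ProcessorMixin):
    def __call__(self, images=None, text=None, **kwargs: Unpack[LlavaProcessorKwargs]):
        # merge user kwargs against the typed defaults, split per modality
        output_kwargs = self._merge_kwargs(
            LlavaProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )
        # images -> self.image_processor(images, **output_kwargs["images_kwargs"])
        # text   -> self.tokenizer(text, **output_kwargs["text_kwargs"])
```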

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@molbap @zucchini-nlp

@yonigozlan yonigozlan mentioned this pull request Aug 16, 2024
@yonigozlan yonigozlan marked this pull request as ready for review August 16, 2024 20:44
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member) left a comment:

Thanks for working on this, LGTM!

Left one nit on the docstring, and we'd also need to swap the processor args order in test_modeling_llava.py.

# check if images and text inputs are reversed for BC
if (
    text is not None
    and not isinstance(text[0], str)
Contributor:

text could also just be a str

Member Author:

The check should still work in that case, except if we have an empty string. I will try to think of something cleaner.

@yonigozlan (Member Author) left a comment:

Hey @zucchini-nlp @leloykun! Here is a revised version of the backward-compatibility handling for reversed images and text inputs. The main addition is the `_check_reversed_images_text` function in `processing_utils.py`, which should better detect whether inputs need to be reversed, and can be used by all processors where the `images` and `text` inputs have been swapped.
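
For illustration, a minimal sketch of the idea behind such a check (`_is_valid_text` below is a simplified hypothetical helper, not the merged implementation):

```python
import warnings

def _is_valid_text(t):
    # simplified: a string, or a non-empty (possibly nested) list/tuple of strings
    if isinstance(t, str):
        return True
    if isinstance(t, (list, tuple)) and len(t) > 0:
        return all(_is_valid_text(el) for el in t)
    return False

def _check_reversed_images_text(images, text):
    # If `images` looks like text and `text` doesn't, the caller most likely
    # used the legacy (text, images) order; swap back and warn for BC.
    if _is_valid_text(images) and not _is_valid_text(text):
        warnings.warn(
            "You may have used the wrong order for inputs. `images` should be "
            "passed before `text`. The `images` and `text` inputs will be swapped."
        )
        return text, images
    return images, text
```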

@@ -120,7 +118,6 @@ class TextKwargs(TypedDict, total=False):
     return_offsets_mapping: Optional[bool]
     return_length: Optional[bool]
     verbose: Optional[bool]
-    padding_side: Optional[str]
Member Author:

I removed padding_side in this PR, as this should be one of the first uniformization PRs to get merged. This shouldn't break anything (see #32544 (comment)).

Collaborator:

I believe it will break things in the sense that people who were previously passing padding_side to the processor (even if it had no effect) would now experience an error?

Member Author:

In the case of llava at least, even with padding_side in the base TextKwargs, passing padding_side to a processor call will raise an error, as the tokenizer inherits from PreTrainedTokenizerFast, whose _batch_encode_plus accepts neither padding_side nor **kwargs:

def _batch_encode_plus(
    self,
    batch_text_or_text_pairs: Union[
        List[TextInput], List[TextInputPair], List[PreTokenizedInput], List[PreTokenizedInputPair]
    ],
    add_special_tokens: bool = True,
    padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
    truncation_strategy: TruncationStrategy = TruncationStrategy.DO_NOT_TRUNCATE,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    return_tensors: Optional[str] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    split_special_tokens: bool = False,
) -> BatchEncoding:

On that point, if we don't accept kwargs in this function, shouldn't we restrict the kwargs passed to the processor's modality-processor classes? I guess this would require a bit of refactoring of the base processor class, but right now, if a user passes a kwarg not supported by any ModalityKwarg class, it is forwarded as a "common" kwarg to all of the modality processors. If the call functions of the modality processors accept kwargs, the kwarg is simply ignored, but in the case of this _batch_encode_plus function (there might be others) it raises an error.
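
As an illustration of that restriction idea, kwargs could be filtered against the keys declared in each modality's TypedDict before being forwarded, so strict callables like `_batch_encode_plus` never see unknown keys. A hypothetical sketch, not current behavior (the TextKwargs below is a cut-down stand-in):

```python
from typing import Optional, TypedDict

class TextKwargs(TypedDict, total=False):
    padding: Optional[bool]
    max_length: Optional[int]
    # ... other declared text kwargs

def filter_modality_kwargs(modality_kwargs_cls, **kwargs):
    # keep only kwargs declared on the modality's TypedDict
    allowed = modality_kwargs_cls.__annotations__.keys()
    return {k: v for k, v in kwargs.items() if k in allowed}

text_kwargs = filter_modality_kwargs(TextKwargs, padding=True, bogus_kwarg=1)
assert text_kwargs == {"padding": True}  # unknown kwarg silently dropped
```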

Collaborator:

OK, I see. More generally, we might want to think about being able to control the padding side when calling the tokenizer, as this is something we'd like to be able to handle for e.g. llava and more generally, but that's for a wider discussion and future PRs cc @zucchini-nlp

In this case, if padding_side can't be accepted by fast tokenizers, then yes, I think we should remove it here. I'd do this in a separate PR, as it might affect other processors which use TextKwargs, and so it would be good to introduce it as an atomic change we can easily test, isolate, and revert if needed.

Member:

Seems like we are open to accepting padding_side as a kwarg in tokenizers (#30447), but I couldn't find a PR for that. I totally agree that it's a nice feature to have, and it looks easy to implement. I can work on it next week so that we don't have to drop the kwarg in processors. The only thing is that it will be kinda BC-breaking, because earlier we ignored the kwarg and now we'll use it to pad on the correct side. WDYT? @amyeroberts

Collaborator:

@zucchini-nlp Sounds good to me! Even if it's breaking, I think it's breaking in the right way: correcting a surprising behaviour

Member Author:

Added padding_side back to this PR. Waiting on padding_side being accepted as a kwarg in tokenizers before merging this PR, then. Could you ping this PR or me when you open a PR, @zucchini-nlp? Thanks!

Member:

Here it is: #33385
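
Once that lands, the kwarg can flow through a processor call like any other TextKwargs entry. A hedged usage sketch, assuming #33385 is merged (the checkpoint and prompt are just examples):

```python
from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
inputs = processor(
    images=Image.new("RGB", (336, 336)),
    text="USER: <image>\nWhat is shown? ASSISTANT:",
    padding=True,
    padding_side="left",  # assumed per-call override enabled by #33385
    return_tensors="pt",
)
```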

@@ -830,6 +827,60 @@ class MyProcessingKwargs(ProcessingKwargs, CommonKwargs, TextKwargs, ImagesKwarg
            output_kwargs[modality].update(output_kwargs["common_kwargs"])
        return output_kwargs

def _check_reversed_images_text(self, images, text):
Member Author:

This function might be too problem-specific to be included in the base processor class, but since several VLMs are affected by this issue and it's only a temporary solution pending deprecation, I thought it might make sense to include it here. It seems more practical than having the function copied across multiple processors, and it should also simplify the deprecation process. Of course, if you think there's a better location for this or another approach that would be more suitable, I'm open to suggestions.

Member:

Hmm, to me it seems like it's better placed in each processor file with "Copied from" statements. I guess there are at most 5 or so models that need swapping, if I'm not wrong. For changes to the general processor file, @ amyeroberts can say more.

Member Author:

I just spoke with Amy, and a good middle-ground solution would be to keep this function in processing_utils but outside of any class, so that models which need it explicitly import it from processing_utils. This approach avoids cluttering the base Processor class with a very problem-specific function and limits the diffs when adding and deprecating this behavior.
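
Roughly, the agreed layout looks like this (a sketch; the helper name follows the commit list at the end of this thread, where it was ultimately called `_validate_images_text_input_order`):

```python
# processing_utils.py -- module level, outside ProcessorMixin
def _validate_images_text_input_order(images, text):
    # swap-and-warn logic as sketched earlier in this thread
    return images, text

# models/llava/processing_llava.py -- explicit import, e.g.:
# from ...processing_utils import _validate_images_text_input_order

class LlavaProcessor:
    def __call__(self, images=None, text=None, **kwargs):
        images, text = _validate_images_text_input_order(images, text)
        # ... proceed with the image processor and tokenizer
```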

Comment on lines 877 to 878
"You may have used the wrong order for inputs. `images` should be passed before `text`. "
"The `images` and `text` inputs will be swapped."
Member Author:

Should we include info about deprecating this behavior in a future version here?

Member:

Yes please, 2-3 major versions from current should be enough
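
For example, the warning could carry an explicit horizon (the version number below is illustrative only):

```python
import warnings

warnings.warn(
    "You may have used the wrong order for inputs. `images` should be passed "
    "before `text`. The `images` and `text` inputs will be swapped. "
    "This behavior will be deprecated in a future release (e.g. v4.47)."
)
```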


Comment on lines 850 to 853
elif isinstance(t[0], (list, tuple)):
    # ... list with an empty list or with a list of strings
    return len(t[0]) == 0 or isinstance(t[0][0], str)
else:
Member:

I don't think we work with nested lists of texts in processors; that's usually for tokenizers with text pairs. Also, is the empty list a valid input type?

What can be a valid type is an encoded text, which is missing here.

@yonigozlan (Member Author), Aug 20, 2024:

I took an existing _is_valid_text_input function that is defined in several tokenizers, but you are right, it should probably be adapted a bit for processors.
Some VLMs for object detection do use nested lists of text, such as Owlv2 or OmDet-Turbo, so it's probably better to keep this.
Encoded texts are lists of ints, right? I will add them. Although it seems we usually don't advertise EncodedInput as an acceptable input type for text in processors, so I don't know whether that is on purpose or not.
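
A hedged sketch of the adapted check under discussion: accept a plain string, an encoded text (list of token ids), a list of strings, or a nested list of strings (as used by e.g. Owlv2 or OmDet-Turbo):

```python
def _is_valid_text_input(t):
    if isinstance(t, str):
        # a single string
        return True
    elif isinstance(t, (list, tuple)):
        if len(t) == 0:
            # an empty list
            return True
        elif isinstance(t[0], (str, int)):
            # a list of strings, or an encoded text (list of token ids)
            return True
        elif isinstance(t[0], (list, tuple)):
            # a nested list: empty inner list or a list of strings
            return len(t[0]) == 0 or isinstance(t[0][0], str)
    return False
```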

Contributor:

We can also reuse the is_valid_image util.

@yonigozlan (Member Author), Aug 20, 2024:

Yep, I'm using valid_images, which recursively uses is_valid_image to check for nested images.
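
`is_valid_image` and `valid_images` live in `transformers.image_utils`; a rough sketch of how the reversal check can lean on them (the wrapper below is hypothetical):

```python
from transformers.image_utils import valid_images

def looks_like_images(inputs):
    # hypothetical guard: strings are never images; otherwise defer to
    # valid_images, which applies is_valid_image across nested lists
    if inputs is None or isinstance(inputs, str):
        return False
    return valid_images(inputs)
```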


@amyeroberts (Collaborator) left a comment:

Thanks for adding - looking good!

Main comment is to make sure there are tests for the images/text inversion logic.

src/transformers/models/llava/processing_llava.py (outdated, resolved)
src/transformers/processing_utils.py (outdated, resolved)
tests/models/llava/test_processor_llava.py (outdated, resolved)
tests/models/llava/test_processor_llava.py (outdated, resolved)

src/transformers/processing_utils.py (outdated, resolved)
src/transformers/processing_utils.py (outdated, resolved)
@@ -993,6 +990,59 @@ def apply_chat_template(
)


def _check_reversed_images_text_for_vlms(images, text):
Collaborator:

This behaviour should be tested

Member Author:

@amyeroberts since this function is in processing_utils, where would the best place be to test it? Or should we create model-specific tests for all models using it?

Collaborator:

If there isn't a module for testing utils in processing_utils then we should add one!

@amyeroberts (Collaborator) left a comment:

Looking good! Changes to llava all look OK to me. I have one question about the tests for valid text inputs.

Main comment is that we should split up the changes to llava and the addition of the verification of the input order for the processors.

Collaborator:

❤️

tests/utils/test_processing_utils.py (outdated, resolved)
Comment on lines 145 to 148
self.assertIsInstance(valid_images, torch.Tensor)
self.assertEqual(valid_text, text)
Collaborator:

This is a bit funny - why check the value of one and the instance type of another?

@yonigozlan (Member Author), Sep 3, 2024:

I wanted to use assert functions from unittest.TestCase, and assertEqual doesn't work with tensors or numpy arrays. Also, I thought that since only image inputs can be tensors/np arrays, if they are indeed tensors/np arrays, it means the switch (or not) went OK. But I agree it's not very consistent with the rest, and it doesn't check whether the function modified the inputs during the switch.
I could use self.assertTrue(torch.equal(...)) and self.assertTrue(np.array_equal(...)) instead?
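
A quick sketch of those value-based assertions (the inputs here are placeholders, not the actual test fixtures):

```python
import unittest
import numpy as np
import torch

class ProcessingUtilTester(unittest.TestCase):
    def test_inputs_compared_by_value(self):
        images = torch.rand(1, 3, 224, 224)
        images_np = np.random.rand(1, 3, 224, 224)
        text = ["lower newer"]
        # value comparisons that work for tensors and arrays
        self.assertTrue(torch.equal(images, images.clone()))
        self.assertTrue(np.array_equal(images_np, images_np.copy()))
        self.assertEqual(text, list(text))
```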

tests/utils/test_processing_utils.py (outdated, resolved)
src/transformers/processing_utils.py (outdated, resolved)

@yonigozlan force-pushed the uniformize-processors-kwargs-llava branch from 2a7016d to 4dc7ada on September 13, 2024.
@yonigozlan (Member Author):

Now that #33385 has been merged, this should be ready for review!

@amyeroberts (Collaborator) left a comment:

Looks great - thanks for making our processors nice and uniform!

@molbap (Contributor) left a comment:

Catching up on reviews - just saw @amyeroberts already approved, but I can add my opinion and approve as well 😁. Thanks a lot for working on this. I'll get back to the other pending PRs to help the effort as well.

@yonigozlan yonigozlan merged commit 2f62146 into huggingface:main Sep 16, 2024
17 checks passed
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
Uniformize kwargs for LLaVa processor and update docs (#32858)

* Uniformize kwargs for LlaVa and update docs

* Change order of processor inputs in docstring

* Improve BC support for reversed images and text inputs

* cleanup llava processor call docstring

* Add encoded inputs as valid text inputs in reverse input check, add deprecation version in warning

* Put function check reversed images text outside base processor class

* Refactor _validate_images_text_input_order

* Add ProcessingUtilTester

* fix processing and test_processing