Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniformize kwargs for chameleon processor #32181

Merged
merged 12 commits into from
Sep 26, 2024

Conversation

leloykun
Copy link
Contributor

@leloykun leloykun commented Jul 24, 2024

What does this PR do?

Uniformizes kwargs of Chameleon processors as discussed in #31911

Currently a draft. Will set as ready for review once this PR gets merged: #32013 The other PR will take longer to complete, but this can now be merged.

Fixes # (issue)

Who can review?

@zucchini-nlp @molbap

@zucchini-nlp zucchini-nlp mentioned this pull request Aug 7, 2024
40 tasks
@leloykun leloykun force-pushed the fc--uniform-kwargs-chameleon branch from b184e46 to 2f4163a Compare August 16, 2024 07:38
@leloykun leloykun marked this pull request as ready for review August 16, 2024 07:40
@leloykun
Copy link
Contributor Author

@zucchini-nlp this should now also be ready for review

Copy link
Contributor

@molbap molbap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Just one extra default I don't get other than that looks fine

src/transformers/models/chameleon/processing_chameleon.py Outdated Show resolved Hide resolved
@leloykun leloykun requested a review from molbap August 16, 2024 09:38
Copy link
Member

@zucchini-nlp zucchini-nlp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! I think we have to swap args order and it will be ready to merge

Comment on lines 17 to 21
if isinstance(component_class_name, tuple):
if "_fast" in component_class_name[0]:
component_class_name = component_class_name[0]
else:
component_class_name = component_class_name[1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as in the other PR, why we need to overwrite and look for fastTokenizer? Or is it FastImageProcessor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of the common tests error-out with the base tokenizer

I've yet to investigate why, but it's likely unrelated to this PR

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, nice to see what exactly is causing the error, in case there is a bug in tokenizers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I remember now

The slow, LlamaTokenizer expects the vocab file to be present but it's neither in the official repo nor does it get saved to the temp dir when we do processor.save_pretrained(self.tmpdirname) in setUp

imma add this as a comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, chameleon never had a slow tokenizer. Oke, in that case we can and prob should remove an option for slow tokenizer in chameleon here so that the tuple is (None, FastTokenizer)

"chameleon",
(
"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),

And then add a check for Noneness in general get_component

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just removed Chameleon's slow tokenizer & removed the custom get_component in ChameleonProcessorTest (we don't need the extra check cuz there's only one tokenizer left

src/transformers/models/chameleon/processing_chameleon.py Outdated Show resolved Hide resolved
@@ -233,13 +236,14 @@ def test_unstructured_kwargs_batched(self):
images=image_input,
return_tensors="pt",
size={"height": 214, "width": 214},
crop_size={"height": 214, "width": 214},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yonigozlan i think you removed crop_size from common tests and it had smth to do with some image processors accepting/not accepting certain kwargs?

Copy link
Member

@yonigozlan yonigozlan Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zucchini-nlp Yes but actually it would be nice to have both here. @molbap had some CI tests crash because crop_size was removed here and the image_processor had do_center_crop set to True by default which canceled out size. Having both would handle cases where either do_center_crop is set to True in the image_processor by default, or crop_size is not supported by the image_processor.
So I am for keeping this and merging this PR before some other kwargs uniformization PRs

@@ -24,7 +24,7 @@

from transformers import (
ChameleonConfig,
ChameleonForCausalLM,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw @zucchini-nlp we might need to increase prio for this PR because of this

I have this change in my other PR too, but I forgot we haven't merged it yet

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was out for a while. Yes, I think some other contributor also reported the issue and wanted to open a PR to fix the conversion script. Feel free to open a PR if there isn't any, as this issue isn't at all related to processor kwargs

@yonigozlan
Copy link
Member

Thanks so much for your contribution @leloykun ! This PR was a bit outdated compared to main so I rebased it and modified some other nits, but otherwise it seems all good to me.
@molbap @amyeroberts this should be ready for review!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding!

Overall looks good, just two main things to address:

  • Removing tests/models/chameleon/test_processor_chameleon.py
  • Undoing the removal of the fallback to the slow tokenizer

"LlamaTokenizer" if is_sentencepiece_available() else None,
"LlamaTokenizerFast" if is_tokenizers_available() else None,
),
(None, "LlamaTokenizerFast" if is_tokenizers_available() else None),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove the slow tokenizer here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a vocab file missing if I understood correctly, but I will see if it can be added back

@@ -45,7 +61,7 @@ class ChameleonProcessor(ProcessorMixin):
"""

attributes = ["image_processor", "tokenizer"]
tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
tokenizer_class = "LlamaTokenizerFast"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - why remove the slow tokenizer option?

Comment on lines 96 to 102
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:

- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should stay in the docstring for the moment, as it's required for users to get the right output from the processor to pass to the model

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To remove?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no custom tests but it still inherits the tests from ProcessorTesterMixin

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I reviewed too quickly and thought this was a scrap file. We should keep and:

  • Update the checkpoint
  • Add a copyright header

@leloykun
Copy link
Contributor Author

the changes look good to me

thanks for the help @yonigozlan!!

Copy link
Collaborator

@amyeroberts amyeroberts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating!

tests/models/chameleon/test_processor_chameleon.py Outdated Show resolved Hide resolved
@yonigozlan yonigozlan merged commit 0a21381 into huggingface:main Sep 26, 2024
17 checks passed
BenjaminBossan pushed a commit to BenjaminBossan/transformers that referenced this pull request Sep 30, 2024
* uniformize kwargs of Chameleon

* fix linter nit

* rm stride default

* add tests for chameleon processor

* fix tests

* add comment on get_component

* rm Chameleon's slow tokenizer

* add check order images text + nit

* update docs and tests

* Fix LlamaTokenizer tests

* fix gated repo access

* fix wrong import

---------

Co-authored-by: yonigozlan <[email protected]>
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
* uniformize kwargs of Chameleon

* fix linter nit

* rm stride default

* add tests for chameleon processor

* fix tests

* add comment on get_component

* rm Chameleon's slow tokenizer

* add check order images text + nit

* update docs and tests

* Fix LlamaTokenizer tests

* fix gated repo access

* fix wrong import

---------

Co-authored-by: yonigozlan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants