Uniformize kwargs for chameleon processor #32181
Conversation
Force-pushed from b184e46 to 2f4163a
@zucchini-nlp this should now also be ready for review
Thanks for the contribution! There's just one extra default I don't get; other than that, it looks fine.
Great job! I think we have to swap the args order, and then it will be ready to merge.
```python
if isinstance(component_class_name, tuple):
    if "_fast" in component_class_name[0]:
        component_class_name = component_class_name[0]
    else:
        component_class_name = component_class_name[1]
```
Same question as in the other PR: why do we need to overwrite this and look for the fast tokenizer? Or is it FastImageProcessor?
Some of the common tests error out with the base tokenizer. I've yet to investigate why, but it's likely unrelated to this PR.
Yes, it would be nice to see what exactly is causing the error, in case there is a bug in tokenizers.
Ah, I remember now. The slow LlamaTokenizer expects the vocab file to be present, but it's neither in the official repo nor does it get saved to the temp dir when we do `processor.save_pretrained(self.tmpdirname)` in `setUp`. I'll add this as a comment.
Ah right, Chameleon never had a slow tokenizer. OK, in that case we can, and probably should, remove the slow-tokenizer option for Chameleon here so that the tuple is (None, FastTokenizer):
transformers/src/transformers/models/auto/tokenization_auto.py, lines 111 to 115 in 85345bb:
"chameleon", | |
( | |
"LlamaTokenizer" if is_sentencepiece_available() else None, | |
"LlamaTokenizerFast" if is_tokenizers_available() else None, | |
), |
And then add a check for `None`-ness in the general `get_component`.
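For concreteness, here is a minimal sketch of what that `None`-check could look like in the shared `ProcessorTesterMixin.get_component`; the helper `processor_class_from_name` and the `processor_class`/`tmpdirname` attributes follow the test mixin's conventions, but this is an assumed sketch, not the merged code:

```python
from transformers.models.auto.processing_auto import processor_class_from_name

def get_component(self, attribute, **kwargs):
    # e.g. attribute == "tokenizer" -> ChameleonProcessor.tokenizer_class
    component_class_name = getattr(self.processor_class, f"{attribute}_class")
    if isinstance(component_class_name, tuple):
        # Registrations like (None, "LlamaTokenizerFast") mark models that
        # never had a slow class; skip None entries instead of indexing blindly.
        component_class_name = next(
            name for name in component_class_name if name is not None
        )
    component_class = processor_class_from_name(component_class_name)
    return component_class.from_pretrained(self.tmpdirname, **kwargs)
```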
Just removed Chameleon's slow tokenizer and the custom `get_component` in `ChameleonProcessorTest` (we don't need the extra check because there's only one tokenizer left).
tests/test_processing_common.py (outdated)

```diff
@@ -233,13 +236,14 @@ def test_unstructured_kwargs_batched(self):
     images=image_input,
     return_tensors="pt",
     size={"height": 214, "width": 214},
+    crop_size={"height": 214, "width": 214},
```
@yonigozlan I think you removed `crop_size` from the common tests, and it had something to do with some image processors accepting/not accepting certain kwargs?
@zucchini-nlp Yes, but actually it would be nice to have both here. @molbap had some CI tests crash because `crop_size` was removed here while the image_processor had `do_center_crop` set to `True` by default, which canceled out `size`. Having both handles the cases where either `do_center_crop` is set to `True` in the image_processor by default, or `crop_size` is not supported by the image_processor. So I am for keeping this and merging this PR before some of the other kwargs-uniformization PRs.
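For illustration, a hedged sketch of that interaction, using `CLIPImageProcessor` only because it center-crops by default; the sizes are arbitrary:

```python
from PIL import Image
from transformers import CLIPImageProcessor

image = Image.new("RGB", (640, 480))
image_processor = CLIPImageProcessor()  # do_center_crop defaults to True

# Passing only `size` is not enough: the default crop_size (224x224) is
# applied afterwards and "cancels out" the requested size.
only_size = image_processor(image, size={"shortest_edge": 214}, return_tensors="pt")
print(only_size["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

# Passing both makes the final shape explicit regardless of the defaults.
both = image_processor(
    image,
    size={"shortest_edge": 214},
    crop_size={"height": 214, "width": 214},
    return_tensors="pt",
)
print(both["pixel_values"].shape)  # torch.Size([1, 3, 214, 214])
```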
```
@@ -24,7 +24,7 @@

from transformers import (
    ChameleonConfig,
    ChameleonForCausalLM,
```
By the way @zucchini-nlp, we might need to increase the priority of this PR because of this. I have this change in my other PR too, but I forgot we haven't merged it yet.
Sorry, I was out for a while. Yes, I think another contributor also reported the issue and wanted to open a PR to fix the conversion script. Feel free to open a PR if there isn't one already, as this issue isn't related to the processor kwargs at all.
Force-pushed from 0dbc570 to b252643
Force-pushed from b252643 to 5082630
Thanks so much for your contribution @leloykun! This PR was a bit outdated compared to main, so I rebased it and fixed some other nits, but otherwise it all looks good to me.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for adding!
Overall looks good, just two main things to address:
- Removing `tests/models/chameleon/test_processor_chameleon.py`
- Undoing the removal of the fallback to the slow tokenizer
"LlamaTokenizer" if is_sentencepiece_available() else None, | ||
"LlamaTokenizerFast" if is_tokenizers_available() else None, | ||
), | ||
(None, "LlamaTokenizerFast" if is_tokenizers_available() else None), |
Why remove the slow tokenizer here?
There was a vocab file missing if I understood correctly, but I will see if it can be added back
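For reference, a rough sketch of how such a `(slow, fast)` registration tuple gets consumed when a tokenizer is requested; this is a simplification for illustration, not the actual `AutoTokenizer` logic:

```python
def pick_tokenizer_class(slow_class, fast_class, use_fast=True):
    # With ("LlamaTokenizer", "LlamaTokenizerFast"), either side can be used;
    # with (None, "LlamaTokenizerFast"), only the fast tokenizer exists, so
    # callers must be prepared for a missing slow class.
    if use_fast and fast_class is not None:
        return fast_class
    if slow_class is not None:
        return slow_class
    raise ValueError("No usable tokenizer class is registered for this model.")
```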
```diff
@@ -45,7 +61,7 @@ class ChameleonProcessor(ProcessorMixin):
     """

     attributes = ["image_processor", "tokenizer"]
-    tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
+    tokenizer_class = "LlamaTokenizerFast"
```
Same here - why remove the slow tokenizer option?
```python
return_tensors (`str` or [`~utils.TensorType`], *optional*):
    If set, will return tensors of a particular framework. Acceptable values are:

    - `'tf'`: Return TensorFlow `tf.constant` objects.
    - `'pt'`: Return PyTorch `torch.Tensor` objects.
    - `'np'`: Return NumPy `np.ndarray` objects.
    - `'jax'`: Return JAX `jnp.ndarray` objects.
```
This should stay in the docstring for the moment, as it's required for users to get the right output from the processor to pass to the model
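As a usage sketch of what that buys users (the gated `facebook/chameleon-7b` checkpoint and the `<image>` placeholder follow the Chameleon docs, but treat this as an assumed example):

```python
from PIL import Image
from transformers import ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
image = Image.new("RGB", (512, 512))

# With return_tensors="pt" the processor returns torch.Tensors that can be
# passed straight to the model; without it, plain Python lists come back.
inputs = processor(text="<image>Describe the image.", images=image, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["pixel_values"].shape)
```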
To remove?
There are no custom tests, but it still inherits the tests from `ProcessorTesterMixin`.
Sorry, I reviewed too quickly and thought this was a scrap file. We should keep it and:
- Update the checkpoint
- Add a copyright header
The changes look good to me, thanks for the help @yonigozlan!
Thanks for updating!
* uniformize kwargs of Chameleon
* fix linter nit
* rm stride default
* add tests for chameleon processor
* fix tests
* add comment on get_component
* rm Chameleon's slow tokenizer
* add check order images text + nit
* update docs and tests
* Fix LlamaTokenizer tests
* fix gated repo access
* fix wrong import

Co-authored-by: yonigozlan <[email protected]>
What does this PR do?
Uniformizes kwargs of Chameleon processors as discussed in #31911
Currently a draft. Will set as ready for review once this PR gets merged: #32013. The other PR will take longer to complete, but this can now be merged.
Fixes # (issue)
Who can review?
@zucchini-nlp @molbap