Uniformize kwargs for chameleon processor #32181
Conversation
Force-pushed from b184e46 to 2f4163a
@zucchini-nlp this should now also be ready for review
Thanks for the contribution! There's just one extra default I don't get; other than that, it looks fine.
Great job! I think we have to swap the args order, and then it will be ready to merge.
```python
if isinstance(component_class_name, tuple):
    if "_fast" in component_class_name[0]:
        component_class_name = component_class_name[0]
    else:
        component_class_name = component_class_name[1]
```
Same question as in the other PR: why do we need to overwrite this and look for the fast tokenizer? Or is it FastImageProcessor?
Some of the common tests error out with the base tokenizer. I've yet to investigate why, but it's likely unrelated to this PR.
Yes, it would be nice to see what exactly is causing the error, in case there is a bug in tokenizers.
Ah, I remember now. The slow LlamaTokenizer expects the vocab file to be present, but it's neither in the official repo nor does it get saved to the temp dir when we do `processor.save_pretrained(self.tmpdirname)` in `setUp`. I'll add this as a comment.
Ah right, Chameleon never had a slow tokenizer. OK, in that case we can, and probably should, remove the slow-tokenizer option for Chameleon here so that the tuple is (None, FastTokenizer):
transformers/src/transformers/models/auto/tokenization_auto.py, lines 111 to 115 in 85345bb:
"chameleon", | |
( | |
"LlamaTokenizer" if is_sentencepiece_available() else None, | |
"LlamaTokenizerFast" if is_tokenizers_available() else None, | |
), |
And then add a check for `None`-ness in the general `get_component`.
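For concreteness, here is a minimal sketch of what that `None`-check could look like in the shared `ProcessorTesterMixin.get_component`; the helper `processor_class_from_name` and the `processor_class`/`tmpdirname` attributes follow the test mixin's conventions, but this is an assumed sketch, not the merged code:

```python
from transformers.models.auto.processing_auto import processor_class_from_name

def get_component(self, attribute, **kwargs):
    # e.g. attribute == "tokenizer" -> ChameleonProcessor.tokenizer_class
    component_class_name = getattr(self.processor_class, f"{attribute}_class")
    if isinstance(component_class_name, tuple):
        # Registrations like (None, "LlamaTokenizerFast") mark models that
        # never had a slow class; skip None entries instead of indexing blindly.
        component_class_name = next(
            name for name in component_class_name if name is not None
        )
    component_class = processor_class_from_name(component_class_name)
    return component_class.from_pretrained(self.tmpdirname, **kwargs)
```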
Just removed Chameleon's slow tokenizer and the custom `get_component` in `ChameleonProcessorTest` (we don't need the extra check because there's only one tokenizer left).
tests/test_processing_common.py (outdated)

```diff
@@ -233,13 +236,14 @@ def test_unstructured_kwargs_batched(self):
     images=image_input,
     return_tensors="pt",
     size={"height": 214, "width": 214},
+    crop_size={"height": 214, "width": 214},
```
@yonigozlan I think you removed `crop_size` from the common tests, and it had something to do with some image processors accepting/not accepting certain kwargs?
@zucchini-nlp Yes, but actually it would be nice to have both here. @molbap had some CI tests crash because `crop_size` was removed here while the image_processor had `do_center_crop` set to `True` by default, which canceled out `size`. Having both handles the cases where either `do_center_crop` is set to `True` in the image_processor by default, or `crop_size` is not supported by the image_processor. So I am for keeping this and merging this PR before some of the other kwargs-uniformization PRs.
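For illustration, a hedged sketch of that interaction, using `CLIPImageProcessor` only because it center-crops by default; the sizes are arbitrary:

```python
from PIL import Image
from transformers import CLIPImageProcessor

image = Image.new("RGB", (640, 480))
image_processor = CLIPImageProcessor()  # do_center_crop defaults to True

# Passing only `size` is not enough: the default crop_size (224x224) is
# applied afterwards and "cancels out" the requested size.
only_size = image_processor(image, size={"shortest_edge": 214}, return_tensors="pt")
print(only_size["pixel_values"].shape)  # torch.Size([1, 3, 224, 224])

# Passing both makes the final shape explicit regardless of the defaults.
both = image_processor(
    image,
    size={"shortest_edge": 214},
    crop_size={"height": 214, "width": 214},
    return_tensors="pt",
)
print(both["pixel_values"].shape)  # torch.Size([1, 3, 214, 214])
```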
```
@@ -24,7 +24,7 @@

from transformers import (
    ChameleonConfig,
    ChameleonForCausalLM,
```
By the way @zucchini-nlp, we might need to increase the priority of this PR because of this. I have this change in my other PR too, but I forgot we haven't merged it yet.
Sorry, I was out for a while. Yes, I think another contributor also reported the issue and wanted to open a PR to fix the conversion script. Feel free to open a PR if there isn't one already, as this issue isn't related to the processor kwargs at all.
Force-pushed from 0dbc570 to b252643
Force-pushed from b252643 to 5082630
Thanks so much for your contribution @leloykun! This PR was a bit outdated compared to main, so I rebased it and fixed some other nits, but otherwise it all looks good to me.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for adding!
Overall looks good, just two main things to address:
- Removing `tests/models/chameleon/test_processor_chameleon.py`
- Undoing the removal of the fallback to the slow tokenizer
"LlamaTokenizer" if is_sentencepiece_available() else None, | ||
"LlamaTokenizerFast" if is_tokenizers_available() else None, | ||
), | ||
(None, "LlamaTokenizerFast" if is_tokenizers_available() else None), |
Why remove the slow tokenizer here?
There was a vocab file missing if I understood correctly, but I will see if it can be added back
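For reference, a rough sketch of how such a `(slow, fast)` registration tuple gets consumed when a tokenizer is requested; this is a simplification for illustration, not the actual `AutoTokenizer` logic:

```python
def pick_tokenizer_class(slow_class, fast_class, use_fast=True):
    # With ("LlamaTokenizer", "LlamaTokenizerFast"), either side can be used;
    # with (None, "LlamaTokenizerFast"), only the fast tokenizer exists, so
    # callers must be prepared for a missing slow class.
    if use_fast and fast_class is not None:
        return fast_class
    if slow_class is not None:
        return slow_class
    raise ValueError("No usable tokenizer class is registered for this model.")
```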
```diff
@@ -45,7 +61,7 @@ class ChameleonProcessor(ProcessorMixin):
     """

     attributes = ["image_processor", "tokenizer"]
-    tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
+    tokenizer_class = "LlamaTokenizerFast"
```
Same here - why remove the slow tokenizer option?
```python
return_tensors (`str` or [`~utils.TensorType`], *optional*):
    If set, will return tensors of a particular framework. Acceptable values are:

    - `'tf'`: Return TensorFlow `tf.constant` objects.
    - `'pt'`: Return PyTorch `torch.Tensor` objects.
    - `'np'`: Return NumPy `np.ndarray` objects.
    - `'jax'`: Return JAX `jnp.ndarray` objects.
```
This should stay in the docstring for the moment, as it's required for users to get the right output from the processor to pass to the model
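As a usage sketch of what that buys users (the gated `facebook/chameleon-7b` checkpoint and the `<image>` placeholder follow the Chameleon docs, but treat this as an assumed example):

```python
from PIL import Image
from transformers import ChameleonProcessor

processor = ChameleonProcessor.from_pretrained("facebook/chameleon-7b")
image = Image.new("RGB", (512, 512))

# With return_tensors="pt" the processor returns torch.Tensors that can be
# passed straight to the model; without it, plain Python lists come back.
inputs = processor(text="<image>Describe the image.", images=image, return_tensors="pt")
print(inputs["input_ids"].shape, inputs["pixel_values"].shape)
```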
To remove?
There are no custom tests, but it still inherits the tests from `ProcessorTesterMixin`.
Sorry, I reviewed too quickly and thought this was a scrap file. We should keep it and:
- Update the checkpoint
- Add a copyright header
The changes look good to me, thanks for the help @yonigozlan!
Thanks for updating!
* uniformize kwargs of Chameleon
* fix linter nit
* rm stride default
* add tests for chameleon processor
* fix tests
* add comment on get_component
* rm Chameleon's slow tokenizer
* add check order images text + nit
* update docs and tests
* Fix LlamaTokenizer tests
* fix gated repo access
* fix wrong import

Co-authored-by: yonigozlan <[email protected]>
What does this PR do?
Uniformizes kwargs of Chameleon processors as discussed in #31911
Currently a draft. Will set as ready for review once this PR gets merged: #32013. The other PR will take longer to complete, but this can now be merged.
Fixes # (issue)
Who can review?
@zucchini-nlp @molbap