Uniformize kwargs for chameleon processor #32181
base: main
Conversation
Force-pushed from b184e46 to 2f4163a.
@zucchini-nlp this should now also be ready for review
Thanks for the contribution! Just one extra default I don't get other than that looks fine
Great job! I think we have to swap the args order and then it will be ready to merge.
    if isinstance(component_class_name, tuple):
        if "_fast" in component_class_name[0]:
            component_class_name = component_class_name[0]
        else:
            component_class_name = component_class_name[1]
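As a standalone sketch (outside the test mixin), the selection logic above behaves like this. The helper name and registry entry are illustrative assumptions, not the actual transformers internals:

```python
# Illustrative sketch of the (slow, fast) tuple convention used by the
# auto classes: a model maps to a pair of tokenizer class names, and either
# slot may be None when its backend isn't installed.
def select(component_class_name):
    # Mirrors the override quoted above: pick the first slot if its name
    # marks it as fast, otherwise fall back to the second slot.
    # Note: this breaks if the first slot is None, which is what the
    # None-check discussion later in the thread is about.
    if isinstance(component_class_name, tuple):
        if "_fast" in component_class_name[0]:
            return component_class_name[0]
        return component_class_name[1]
    return component_class_name

# The slow class name does not contain "_fast", so the second slot wins:
print(select(("LlamaTokenizer", "LlamaTokenizerFast")))  # LlamaTokenizerFast
```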
Same question as in the other PR: why do we need to override this and look for the fast tokenizer? Or is it the fast image processor?
Some of the common tests error out with the base tokenizer. I've yet to investigate why, but it's likely unrelated to this PR.
Yes, it would be nice to see what exactly is causing the error, in case there is a bug in tokenizers.
Ah, I remember now. The slow LlamaTokenizer expects the vocab file to be present, but it's neither in the official repo nor does it get saved to the temp dir when we call processor.save_pretrained(self.tmpdirname) in setUp. I'll add this as a comment.
Ah right, chameleon never had a slow tokenizer. OK, in that case we can and probably should remove the slow tokenizer option for chameleon here, so that the tuple is (None, FastTokenizer):
transformers/src/transformers/models/auto/tokenization_auto.py
Lines 111 to 115 in 85345bb

    "chameleon",
    (
        "LlamaTokenizer" if is_sentencepiece_available() else None,
        "LlamaTokenizerFast" if is_tokenizers_available() else None,
    ),

And then add a None check in the general get_component.
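A minimal sketch of what such a None check could look like, assuming the (slow, fast) tuple convention. The helper name and error message are assumptions, not the actual transformers code:

```python
def resolve_tokenizer_class(component_class_name):
    """Prefer the fast class in a (slow, fast) tuple; tolerate None slots."""
    if isinstance(component_class_name, tuple):
        slow, fast = component_class_name
        # Prefer the fast tokenizer; fall back to the slow one only if
        # the fast slot is None (backend not installed / not registered).
        component_class_name = fast if fast is not None else slow
    if component_class_name is None:
        raise ValueError("No tokenizer class is available for this model.")
    return component_class_name

# With chameleon's entry changed to (None, "LlamaTokenizerFast"), this
# resolves to the fast class without ever touching the None slot:
print(resolve_tokenizer_class((None, "LlamaTokenizerFast")))
```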
Just removed Chameleon's slow tokenizer & removed the custom get_component in ChameleonProcessorTest (we don't need the extra check since there's only one tokenizer left).
@@ -233,13 +236,14 @@ def test_unstructured_kwargs_batched(self):
    images=image_input,
    return_tensors="pt",
    size={"height": 214, "width": 214},
    crop_size={"height": 214, "width": 214},
@yonigozlan I think you removed crop_size from the common tests, and it had something to do with some image processors accepting/not accepting certain kwargs?
@zucchini-nlp Yes, but actually it would be nice to have both here. @molbap had some CI tests crash because crop_size was removed here while the image_processor had do_center_crop set to True by default, which canceled out size. Having both handles cases where either do_center_crop is set to True in the image_processor by default, or crop_size is not supported by the image_processor.
So I am for keeping this and merging this PR before the other kwargs uniformization PRs.
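The interaction being described can be sketched without transformers at all. The function below is a toy model, and the 224x224 default crop is a made-up value for illustration, not the real processor default:

```python
# Toy model of an image processor where do_center_crop defaults to True:
# the crop runs after the resize, so without an explicit crop_size the
# processor's default crop silently overrides the requested size.
def output_size(size, crop_size=None, do_center_crop=True):
    h, w = size["height"], size["width"]
    if do_center_crop:
        crop = crop_size or {"height": 224, "width": 224}  # assumed default
        h, w = crop["height"], crop["width"]
    return h, w

# size alone is canceled out by the default center crop:
print(output_size({"height": 214, "width": 214}))  # (224, 224)
# passing crop_size as well keeps the requested output size:
print(output_size({"height": 214, "width": 214},
                  crop_size={"height": 214, "width": 214}))  # (214, 214)
```

This is why the test above passes both kwargs: it stays correct whether the image processor center-crops by default or ignores crop_size entirely.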
@@ -24,7 +24,7 @@
from transformers import (
    ChameleonConfig,
    ChameleonForCausalLM,
Btw @zucchini-nlp, we might need to increase the priority of this PR because of this.
I have this change in my other PR too, but I forgot we haven't merged it yet.
Sorry, I was out for a while. Yes, I think another contributor also reported the issue and wanted to open a PR to fix the conversion script. Feel free to open a PR if there isn't one already, as this issue isn't related to processor kwargs at all.
Force-pushed from 0dbc570 to b252643.
What does this PR do?
Uniformizes kwargs of the Chameleon processor as discussed in #31911.
Currently a draft. Will be set as ready for review once this PR gets merged: #32013. The other PR will take longer to complete, but this can now be merged.
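As a rough, library-free illustration of what "uniformizing kwargs" means here, flat call-site kwargs get routed into per-modality groups with defaults instead of each processor hand-parsing its own arguments. The group names and defaults below are assumptions for the sketch, not the actual transformers ProcessingKwargs API:

```python
# Toy version of merging flat call-site kwargs into per-modality groups,
# the pattern the kwargs-uniformization PRs move processors toward.
DEFAULTS = {
    "text_kwargs": {"padding": False, "truncation": False},
    "images_kwargs": {"size": None, "crop_size": None},
    "common_kwargs": {"return_tensors": None},
}

def merge_kwargs(defaults, **kwargs):
    # Start from a deep-enough copy of the defaults, then route each
    # caller-supplied kwarg into the group that declares it.
    merged = {group: dict(opts) for group, opts in defaults.items()}
    for name, value in kwargs.items():
        for group, opts in defaults.items():
            if name in opts:
                merged[group][name] = value
                break
        else:
            raise TypeError(f"Unexpected kwarg: {name!r}")
    return merged

out = merge_kwargs(DEFAULTS, padding=True, size={"height": 214, "width": 214})
print(out["text_kwargs"]["padding"])   # True
print(out["images_kwargs"]["size"])    # {'height': 214, 'width': 214}
```

The payoff is that every processor accepts the same flat signature, and unknown kwargs fail loudly instead of being silently dropped.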
Who can review?
@zucchini-nlp @molbap