[SigLIP] Add fast tokenizer #29969
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@ArthurZucker I'm down to these 3 tests failing:
but I don't really know how to fix these. Are you able to look into these?
test_tokenization_python_rust_equals is the only one you really need. The others are not well designed TBH.
should be removed
self.vocab_file = vocab_file

@property
lots of "Copied from" comments are missing here as well
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
is this getting merged anytime soon?
Comments need to be addressed. cc @itazap if you want to take this over!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @itazap do you have bandwidth to work on this? There are only 2 tests remaining to be fixed (for which I'd need some guidance).
@NielsRogge Thank you for your patience! I'm looking into the failing tests now
@NielsRogge Upon further inspection of the failing tests, the rust tokenizer is not equal to the python tokenizer. There are some key issues/differences, including:
list_normalizers.append(normalizers.Replace(Regex(r"[" + re.escape(string.punctuation) + "]"), ""))
def pre_tokenizer(self, replacement, add_prefix_space):
prepend_scheme = _get_prepend_scheme(add_prefix_space, self.original_tokenizer)
return pre_tokenizers.Sequence(
[
pre_tokenizers.WhitespaceSplit(),
pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme),
]
)
This should be a good starting point in regards to where to debug. Let me know if I can answer any questions about the above! Also, it may be helpful to merge the latest main code into this branch!
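For anyone reproducing this locally, a minimal comparison script could look like the sketch below. It is illustrative only: it assumes this branch's SigLIP converter is registered in convert_slow_tokenizer and that the google/siglip-base-patch16-224 checkpoint is reachable.

# Illustrative debugging sketch (not part of this PR): build the fast tokenizer from
# the slow one via the converter discussed above and diff their outputs on a few strings.
from transformers import PreTrainedTokenizerFast, SiglipTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")
fast = PreTrainedTokenizerFast(tokenizer_object=convert_slow_tokenizer(slow))

samples = ["Hello, world!", "a photo of 2 cats", "  extra   spaces, punctuation!!  "]
for text in samples:
    if slow.encode(text) != fast.encode(text):
        print(f"{text!r}\n  slow: {slow.tokenize(text)}\n  fast: {fast.tokenize(text)}")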
Thanks for looking into it,
@NielsRogge in that function, the
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @itazap would you be interested in working on this?
@NielsRogge Yes I can look into this! 😄 I may follow up to ask about expected behaviour of siglip 👀
@NielsRogge are you able to please rebase when you have a chance? I don't have permission to push a rebase on this!
Thanks, I don't think we need a class, as the key part is to properly convert to a PreTrainedTokenizerFast!
to remove
Not sure we even need that class, no? We could just use a PreTrainedTokenizerFast.
Oh I didn't consider we could do that! Is there an example I can reference? Not sure what to do with the functions copied over from T5 here.
Also, looking more into the functions here, would it be better to move common functions like save_vocabulary (duplicated in 15 _fast files) to PreTrainedTokenizerFast?
Just a question: wouldn't that confuse users if there's no dedicated class with the name of the model?
@NielsRogge yes, I see your point, I also had the same thought. Do we have other fast models we support without a dedicated class? @ArthurZucker
We have quite a lot of other models that just use PreTrainedTokenizerFast, including Llama (Llama v3), all Mambas, etc. Tokenizers are more prone to change than models (you could have Mamba with LlamaTokenizer), so it makes more sense to deprecate slow and model-specific ones.
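As a rough illustration of that pattern (a sketch only; it assumes the checkpoint ships a tokenizer.json, which is what the hub change discussed below adds):

# Sketch (assumes a tokenizer.json exists in the repo): the generic fast class is
# enough, no SiglipTokenizerFast has to exist.
from transformers import PreTrainedTokenizerFast

tok = PreTrainedTokenizerFast.from_pretrained("google/siglip-base-patch16-224")
print(type(tok).__name__, tok("a photo of 2 cats").input_ids)

This is also how, e.g., Llama 3 checkpoints ship their tokenizer: tokenizer_config.json sets tokenizer_class to PreTrainedTokenizerFast and AutoTokenizer loads the generic class directly.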
You don't need them, they are built on the layer of PreTrainedTokenizerFast + we can embed stuff inside the tokenizer fast itself.
@ArthurZucker Sorry, not really understanding: do you mean to just remove this file entirely and not worry about the functions? (Or do we need to embed them somewhere?)
yeah just remove it entirely
@ArthurZucker and would we have to add a tokenizer.json to the hub?
added the fast tokenizer.json file to https://huggingface.co/google/siglip-base-patch16-224/discussions/4
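For completeness, a rough sketch of how that file can be produced and proposed as a hub PR (illustrative only; it assumes this branch's converter is available and that you can open pull requests against the repo):

# Build the fast tokenizer from the slow one, write tokenizer.json, and open a hub PR.
from huggingface_hub import upload_file
from transformers import PreTrainedTokenizerFast, SiglipTokenizer
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")
fast = PreTrainedTokenizerFast(tokenizer_object=convert_slow_tokenizer(slow))
fast.backend_tokenizer.save("tokenizer.json")  # the serialized tokenizers-backend file

upload_file(
    path_or_fileobj="tokenizer.json",
    path_in_repo="tokenizer.json",
    repo_id="google/siglip-base-patch16-224",
    create_pr=True,  # open a hub discussion/PR like the one linked above, rather than pushing to main
)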
going to request merging of the commits on the hub for the tokenizer.json files!
Sounds good!
Hey! 🤗 Thanks for your contribution to the transformers library. Before merging this pull request, slow tests CI should be triggered. To enable this:
(For maintainers) The documentation for slow tests CI on PRs is here.
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Arthur <[email protected]>
@@ -201,6 +201,11 @@ def test_PreTrainedTokenizerFast_from_pretrained(self):
        self.assertEqual(tokenizer.padding_side, "right")
        self.assertEqual(tokenizer.truncation_side, "right")

    def test_PreTrainedTokenizerFast_inferred(self):
@ArthurZucker added this test for #33751 as Siglip would be the first model that can test this (#33751 (comment))
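For context, a hypothetical shape of such a test (illustrative, not the exact code in this diff; it assumes the fallback from #33751/#34212 is in place and that the checkpoint ships a tokenizer.json):

# Hypothetical sketch: with no SiglipTokenizerFast class defined, AutoTokenizer should
# still return a generic fast tokenizer inferred from the checkpoint's tokenizer.json.
import unittest

from transformers import AutoTokenizer, PreTrainedTokenizerFast

class InferredFastTokenizerTest(unittest.TestCase):
    def test_PreTrainedTokenizerFast_inferred(self):
        tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224")
        self.assertIsInstance(tokenizer, PreTrainedTokenizerFast)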
"SiglipTokenizer" if is_sentencepiece_available() else None, | ||
"PreTrainedTokenizerFast" if is_tokenizers_available() else None, | ||
), | ||
), |
ref for #33751 (comment)
Let's revert the small change to the llama test and then it's good to merge!
@ArthurZucker @NielsRogge we cannot merge until we merge a feature to support loading Fast without a Fast tokenizer class specified in the tokenizer_config file! We need either #34212 or #33751 merged first! (That is why this PR is failing: the test fails because the feature is not yet supported.)
No worries, let's wait a bit to refactor 🤗
What does this PR do?
Fixes #29925.
🔴 Breaking change! Updates SiglipTokenizer, specifically strip behaviour in tokenize.
To do: