[SigLIP] Add fast tokenizer #29969

Open · wants to merge 29 commits into main from add_siglip_fast_tokenizer_bis
Conversation

@NielsRogge (Contributor) commented Mar 30, 2024

What does this PR do?

Fixes #29925.

🔴 Breaking change! Updates SiglipTokenizer, specifically the strip behaviour in tokenize.

To do:

  • fix remaining tests
  • add slow integration test
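For context, the end goal is that SigLIP loads a fast (Rust-backed) tokenizer out of the box. A minimal usage sketch of the intended behaviour (the checkpoint id and the is_fast check are illustrative assumptions, not part of this PR):

    from transformers import AutoTokenizer

    # Once a fast tokenizer exists for SigLIP, AutoTokenizer should be able
    # to honour use_fast=True instead of silently returning the slow one.
    tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
    print(tokenizer.is_fast)  # expected: True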

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@NielsRogge (Contributor, Author) commented

@ArthurZucker I'm down to these 3 tests failing:

FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_added_tokens_do_lower_case - AssertionError: 'aaaaa bbbbbb ' == 'aaaaa bbbbbb '
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_special_tokens_initialization - AssertionError: Lists differ: [342, 322, 291, 269, 262, 266, 32100, 507, 4290, 1] != [342, 322, 291, 269, 262, 266, 32100, 12936, 1]
FAILED tests/models/siglip/test_tokenization_siglip.py::SiglipTokenizationTest::test_tokenization_python_rust_equals - AssertionError: Sequences differ: [291,[64 chars]62, 232, 141, 158, 232, 141, 163, 232, 142, 16[5335 chars]3, 1] != [291,[64 chars]62, 2, 16577, 266, 2, 1443, 412, 282, 1791, 13[517...

but I don't really know how to fix these. Are you able to look into these?

@ArthurZucker (Collaborator) left a comment

test_tokenization_python_rust_equals is the only one you really need. The others are not well designed, TBH.

Collaborator review comment on the diff (at `self.vocab_file = vocab_file`):

should be removed
Collaborator review comment:

Lots of `# Copied from` comments are missing here as well.

(Resolved review thread on tests/test_tokenization_common.py.)
github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@yxchng commented May 29, 2024

is this getting merged anytime soon?

@ArthurZucker (Collaborator) commented

Comments need to be addressed. cc @itazap if you want to take this over!

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this Jul 8, 2024
@NielsRogge (Contributor, Author) commented

Hi @itazap, do you have bandwidth to work on this? Only 2 tests remain to be fixed (for which I'd need some guidance).

@itazap (Collaborator) commented Jul 9, 2024

@NielsRogge Thank you for your patience! I'm looking into the failing tests now

itazap reopened this Jul 9, 2024
@itazap (Collaborator) commented Jul 9, 2024

@NielsRogge Upon further inspection of the failing tests, the Rust tokenizer is not equal to the Python tokenizer. There are some key issues/differences, including:

  1. SiglipConverter.normalizer drops all punctuation, including ["<", ">"], which need to be kept for the <special> tokens you are adding (see the sketch after this list):

     list_normalizers.append(normalizers.Replace(Regex(r"[" + re.escape(string.punctuation) + "]"), ""))

  2. Also for SiglipConverter, I believe a WhitespaceSplit() should be added to the pre-tokenizer, like below (see T5):

        def pre_tokenizer(self, replacement, add_prefix_space):
            prepend_scheme = _get_prepend_scheme(add_prefix_space, self.original_tokenizer)
            return pre_tokenizers.Sequence(
                [
                    pre_tokenizers.WhitespaceSplit(),
                    pre_tokenizers.Metaspace(replacement=replacement, prepend_scheme=prepend_scheme),
                ]
            )

  3. The Rust (fast) and Python (slow) tokenizers tokenize differently. For example, if you call .tokenize('<special> token'), the fast output will include ['token'], while the slow output will include ['to', 'ken']. I'm not sure which is expected, as I am not too familiar with this model. Do you know which is expected?

  4. [EDIT] do_lower_case should be True for the failing test_added_tokens_do_lower_case.
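A minimal sketch of the normalizer tweak hinted at in item 1, assuming the converter keeps building its punctuation filter with normalizers.Replace; the variable names and the Lowercase step are illustrative assumptions, not the converter's actual code:

    import re
    import string

    from tokenizers import Regex, normalizers

    # Drop punctuation as before, but keep "<" and ">" so added tokens
    # such as "<special>" survive normalization.
    punctuation_to_drop = string.punctuation.replace("<", "").replace(">", "")

    normalizer = normalizers.Sequence(
        [
            normalizers.Replace(Regex(r"[" + re.escape(punctuation_to_drop) + r"]"), ""),
            normalizers.Lowercase(),
        ]
    )

    print(normalizer.normalize_str("Hello, <special> World!"))  # e.g. "hello <special> world"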

This should be a good starting point for where to debug. Let me know if I can answer any questions about the above!

Also, it may be helpful to merge the latest main into this branch!

@NielsRogge (Contributor, Author) commented Jul 9, 2024

Thanks for looking into it. SiglipTokenizer is exactly the same as T5Tokenizer, except that it includes this function before calling _tokenize (as seen here). I was mainly wondering how this functionality could be incorporated into the fast tokenizer.

@itazap (Collaborator) commented Jul 9, 2024

@NielsRogge in that function, the text = text.strip() is causing the discrepancy. In PreTrainedTokenizer.tokenize(), the input string gets split on special tokens. Then your canonicalize_text may be called on a substring (such as a single token) when special tokens are present, or on the whole string/sentence when there are none (example: ' token' vs 'hey token'), as illustrated below. I'm not exactly sure what canonicalize_text is trying to achieve with the .strip(), but the link you provided mentions only punctuation and lowercasing. Without the .strip() it should be more consistent, but I'm not sure whether that will give the same results.
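A small, self-contained illustration of that discrepancy, using a paraphrased stand-in for the canonicalization step (the helper below is an assumption for demonstration, not the library's exact implementation):

    import re
    import string

    def canonicalize_text(text: str) -> str:
        # Paraphrased sketch: drop punctuation, collapse whitespace, then strip.
        text = text.translate(str.maketrans("", "", string.punctuation))
        text = re.sub(r"\s+", " ", text)
        return text.strip()

    # The slow path splits on special tokens first, so the helper sees " token"
    # and strips its leading space; the fast path normalizes the whole string,
    # so the space before "token" is preserved.
    print(repr(canonicalize_text(" token")))     # 'token'
    print(repr(canonicalize_text("hey token")))  # 'hey token'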


github-actions bot commented Aug 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot closed this Aug 12, 2024
@NielsRogge (Contributor, Author) commented

Hi @itazap, would you be interested in working on this?

@itazap (Collaborator) commented Aug 19, 2024

@NielsRogge Yes I can look into this! 😄 I may follow up to ask about expected behaviour of siglip 👀

itazap reopened this Aug 19, 2024
@itazap (Collaborator) commented Aug 22, 2024

@NielsRogge are you able to please rebase when you have a chance? I don't have permission to push a rebase on this!

@ArthurZucker (Collaborator) left a comment

Thanks, I don't think we need a class, as the key part is to properly convert to a PreTrainedTokenizerFast!

Collaborator review comment on src/transformers/models/siglip/tokenization_siglip.py (outdated, resolved):

to remove
Collaborator review comment:

Not sure we even need that class, no? We could just use a PreTrainedTokenizerFast.

Collaborator review comment:

Oh, I didn't consider we could do that! Is there an example I can reference? I'm not sure what to do with the functions copied over from T5 here.

Also, looking more into the functions here, would it be better to move common functions like save_vocabulary (duplicated in 15 _fast files) to PreTrainedTokenizerFast?

Contributor (author) review comment:

Just a question: wouldn't that confuse users if there's no dedicated class with the name of the model?

Collaborator review comment:

@NielsRogge yes, I see your point, I also had the same thought. Do we have other fast models we support without a dedicated class? @ArthurZucker

Collaborator review comment:

We have quite a lot of other models that just use PreTrainedTokenizerFast, including Llama (Llama v3) and all the Mambas. Tokenizers are more prone to change than models (you could have Mamba with LlamaTokenizer), so it makes more sense to deprecate the slow and model-specific ones.
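To make that concrete, a minimal sketch of the "no dedicated class" pattern: load the generic fast tokenizer directly from a repo that ships a tokenizer.json (the checkpoint id below is an illustrative assumption):

    from transformers import PreTrainedTokenizerFast

    # Llama 3-style repos are handled by the generic fast class; no
    # model-specific *TokenizerFast subclass is needed.
    tokenizer = PreTrainedTokenizerFast.from_pretrained("meta-llama/Meta-Llama-3-8B")
    print(type(tokenizer).__name__)  # PreTrainedTokenizerFast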

Collaborator review comment:

You don't need them, they are built on the PreTrainedTokenizerFast layer + we can embed stuff inside the fast tokenizer itself.

Collaborator review comment:

@ArthurZucker Sorry, I'm not really understanding: do you mean to just remove this file entirely and not worry about the functions? (Or do we need to embed them somewhere?)

Collaborator review comment:

Yeah, just remove it entirely.

Collaborator review comment:

@ArthurZucker and would we have to add a tokenizer.json to the Hub?
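One plausible way to produce that tokenizer.json, sketched under the assumption that a SigLIP converter is registered for convert_slow_tokenizer (the checkpoint id and output path are illustrative):

    from transformers import SiglipTokenizer
    from transformers.convert_slow_tokenizer import convert_slow_tokenizer

    # Convert the slow (sentencepiece) tokenizer into a tokenizers.Tokenizer
    # and write out the file that would be uploaded to the Hub repo.
    slow = SiglipTokenizer.from_pretrained("google/siglip-base-patch16-224")
    fast_backend = convert_slow_tokenizer(slow)
    fast_backend.save("tokenizer.json")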

itazap force-pushed the add_siglip_fast_tokenizer_bis branch from 239665e to e73fa01 on September 20, 2024, 15:41
@itazap (Collaborator) left a comment

Going to request merging commits on the Hub for the tokenizer.json files!

@ArthurZucker (Collaborator) commented

Sounds good!

@HuggingFaceDocBuilderDev

Hey! 🤗 Thanks for your contribution to the transformers library!

Before merging this pull request, slow tests CI should be triggered. To enable this:

  • Add the run-slow label to the PR
  • When your PR is ready for merge and all reviewers' comments have been addressed, push an empty commit with the command [run-slow] followed by a comma-separated list of all the models to be tested, i.e. [run_slow] model_to_test_1, model_to_test_2
    • If the pull request affects a lot of models, put at most 10 models in the commit message
  • A transformers maintainer will then approve the workflow to start the tests

(For maintainers) The documentation for slow tests CI on PRs is here.

(Outdated, resolved review threads on src/transformers/models/siglip/__init__.py and tests/models/llama/test_tokenization_llama.py.)
Review thread on the diff:

    @@ -201,6 +201,11 @@ def test_PreTrainedTokenizerFast_from_pretrained(self):
            self.assertEqual(tokenizer.padding_side, "right")
            self.assertEqual(tokenizer.truncation_side, "right")

        def test_PreTrainedTokenizerFast_inferred(self):

Collaborator review comment:

@ArthurZucker added this test for #33751, as Siglip would be the first model that can test this (#33751 (comment)).
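For readers, a hedged sketch of what such an "inferred fast tokenizer" check could look like (the class name, test body, and checkpoint id are assumptions, not the PR's actual test code):

    import unittest

    from transformers import AutoTokenizer, PreTrainedTokenizerFast

    class SiglipFastTokenizerInferenceSketch(unittest.TestCase):
        def test_PreTrainedTokenizerFast_inferred(self):
            # With no SiglipTokenizerFast class registered, AutoTokenizer should
            # fall back to the generic PreTrainedTokenizerFast built from
            # tokenizer.json.
            tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-224", use_fast=True)
            self.assertIsInstance(tokenizer, PreTrainedTokenizerFast)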

"SiglipTokenizer" if is_sentencepiece_available() else None,
"PreTrainedTokenizerFast" if is_tokenizers_available() else None,
),
),
@ArthurZucker (Collaborator) left a comment

Let's revert the small change to the llama test and it's good to merge!

(Outdated review thread on tests/models/llama/test_tokenization_llama.py, resolved.)
@itazap (Collaborator) commented Oct 24, 2024

@ArthurZucker @NielsRogge we cannot merge until we merge a feature to support loading a fast tokenizer without a fast tokenizer class specified in the tokenizer_config file! We need either #34212 or #33751 merged first! (That is why this PR is failing: the test fails because the feature is not yet supported.)

@ArthurZucker (Collaborator) commented

No worries, let's wait a bit to refactor 🤗

Successfully merging this pull request may close these issues.

SigLIP tokenizer not enforcing use_fast=True