Support gemma3 HF tokenizer.json #96
@larryliu0820 has imported this pull request. If you are a Meta employee, you can view this in D77761574.
- Remove std::cout
- Add gemma tokenizer.json for E2E tests
- Add tests for encode merging (rank is respected, merge is made), can use small merge list
Too big for internal. I'll add test infra in next PRs.
This PR beefs up the HF tokenizer.

HF tokenizer changes
- Added the Normalizer base class and its derived classes (ReplaceNormalizer and SequenceNormalizer) to support customizable string normalization. A factory class, NormalizerConfig, was added to simplify normalizer creation and configuration.
- Added the HFWord structure and implemented HF-specific token merging logic in _byte_pair_merge. Overrode byte_pair_encode_ to integrate normalization and pre-tokenization. [1] [2]
- Updated HFTokenizer, allowing it to load and use normalizers from JSON configuration during tokenizer initialization.

BPE Improvements:
- Added the MergeMap type and the buildMergeRanksMap utility function to handle BPE merge rules efficiently. This ensures proper handling of token merging based on ranks.
- Made the merge methods (_byte_pair_merge and byte_pair_encode_) overridable to allow derived classes to customize BPE merging logic. Updated BPETokenizerBase to use pre-computed merge ranks for token merging. [1] [2]

Tested with Gemma3 tokenizer.json manually.
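
For readers unfamiliar with the normalizer design described above, here is a minimal sketch of how a base class, a replace step, and a sequence of steps can fit together. The class names mirror the ones in this PR, but the method name normalize, the constructor signatures, and the literal-only matching are assumptions made for illustration, not the PR's actual API.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface: the real Normalizer in this PR may differ.
class Normalizer {
 public:
  virtual ~Normalizer() = default;
  virtual std::string normalize(const std::string& input) const = 0;
};

// Replaces every occurrence of a literal pattern with the given content.
// (Assumption: literal matching only; no regex support in this sketch.)
class ReplaceNormalizer : public Normalizer {
 public:
  ReplaceNormalizer(std::string pattern, std::string content)
      : pattern_(std::move(pattern)), content_(std::move(content)) {}

  std::string normalize(const std::string& input) const override {
    if (pattern_.empty()) {
      return input;  // nothing to match against
    }
    std::string out;
    std::size_t pos = 0;
    while (true) {
      const std::size_t hit = input.find(pattern_, pos);
      if (hit == std::string::npos) {
        out.append(input, pos, std::string::npos);  // copy the tail
        break;
      }
      out.append(input, pos, hit - pos);  // copy text before the match
      out += content_;
      pos = hit + pattern_.size();
    }
    return out;
  }

 private:
  std::string pattern_;
  std::string content_;
};

// Applies a list of normalizers in order, feeding each output to the next.
class SequenceNormalizer : public Normalizer {
 public:
  explicit SequenceNormalizer(std::vector<std::unique_ptr<Normalizer>> steps)
      : steps_(std::move(steps)) {}

  std::string normalize(const std::string& input) const override {
    std::string text = input;
    for (const auto& step : steps_) {
      text = step->normalize(text);
    }
    return text;
  }

 private:
  std::vector<std::unique_ptr<Normalizer>> steps_;
};
```

With this shape, a factory such as NormalizerConfig would only need to read the "type" field of the JSON normalizer entry and construct the matching subclass; that part is omitted here because the JSON layout handled by the PR is not shown in this conversation.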
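
The rank-based merging mentioned under BPE Improvements can be sketched roughly as follows. MergeMap and buildMergeRanksMap are named after the PR, but their signatures here are guesses, and bytePairMerge is a hypothetical stand-in for _byte_pair_merge that operates on string pieces rather than on the PR's internal types. It repeatedly merges the adjacent pair with the lowest rank, which is the standard BPE rule; the PR's implementation may use a different (and faster) strategy.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: the PR's MergeMap may be keyed differently.
using MergePair = std::pair<std::string, std::string>;
using MergeMap = std::map<MergePair, uint32_t>;  // pair -> rank, lower merges first

// Rank of a merge rule = its position in the merges list.
MergeMap buildMergeRanksMap(const std::vector<MergePair>& merges) {
  MergeMap ranks;
  for (std::size_t i = 0; i < merges.size(); ++i) {
    // emplace keeps the first (lowest) rank if a pair appears twice.
    ranks.emplace(merges[i], static_cast<uint32_t>(i));
  }
  return ranks;
}

// Hypothetical helper: merge adjacent pieces until no known pair remains.
std::vector<std::string> bytePairMerge(std::vector<std::string> pieces,
                                       const MergeMap& ranks) {
  while (pieces.size() > 1) {
    uint32_t best_rank = std::numeric_limits<uint32_t>::max();
    std::size_t best_idx = pieces.size();  // sentinel: no mergeable pair found
    for (std::size_t i = 0; i + 1 < pieces.size(); ++i) {
      const auto it = ranks.find(MergePair(pieces[i], pieces[i + 1]));
      if (it != ranks.end() && it->second < best_rank) {
        best_rank = it->second;
        best_idx = i;
      }
    }
    if (best_idx == pieces.size()) {
      break;  // nothing left to merge
    }
    pieces[best_idx] += pieces[best_idx + 1];
    pieces.erase(pieces.begin() + static_cast<std::ptrdiff_t>(best_idx) + 1);
  }
  return pieces;
}
```

This also covers the kind of small-merge-list test requested in review: with the merges ("b", "c") then ("a", "b"), the pieces "a", "b", "c" become "a", "bc", because the lower-ranked ("b", "c") rule wins over the leftmost pair and no rule exists for ("a", "bc").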