Add enforce_utf8_boundaries option to BpeTrainer #1830
base: main
  /// Converts bytes to unicode characters.
  /// See https://github.com/openai/gpt-2/blob/master/src/encoder.py#L9
- pub(crate) fn bytes_char() -> AHashMap<u8, char> {
+ pub fn bytes_char() -> AHashMap<u8, char> {
Why is this made `pub`? This seems like a mistake. In fact, it seems even the previous `pub(crate)` is misplaced. Can you make this private and reuse the `CHAR_BYTES` static in this module by making it `pub(crate)`?
The function is used in `normalizers/byte_level.rs`, but I will make it `pub(crate)` again and use `CHAR_BYTES`.
Yes, but `normalizers/byte_level.rs` only needs the function to build an exact copy of the static `BYTE_TO_CHAR`. So as a drive-by cleanup, we could make just the statics `pub(crate)` and reuse them across both `pre_tokenizers/byte_level.rs` and `normalizers/byte_level.rs`. We can then make this function private.
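A minimal sketch of the cleanup suggested in this thread, assuming the table follows the GPT-2 `bytes_to_unicode` scheme linked in the doc comment above. It uses `std::collections::HashMap` instead of `AHashMap` so that it compiles without external crates, and the static name `BYTES_CHAR` is illustrative rather than taken from the codebase. The builder function stays private and only the two statics are `pub(crate)`:

```rust
use std::collections::HashMap;
use std::sync::LazyLock;

/// Builds the byte -> printable-char table (GPT-2 `bytes_to_unicode` scheme).
/// Private: other modules are expected to read the statics below instead.
fn bytes_char() -> HashMap<u8, char> {
    // Bytes that are already printable map to themselves.
    let mut bs: Vec<u8> = Vec::new();
    bs.extend(b'!'..=b'~');
    bs.extend(b'\xA1'..=b'\xAC');
    bs.extend(b'\xAE'..=b'\xFF');
    let mut cs: Vec<u32> = bs.iter().map(|b| u32::from(*b)).collect();

    // Every remaining byte is shifted to a code point starting at 256.
    let mut n: u32 = 0;
    for b in 0..=255u8 {
        if !bs.contains(&b) {
            bs.push(b);
            cs.push(256 + n);
            n += 1;
        }
    }

    bs.into_iter()
        .zip(cs)
        .map(|(b, c)| (b, char::from_u32(c).expect("code point below 0x200 is valid")))
        .collect()
}

/// Shared byte -> char table (hypothetical name), built once on first use.
pub(crate) static BYTES_CHAR: LazyLock<HashMap<u8, char>> = LazyLock::new(bytes_char);

/// Inverse table, derived from `BYTES_CHAR` so the two can never disagree.
pub(crate) static CHAR_BYTES: LazyLock<HashMap<char, u8>> =
    LazyLock::new(|| BYTES_CHAR.iter().map(|(b, c)| (*c, *b)).collect());
```

With this layout, `normalizers/byte_level.rs` and the trainer would import the statics rather than calling the function to build their own copies.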
tokenizers/src/models/bpe/trainer.rs
Outdated
static BYTE_TO_CHAR: LazyLock<AHashMap<u8, char>> = LazyLock::new(bytes_char);
static CHAR_TO_BYTE: LazyLock<AHashMap<char, u8>> =
    LazyLock::new(|| BYTE_TO_CHAR.iter().map(|(b, c)| (*c, *b)).collect());
Can you reuse the statics already in the codebase, see other comment?
will do, just missed them I think :)
tokenizers/src/models/bpe/trainer.rs
Outdated
// Rule 3 (Implicit): Any mix of complete and incomplete is disallowed.
if is_a_complete || is_b_complete {
    return false;
}
Can this be changed to use xor? It took me a bit longer than necessary to grasp that the `||` works the same as `^` here, because the `is_a_complete && is_b_complete` check has already been done. With `^` the operator matches the comment more closely.
sure
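To illustrate the point made in this thread, here is a small sketch that compares the two spellings of Rule 3. The function names, the early return for two complete pieces, and the result for two incomplete pieces are assumptions made for illustration; the thread only shows the Rule 3 branch itself.

```rust
/// Hypothetical stand-in for the merge check, with Rule 3 written using `^`.
fn merge_allowed_xor(is_a_complete: bool, is_b_complete: bool) -> bool {
    // Assumed earlier rule: the both-complete case is handled first.
    if is_a_complete && is_b_complete {
        return true;
    }
    // Rule 3: exactly one side is complete, i.e. a mix of complete and incomplete.
    if is_a_complete ^ is_b_complete {
        return false;
    }
    // Both incomplete: placeholder result, the real checks are not shown in the thread.
    true
}

/// Same control flow, with Rule 3 written using `||` as in the reviewed code.
fn merge_allowed_or(is_a_complete: bool, is_b_complete: bool) -> bool {
    if is_a_complete && is_b_complete {
        return true;
    }
    // Equivalent to `^` only because the both-complete case has already returned.
    if is_a_complete || is_b_complete {
        return false;
    }
    true
}
```

Both functions agree on all four input combinations, so the change is purely about readability: `^` says "exactly one side is complete" without relying on the earlier return.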
/// Validates compatibility between a trainer and the current tokenizer configuration.
/// Currently only checks:
// For BpeTrainer with `enforce_utf8_boundaries=True` => pretokenizer must be ByteLevel.
fn _check_trainer_compat<T: Trainer<Model = M> + 'static>(&self, trainer: &T) -> Result<()> {
    // Use `Any` to safely check for the BpeTrainer type at runtime
    if let Some(bpe_trainer) = (trainer as &dyn Any).downcast_ref::<bpe::BpeTrainer>() {
        if bpe_trainer.enforce_utf8_boundaries {
            // Now check if the pre_tokenizer is ByteLevel
            let is_byte_level = self.pre_tokenizer.as_ref().map_or(false, |pretok| {
                (pretok as &dyn Any).is::<pre_tokenizers::byte_level::ByteLevel>()
            });

            if !is_byte_level {
                return Err(
                    "`enforce_utf8_boundaries=True` can only be used with a `ByteLevel` pre-tokenizer."
                        .into()
                );
            }
        }
    }
    Ok(())
}
This code does not compile. The cast `pretok as &dyn Any` is the culprit, because there isn't a `'static` bound on the pretokenizer.
Because of the ugly restrictions on `Any`, it might be better to introduce default trait methods to `Trainer` and `PreTokenizer` like this:
trait Trainer {
    // ... previous methods
    fn enforce_utf8_boundaries(&self) -> Option<bool> {
        None
    }
}
impl Trainer for BpeTrainer {
    // .. previous methods
    fn enforce_utf8_boundaries(&self) -> Option<bool> {
        Some(self.enforce_utf8_boundaries)
    }
}
trait PreTokenizer {
    // ... previous methods
    fn is_byte_level(&self) -> bool {
        false
    }
}
impl PreTokenizer for ByteLevel {
    // .. previous methods
    fn is_byte_level(&self) -> bool {
        true
    }
}
This is a bit more type-safe, will catch more errors at compile time, and is a bit easier to understand imo than `'static` bounds. But this probably should get the input of the maintainers.
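For reference, a self-contained sketch of how the compatibility check could look with the default trait methods proposed above. The traits and structs here are stripped-down stand-ins rather than the real `tokenizers` types, and `check_trainer_compat` and `Whitespace` are hypothetical names used only for this example:

```rust
// Simplified stand-ins for the real traits; only the new default methods are shown.
trait Trainer {
    /// `None` means the trainer has no such option.
    fn enforce_utf8_boundaries(&self) -> Option<bool> {
        None
    }
}

trait PreTokenizer {
    fn is_byte_level(&self) -> bool {
        false
    }
}

struct BpeTrainer {
    enforce_utf8_boundaries: bool,
}

impl Trainer for BpeTrainer {
    fn enforce_utf8_boundaries(&self) -> Option<bool> {
        Some(self.enforce_utf8_boundaries)
    }
}

struct ByteLevel;

impl PreTokenizer for ByteLevel {
    fn is_byte_level(&self) -> bool {
        true
    }
}

/// Hypothetical non-byte-level pre-tokenizer that keeps the default `false`.
struct Whitespace;

impl PreTokenizer for Whitespace {}

/// Hypothetical check: no `Any`, no downcasting, no `'static` bound needed.
fn check_trainer_compat(
    trainer: &dyn Trainer,
    pre_tokenizer: Option<&dyn PreTokenizer>,
) -> Result<(), String> {
    if trainer.enforce_utf8_boundaries() == Some(true) {
        let is_byte_level = pre_tokenizer.map_or(false, |p| p.is_byte_level());
        if !is_byte_level {
            return Err(
                "`enforce_utf8_boundaries=True` can only be used with a `ByteLevel` pre-tokenizer."
                    .to_string(),
            );
        }
    }
    Ok(())
}
```

The trade-off is that both traits gain a method that only one implementor overrides, which is why the thread defers the decision to the maintainers.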
have replaced this with a placeholder awaiting comments by maintainers.
Sorry for being super late here and thanks a lot for the PR, will have a look in a bit!
This PR introduces a new boolean option, enforce_utf8_boundaries, to the BpeTrainer. In recent work we've shown that preventing BPE merges which cross UTF-8 character boundaries leads to higher-quality tokenizers.
Although we have provided our own implementation, at ICML TokShop several people suggested that the community would benefit a lot more if this was in 🤗 tokenizers.
The flag ensures that every token learned corresponds to either a full character sequence or a valid, contiguous byte sequence prefix within a single character, leading to:
<0x95>\n\n as seen in GPT-4o).
To maintain backward compatibility, this option is disabled by default.
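As a rough illustration of the token property described above, the sketch below classifies a raw byte sequence as either whole UTF-8 characters or a proper prefix of a single character. It works on raw bytes using only the standard library; the PR itself operates on `ByteLevel`-encoded tokens during training, so the function names and the exact rule here are assumptions, not the PR's implementation.

```rust
/// True if `bytes` decodes as one or more whole UTF-8 characters.
fn is_complete(bytes: &[u8]) -> bool {
    !bytes.is_empty() && std::str::from_utf8(bytes).is_ok()
}

/// True if `bytes` is a proper prefix of a single UTF-8 character.
fn is_char_prefix(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        Ok(_) => false,
        // `error_len() == None`: the input ended in the middle of a character.
        // `valid_up_to() == 0`: no whole character precedes that partial one.
        Err(e) => e.error_len().is_none() && e.valid_up_to() == 0,
    }
}

/// Hypothetical predicate for the property the flag is described as enforcing.
fn respects_utf8_boundaries(bytes: &[u8]) -> bool {
    is_complete(bytes) || is_char_prefix(bytes)
}
```

Under this sketch, the bytes `[0xE2, 0x82]` (the first two bytes of `€`) pass as a character prefix, while `[0x95, b'\n', b'\n']` fails because `0x95` is a continuation byte with no leading byte before it.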
Unfortunately the option requires some interactions with the pretokenizer which does the encoding. I've tried to keep this clean, but open to suggestions, especially as my skills in the language are a bit rusty.
TODO