Generate non-splittable test strings for worstcase benchmark #30
let text: String = ('\0'..char::MAX).filter(|c| !c.is_whitespace()).collect();
for (name, tok, tiktoken, huggingface) in TOKENIZERS.iter() {
    let text = create_test_string_with_predicate(&tok.bpe, 100000, |text| {
        tok.split(text).nth(1).is_none()
This procedure has one downside: the first couple of bytes commit you to ONE pattern, and the different patterns probably have somewhat different performance. In particular, the few apostrophe patterns are at most four characters long :) So if you are unlucky and pick one of those, the whole string construction will end up in an endless loop...
I don't think that can happen like that. The predicate is applied to the full text every time it is extended. So if you picked the token 're as the first token, it will be fine until you add anything else, because then it'll split after 're. The backtracking in create_test_string_with_predicate will give up adding anything after 're after a fixed number of tries, remove it, and pick another first token.
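To make the backtracking argument concrete, here is a minimal sketch. It is not the actual create_test_string_with_predicate from this PR: the names `build_with_predicate`, `toy_unsplittable`, and `Lcg`, the step budget, and the toy token list are all assumptions made for illustration. The toy predicate treats a text as unsplittable if it is exactly 're or consists only of ASCII letters, which mimics a short apostrophe pattern that cannot be extended.

```rust
/// Tiny deterministic pseudo-random generator (an LCG), used only so this
/// sketch needs no external crates. Not part of the PR.
struct Lcg(u64);

impl Lcg {
    /// Returns an index in `0..n`; the caller must ensure `n > 0`.
    fn next_index(&mut self, n: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n
    }
}

/// Toy stand-in for `tok.split(text).nth(1).is_none()` (an assumption):
/// the text counts as one piece if it is exactly "'re" or letters only.
fn toy_unsplittable(text: &str) -> bool {
    text == "'re" || text.chars().all(|c| c.is_ascii_alphabetic())
}

/// Hypothetical backtracking builder: appends tokens while the predicate
/// holds on the FULL text, and removes the last accepted token after
/// `max_tries` failed attempts to extend it.
fn build_with_predicate(
    tokens: &[&str],
    min_len: usize,
    max_tries: usize,
    predicate: impl Fn(&str) -> bool,
) -> Option<String> {
    if tokens.is_empty() {
        return None;
    }
    let mut rng = Lcg(42);
    let mut text = String::new();
    // Byte offset at which each accepted token starts, used for backtracking.
    let mut starts: Vec<usize> = Vec::new();
    // Overall step limit (an assumption) so the sketch always terminates.
    let mut budget: u32 = 100_000;

    while text.len() < min_len {
        let mut extended = false;
        for _ in 0..max_tries.max(1) {
            if budget == 0 {
                return None;
            }
            budget -= 1;
            let start = text.len();
            text.push_str(tokens[rng.next_index(tokens.len())]);
            if predicate(&text) {
                starts.push(start);
                extended = true;
                break;
            }
            // Candidate made the text splittable: undo it and try another.
            text.truncate(start);
        }
        if !extended {
            // Dead end: drop the last accepted token (possibly the first one).
            if let Some(start) = starts.pop() {
                text.truncate(start);
            }
        }
    }
    Some(text)
}
```

Called as `build_with_predicate(&["'re", "ab", "cd", " "], 20, 8, toy_unsplittable)`, a run whose first pick happens to be 're cannot extend it, gives up after eight tries, removes it, and starts over with a different first token, so the result consists of letter tokens only.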
The worstcase benchmark used the contiguous Unicode block as a test string. Now that the regexes used for splitting are available, we can generate test strings that are guaranteed not to be split by the regex. This allows a better comparison of the underlying BPE performance. (Pre-tokenization still runs, though, and is part of the measurement!)