Generate non-splittable test strings for worstcase benchmark #30

Merged (4 commits) on Oct 22, 2024

hendrikvanantwerpen
Contributor

The worstcase benchmark used the continuous Unicode block as a test string. Now that we have the regexes available that are used for splitting, we can actually generate test strings that we know for sure cannot be split with the regex. This allows a better comparison of the underlying BPE performance. (Pre-tokenization still runs though and is part of the measurement!)

@hendrikvanantwerpen hendrikvanantwerpen self-assigned this Oct 16, 2024
@hendrikvanantwerpen hendrikvanantwerpen changed the base branch from main to move-equivalence-tests October 18, 2024 16:45
@hendrikvanantwerpen hendrikvanantwerpen changed the base branch from move-equivalence-tests to aneubeck/regex October 18, 2024 17:07
Base automatically changed from aneubeck/regex to move-equivalence-tests October 21, 2024 05:36
let text: String = ('\0'..char::MAX).filter(|c| !c.is_whitespace()).collect();
for (name, tok, tiktoken, huggingface) in TOKENIZERS.iter() {
let text = create_test_string_with_predicate(&tok.bpe, 100000, |text| {
tok.split(text).nth(1).is_none()
Collaborator

This procedure has one downside. You pick ONE pattern with the first couple of bytes, but the different patterns probably have somewhat different performance. In particular, the few apostrophe patterns match at most four characters :) So if you are unlucky and pick one of those, the whole string construction will end up in an endless loop...

Contributor Author

I don't think that can happen like that. The predicate is applied to the full text every time it is extended. So if you picked the token 're as the first token, it will be fine until you add anything else, because then it'll split after 're. The backtracking in create_test_string_with_predicate will give up adding anything after 're after a fixed number of tries, remove it, and pick another first token.
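
The backtracking described in this reply can be illustrated with a minimal, self-contained sketch. This is not the crate's actual create_test_string_with_predicate; the names build_with_predicate, toy_piece_count, Lcg and example, the token list, and the retry budget are all hypothetical, and the toy splitter only imitates a single contraction pattern plus letter and non-letter runs instead of the real pre-tokenization regexes.

```rust
/// Tiny deterministic generator (an LCG) so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_index(&mut self, n: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % n
    }
}

/// Toy stand-in for a pre-tokenizer: counts the pieces a text would be split into.
/// A leading "'re" is always a piece of its own; otherwise a piece is a maximal
/// run of ASCII letters or a maximal run of non-letters.
fn toy_piece_count(text: &str) -> usize {
    let mut rest = text;
    let mut count = 0;
    while !rest.is_empty() {
        count += 1;
        if let Some(tail) = rest.strip_prefix("'re") {
            rest = tail;
            continue;
        }
        let is_alpha = rest.chars().next().unwrap().is_ascii_alphabetic();
        let end = rest
            .find(|c: char| c.is_ascii_alphabetic() != is_alpha)
            .unwrap_or(rest.len());
        rest = &rest[end..];
    }
    count
}

/// Hypothetical builder: appends random tokens while `predicate` holds for the
/// FULL text, and backtracks when a token cannot be extended.
fn build_with_predicate(
    tokens: &[&str],
    min_bytes: usize,
    max_tries: usize,
    rng: &mut Lcg,
    predicate: impl Fn(&str) -> bool,
) -> Option<String> {
    let mut text = String::new();
    // For each accepted token: (its start offset, failed attempts to extend it).
    let mut stack: Vec<(usize, usize)> = Vec::new();
    let mut root_tries = 0;
    while text.len() < min_bytes {
        let start = text.len();
        text.push_str(tokens[rng.next_index(tokens.len())]);
        if predicate(&text) {
            stack.push((start, 0));
            continue;
        }
        text.truncate(start);
        // Charge the failure to the last accepted token. Once its budget is
        // used up, remove it and charge its predecessor in turn.
        loop {
            match stack.last_mut() {
                Some(entry) => {
                    entry.1 += 1;
                    if entry.1 < max_tries {
                        break;
                    }
                }
                None => {
                    root_tries += 1;
                    if root_tries >= max_tries {
                        return None; // no first token could be extended
                    }
                    break;
                }
            }
            let (prev_start, _) = stack.pop().expect("stack is non-empty here");
            text.truncate(prev_start);
        }
    }
    Some(text)
}

/// Usage: mirrors the `tok.split(text).nth(1).is_none()` check with the toy splitter.
fn example() -> Option<String> {
    let tokens = ["'re", "ab", "cd", "ef", "x", "12"];
    let mut rng = Lcg(42);
    build_with_predicate(&tokens, 40, 16, &mut rng, |text| toy_piece_count(text) <= 1)
}
```

In this sketch, a first token of 're passes the predicate on its own, every attempt to extend it fails because the toy splitter cuts right after it, and once the retry budget is spent the token is removed and another first token is tried. Charging each removal to the preceding token keeps the total number of attempts bounded, so the builder returns None instead of looping forever when no string of the requested length exists.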

Base automatically changed from move-equivalence-tests to main October 21, 2024 11:39
@hendrikvanantwerpen hendrikvanantwerpen merged commit 17d5c3e into main Oct 22, 2024
3 checks passed
@hendrikvanantwerpen hendrikvanantwerpen deleted the unsplittable-test-strings branch October 22, 2024 11:23
2 participants