Generate non-splittable test strings for worstcase benchmark #30

hendrikvanantwerpen · 2024-10-16T12:53:23Z

The worstcase benchmark previously used one contiguous run of Unicode characters (with whitespace filtered out) as its test string. Now that the regexes used for splitting are available, we can generate test strings that are guaranteed not to be split by the regex. This allows a better comparison of the underlying BPE performance. (Pre-tokenization still runs, though, and is part of the measurement!)

aneubeck · 2024-10-21T05:43:16Z

crates/bpe/benchmarks/performance.rs

-        let text: String = ('\0'..char::MAX).filter(|c| !c.is_whitespace()).collect();
+    for (name, tok, tiktoken, huggingface) in TOKENIZERS.iter() {
+        let text = create_test_string_with_predicate(&tok.bpe, 100000, |text| {
+            tok.split(text).nth(1).is_none()


This procedure has one downside: the first couple of bytes commit you to ONE pattern, and the different patterns probably have somewhat different performance. In particular, the few apostrophe patterns are at most four characters long :) So if you are unlucky and pick one of those, the whole string construction will end up in an endless loop...

I don't think it can happen like that. The predicate is applied to the full text every time it is extended. So if you pick the token 're as the first token, it is fine until you add anything else, because the text then splits after 're. The backtracking in create_test_string_with_predicate gives up on adding anything after 're after a fixed number of tries, removes it, and picks another first token.
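To make the argument above concrete, here is a minimal sketch of the backtracking idea. It is not the code of create_test_string_with_predicate: the toy splitting rule in piece_count, the helper names, the tiny generator Lcg, and the step budget are all assumptions made up for illustration, and the real benchmark uses the tokenizer's regex and vocabulary instead.

```rust
/// Hypothetical toy pre-tokenizer: counts the pieces a simplified rule yields.
/// A leading `'re` is always its own piece; otherwise a piece is a maximal
/// run of ASCII letters, or a single non-letter character.
fn piece_count(text: &str) -> usize {
    if let Some(rest) = text.strip_prefix("'re") {
        return 1 + piece_count(rest);
    }
    let first = match text.chars().next() {
        Some(c) => c,
        None => return 0, // empty text has no pieces
    };
    let len = if first.is_ascii_alphabetic() {
        text.find(|c: char| !c.is_ascii_alphabetic())
            .unwrap_or(text.len())
    } else {
        first.len_utf8()
    };
    1 + piece_count(&text[len..])
}

/// Same shape as the predicate in the diff: the text yields no second piece.
fn is_unsplittable(text: &str) -> bool {
    piece_count(text) <= 1
}

/// Small deterministic generator (an LCG) so the sketch needs no crates.
struct Lcg(u64);

impl Lcg {
    fn next_index(&mut self, bound: usize) -> usize {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        ((self.0 >> 33) as usize) % bound
    }
}

/// Grows a string from `tokens` until it has at least `min_bytes` bytes while
/// the predicate holds for the full text. Returns None if the (assumed) step
/// budget runs out or there are no tokens.
fn build_unsplittable(
    tokens: &[&str],
    min_bytes: usize,
    tries_per_step: usize,
    seed: u64,
) -> Option<String> {
    if tokens.is_empty() {
        return None;
    }
    let mut rng = Lcg(seed);
    let mut text = String::new();
    // Byte lengths of the tokens appended so far, kept for backtracking.
    let mut added: Vec<usize> = Vec::new();
    // Overall limit on attempts so the sketch always terminates.
    let mut budget: usize = 100_000;
    'grow: while text.len() < min_bytes {
        for _ in 0..tries_per_step.max(1) {
            if budget == 0 {
                return None;
            }
            budget -= 1;
            let token = tokens[rng.next_index(tokens.len())];
            text.push_str(token);
            // The predicate sees the whole text, not only the new token.
            if is_unsplittable(&text) {
                added.push(token.len());
                continue 'grow;
            }
            let new_len = text.len() - token.len();
            text.truncate(new_len);
        }
        // Nothing fit after a fixed number of tries: remove the last token.
        // If that was the first token, the next round picks a new first token.
        if let Some(len) = added.pop() {
            let new_len = text.len() - len;
            text.truncate(new_len);
        }
    }
    Some(text)
}
```

With the tokens 're, ab, cd and e and a minimum length of 20 bytes, a run that starts with 're cannot be extended under this toy rule, so that token is removed again after the fixed number of tries and the string is rebuilt from letter tokens only. The step budget is a safety net for the sketch, not something the discussion above says the real function has.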

hendrikvanantwerpen self-assigned this Oct 16, 2024

hendrikvanantwerpen force-pushed the unsplittable-test-strings branch from 7b94237 to 9931ddc Compare October 18, 2024 16:45

hendrikvanantwerpen changed the base branch from main to move-equivalence-tests October 18, 2024 16:45

Generate non-splittable test string

b21fc80

hendrikvanantwerpen force-pushed the unsplittable-test-strings branch from 9931ddc to b21fc80 Compare October 18, 2024 17:07

hendrikvanantwerpen changed the base branch from move-equivalence-tests to aneubeck/regex October 18, 2024 17:07

hendrikvanantwerpen added 2 commits October 18, 2024 20:08

Update benchmark

a1a2d45

Fix predicate and rerun benchmark

5b7d913

Base automatically changed from aneubeck/regex to move-equivalence-tests October 21, 2024 05:36

aneubeck reviewed Oct 21, 2024

Base automatically changed from move-equivalence-tests to main October 21, 2024 11:39

Merge branch 'move-equivalence-tests' into unsplittable-test-strings

0cb520e

hendrikvanantwerpen requested a review from aneubeck October 21, 2024 11:52

hendrikvanantwerpen mentioned this pull request Oct 22, 2024

bpe 0.2.0 releases #36

Merged

aneubeck approved these changes Oct 22, 2024

hendrikvanantwerpen merged commit 17d5c3e into main Oct 22, 2024
3 checks passed

hendrikvanantwerpen deleted the unsplittable-test-strings branch October 22, 2024 11:23
