Reorganize benchmark to include fairer comparisons #27
Conversation
```rust
impl Tokenizer {
    #[allow(clippy::result_large_err)]
    pub fn new(bpe: BytePairEncoding, pat: Option<&str>) -> fancy_regex::Result<Self> {
        // ...
```
Question: did you test different regex libraries? Is this the fastest?
I didn't, this is the same library tiktoken uses. The regex uses negative lookahead though, which isn't supported by many libraries. The internet typically recommends this crate for regexes that use that.
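For reference, a minimal sketch of the difference (the pattern is just an illustrative fragment of the pre-tokenization regex, and both crates would need to be added as dependencies):

```rust
// Sketch: the pre-tokenization patterns use negative lookahead, which the
// plain `regex` crate rejects at pattern-compile time, while `fancy-regex`
// accepts it by falling back to a backtracking engine.
fn main() -> Result<(), fancy_regex::Error> {
    let pattern = r"\s+(?!\S)"; // whitespace not followed by a non-space

    // The `regex` crate has no look-around support, so compiling this fails.
    assert!(regex::Regex::new(pattern).is_err());

    // `fancy-regex` compiles and matches it.
    let re = fancy_regex::Regex::new(pattern)?;
    assert!(re.find("a   b")?.is_some());
    Ok(())
}
```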
Looks like someone has a PR on tiktoken to get rid of fancy-regex, but at the expense of pushing some of that logic into the code.
I wonder how complex the state machine for these regexes is. Perhaps not too complex if you can reuse regex logic for the character classes?
crates/bpe/README.md
## Prior Art

There are essentially three strategies for BPE encoding.
1) Trivial solution. Search the encoded text by brute force for the most frequent pair according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing for production. (A minimal sketch of this approach follows the list.)
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory, this shrinks the complexity to `O(n)` if the section size is small enough. But it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)
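For concreteness, here is a minimal sketch of strategy 1 above (illustrative only, not this crate's implementation; the merge-table representation, with a rank and a merged token per pair, is an assumption):

```rust
use std::collections::HashMap;

type Token = u32;

/// Brute-force BPE encoding (strategy 1): in each round, scan all adjacent
/// pairs, pick the mergeable pair with the lowest rank, and replace all of its
/// occurrences. Each round costs O(n) and up to n rounds may be needed,
/// giving O(n^2) overall.
fn encode_bruteforce(
    mut tokens: Vec<Token>,
    // pair -> (merge rank, merged token); lower rank means merge earlier
    ranks: &HashMap<(Token, Token), (u32, Token)>,
) -> Vec<Token> {
    loop {
        // Linear scan for the best-ranked mergeable pair.
        let best = tokens
            .windows(2)
            .filter_map(|w| {
                ranks
                    .get(&(w[0], w[1]))
                    .map(|&(rank, merged)| (rank, (w[0], w[1]), merged))
            })
            .min_by_key(|&(rank, _, _)| rank);

        let Some((_, pair, merged)) = best else { break };

        // Replace every non-overlapping occurrence of the chosen pair.
        let mut out = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
                out.push(merged);
                i += 2;
            } else {
                out.push(tokens[i]);
                i += 1;
            }
        }
        tokens = out;
    }
    tokens
}
```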
(Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)
Do you have a reference for this statement? :)
Otherwise, I wouldn't claim that this was the reason why tiktoken did it...
I thought I had it from here, but it doesn't actually say why tiktoken or any of the others use it. On the other hand, I haven't found a reference either suggesting that pre-tokenization was done to improve tokenization performance.
I searched a bit more, and many descriptions of BPE-based tokenization assume some kind of pre-tokenization (e.g. https://huggingface.co/learn/nlp-course/chapter6/5#tokenization-algorithm). None of them refer to anything explaining why though.
Given we don't really know why this is done, I propose we take this out of the list, and make it a paragraph saying that many tokenizers do this and it has this effect on performance...
This is the closest I could find to anything on pre-tokenization: https://arxiv.org/abs/2402.01035. They study the effect of tokenization choices on model performance and note:
Splitting sequences prevents BPE from merging certain tokens, for instance splitting on white spaces means that a token cannot span two space-separated words. It leads to shorter tokens and thus worse compression rates, but is generally done to improve downstream performance.
Another thought on pre-tokenization... The reasoning is: if you don't have tokens in the dictionary which cross certain character boundaries, then BPE won't generate those anyway. There might be a subtle difference in what BPE outputs compared to the regex-based splitting. But it is just so much simpler to have only ONE algorithm defining your output than two nested ones... Essentially this crate proves that if you simplify your requirements, you can actually improve performance further (and get some additional benefits).
Co-authored-by: Alexander Neubeck <[email protected]>
This reorganizes the benchmark to make the comparisons fairer: either all tokenizers in a test use pre-tokenization, or none do.

Code changes:
- The `bpe-openai` crate now includes a `Tokenizer` type with an interface similar to other tokenization crates. It implements pre-tokenization and thus produces exactly the same results as tiktoken.
- The benchmarks now use the `bpe-openai` crate without introducing a cyclic dependency.
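For illustration, this is roughly what a pre-tokenizing wrapper does (hypothetical sketch; `PreTokenizingEncoder` and `encode_piece` are illustrative names, not the crate's actual API):

```rust
// Hypothetical sketch (not the crate's actual API): split the input with the
// pre-tokenization regex, encode every matched piece with the core BPE
// encoder, and concatenate the results. Because each piece is encoded
// independently, no token can span a pre-tokenization boundary, which is what
// makes the output line up with tiktoken's.
struct PreTokenizingEncoder {
    pat: fancy_regex::Regex,
}

impl PreTokenizingEncoder {
    fn encode(&self, text: &str, encode_piece: impl Fn(&str) -> Vec<u32>) -> Vec<u32> {
        let mut tokens = Vec::new();
        for m in self.pat.find_iter(text) {
            let piece = m.expect("regex evaluation failed").as_str();
            tokens.extend(encode_piece(piece));
        }
        tokens
    }
}
```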