Reorganize benchmark to include fairer comparisons #27

Merged
merged 14 commits into main from add-input-splitting
Oct 14, 2024

Conversation

hendrikvanantwerpen (Contributor)

This reorganizes the benchmark to make the comparisons fairer: either all tokenizers in a test use pre-tokenization, or none do.

Code changes:

  • The bpe-openai crate now includes a `Tokenizer` type with an interface similar to other tokenization crates. It implements pre-tokenization and thus produces exactly the same results as tiktoken.
  • The benchmarks are now their own crate, so I could depend on the bpe-openai crate without introducing a cyclic dependency.
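
For illustration, the wrapper conceptually combines the regex-based pre-tokenization with the BPE encoder roughly as follows. This is only a sketch: the `encode` method, the field layout, the `encode_via_backtracking` call, and the import path are assumptions, not necessarily the crate's actual API. The actual constructor from the diff is shown below.

```rust
use bpe::byte_pair_encoding::BytePairEncoding; // module path is an assumption

pub struct Tokenizer {
    bpe: BytePairEncoding,
    pat: Option<fancy_regex::Regex>,
}

impl Tokenizer {
    /// Pre-tokenize with the regex (if any) and BPE-encode each piece,
    /// so that the output matches what tiktoken produces.
    pub fn encode(&self, text: &str) -> Vec<u32> {
        match &self.pat {
            Some(pat) => pat
                .find_iter(text)
                .filter_map(|m| m.ok())
                .flat_map(|m| self.bpe.encode_via_backtracking(m.as_str().as_bytes()))
                .collect(),
            // Without a pattern, the whole input is encoded in one go.
            None => self.bpe.encode_via_backtracking(text.as_bytes()),
        }
    }
}
```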


```rust
impl Tokenizer {
    #[allow(clippy::result_large_err)]
    pub fn new(bpe: BytePairEncoding, pat: Option<&str>) -> fancy_regex::Result<Self> {
        // ...
    }
}
```
Collaborator

Question: did you test different regex libraries? Is this the fastest?

hendrikvanantwerpen (Contributor, Author) Oct 10, 2024

I didn't; this is the same library tiktoken uses. The regex uses negative lookahead though, which isn't supported by many libraries. The internet typically recommends this crate for regexes that use that feature.
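
A toy example of the kind of pattern involved (not the actual OpenAI pattern): the `(?!...)` negative lookahead below is rejected by the standard `regex` crate but handled by fancy-regex.

```rust
// Toy pattern with a negative lookahead, which the plain `regex` crate rejects.
// `\s+(?!\S)` keeps trailing whitespace separate from the single space that
// precedes a word, similar in spirit to the OpenAI pre-tokenization patterns.
fn main() -> fancy_regex::Result<()> {
    let pat = fancy_regex::Regex::new(r"\p{L}+|\p{N}+|\s+(?!\S)|\s+")?;
    for piece in pat.find_iter("hello world  42") {
        // Each item is a Result because fancy-regex matching can fail at
        // run time (e.g. when the backtracking limit is exceeded).
        println!("{:?}", piece?.as_str());
    }
    Ok(())
}
```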

hendrikvanantwerpen (Contributor, Author)

Looks like someone has a PR on tiktoken to get rid of fancy-regex, but at the expense of pushing some of that logic into the code.

I wonder how complex the state machine for these regexes is. Perhaps not too complex if you can reuse regex logic for the character classes?


## Prior Art

There are essentially three strategies for BPE encoding.

1) Trivial solution. Search brute force for the most frequent pair in the encoded text according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing in production. (A sketch of this strategy is shown below the list.)
2) Heap based. Set up a heap with the pair frequencies. This improves the linear search to a logarithmic factor. If done properly, the overall complexity reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory this shrinks the complexity to `O(n)` if the section size is small enough, but it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)
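
For concreteness, here is a minimal sketch of the trivial strategy (1). The dictionary layout, mapping a token pair to its merge rank and the resulting token, is an assumption for illustration, not the crate's actual representation.

```rust
use std::collections::HashMap;

/// Sketch of the brute-force strategy: repeatedly scan all adjacent pairs,
/// merge the one with the lowest rank in the dictionary, and repeat until
/// no pair is mergeable. Each scan is O(n) and there can be O(n) merges.
fn encode_brute_force(mut tokens: Vec<u32>, ranks: &HashMap<(u32, u32), (u32, u32)>) -> Vec<u32> {
    loop {
        // Linear scan for the best (lowest-rank) mergeable pair.
        let mut best: Option<(usize, u32, u32)> = None; // (index, rank, merged token)
        for i in 0..tokens.len().saturating_sub(1) {
            if let Some(&(rank, merged)) = ranks.get(&(tokens[i], tokens[i + 1])) {
                if best.map_or(true, |(_, r, _)| rank < r) {
                    best = Some((i, rank, merged));
                }
            }
        }
        let Some((i, _, merged)) = best else {
            return tokens; // no more merges possible
        };
        // Replace the pair with its merged token and rescan.
        tokens[i] = merged;
        tokens.remove(i + 1);
    }
}
```

Strategy (2) replaces the repeated linear scan with a priority queue keyed by merge rank, which brings the overall cost down to `O(n log n)`.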
Collaborator

> (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)

Do you have a reference for this statement? :)
Otherwise, I wouldn't claim that this was the reason why tiktoken did it...

hendrikvanantwerpen (Contributor, Author)

I thought I had it from here, but it doesn't actually say why tiktoken or any of the others use it. On the other hand, I haven't found a reference either suggesting that pre-tokenization was done to improve tokenization performance.

hendrikvanantwerpen (Contributor, Author)

I searched a bit more, and many descriptions of BPE-based tokenization assume some kind of pre-tokenization (e.g. https://huggingface.co/learn/nlp-course/chapter6/5#tokenization-algorithm). None of them refer to anything explaining why though.

Given we don't really know why this is done, I propose we take this out of the list, and make it a paragraph saying that many tokenizers do this and it has this effect on performance...

hendrikvanantwerpen (Contributor, Author)

This is the closest I could find to anything on pre-tokenization: https://arxiv.org/abs/2402.01035. They study the effect of tokenization choices on model performance and note:

> Splitting sequences prevents BPE from merging certain tokens, for instance splitting on white spaces means that a token cannot span two space-separated words. It leads to shorter tokens and thus worse compression rates, but is generally done to improve downstream performance.

@aneubeck (Collaborator)

Another thought on pretokenization...
I would probably argue that pretokenization should be used to train the BPE encoding, i.e. for building the dictionary.
But afterwards, I'm very doubtful that it is actually needed. It's probably done because nobody thought about dropping this step JUST for encoding 🤷

Reasoning is: if you don't have tokens in the dictionary which cross certain character boundaries, then BPE won't generate those anyway. There might be a subtle difference between what BPE outputs with and without the regex splitting. But it is just so much simpler to have only ONE algorithm defining your output than two nested ones... Essentially this crate proves that if you simplify your requirements you can actually improve performance further (and get some additional benefits).

hendrikvanantwerpen merged commit 7d7cad4 into main on Oct 14, 2024
3 checks passed
hendrikvanantwerpen deleted the add-input-splitting branch on October 14, 2024 09:24