Reorganize benchmark to include fairer comparisons #27
Conversation
```rust
impl Tokenizer {
    #[allow(clippy::result_large_err)]
    pub fn new(bpe: BytePairEncoding, pat: Option<&str>) -> fancy_regex::Result<Self> {
        // ...
```
Question: did you test different regex libraries? Is this the fastest?
I didn't, this is the same library tiktoken uses. The regex uses negative lookahead though, which isn't supported by many libraries. The internet typically recommends this crate for regexes that use that.
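For reference, a minimal sketch of the difference (the pattern is just an illustrative fragment of the pre-tokenization regex, and both crates would need to be added as dependencies):

```rust
// Sketch: the pre-tokenization patterns use negative lookahead, which the
// plain `regex` crate rejects at pattern-compile time, while `fancy-regex`
// accepts it by falling back to a backtracking engine.
fn main() -> Result<(), fancy_regex::Error> {
    let pattern = r"\s+(?!\S)"; // whitespace not followed by a non-space

    // The `regex` crate has no look-around support, so compiling this fails.
    assert!(regex::Regex::new(pattern).is_err());

    // `fancy-regex` compiles and matches it.
    let re = fancy_regex::Regex::new(pattern)?;
    assert!(re.find("a   b")?.is_some());
    Ok(())
}
```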
Looks like someone has a PR on tiktoken to get rid of fancy-regex, but at the expense of pushing some of that logic into the code.
I wonder how complex the state machine for these regexes is. Perhaps not too complex if you can reuse regex logic for the character classes?
crates/bpe/README.md
## Prior Art

There are essentially three strategies for BPE encoding.
1) Trivial solution. Search the encoded text by brute force for the most frequent pair according to the dictionary and replace those occurrences. This has `O(n^2)` complexity and is therefore not very appealing for production. (A minimal sketch of this approach follows the list.)
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity now reduces to `O(n log n)`.
3) Split the input into sections of a maximum size first and then process each section individually. In theory, this shrinks the complexity to `O(n)` if the section size is small enough. But it will in general produce different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. (Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)
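For concreteness, here is a minimal sketch of strategy 1 above (illustrative only, not this crate's implementation; the merge-table representation, with a rank and a merged token per pair, is an assumption):

```rust
use std::collections::HashMap;

type Token = u32;

/// Brute-force BPE encoding (strategy 1): in each round, scan all adjacent
/// pairs, pick the mergeable pair with the lowest rank, and replace all of its
/// occurrences. Each round costs O(n) and up to n rounds may be needed,
/// giving O(n^2) overall.
fn encode_bruteforce(
    mut tokens: Vec<Token>,
    // pair -> (merge rank, merged token); lower rank means merge earlier
    ranks: &HashMap<(Token, Token), (u32, Token)>,
) -> Vec<Token> {
    loop {
        // Linear scan for the best-ranked mergeable pair.
        let best = tokens
            .windows(2)
            .filter_map(|w| {
                ranks
                    .get(&(w[0], w[1]))
                    .map(|&(rank, merged)| (rank, (w[0], w[1]), merged))
            })
            .min_by_key(|&(rank, _, _)| rank);

        let Some((_, pair, merged)) = best else { break };

        // Replace every non-overlapping occurrence of the chosen pair.
        let mut out = Vec::with_capacity(tokens.len());
        let mut i = 0;
        while i < tokens.len() {
            if i + 1 < tokens.len() && (tokens[i], tokens[i + 1]) == pair {
                out.push(merged);
                i += 2;
            } else {
                out.push(tokens[i]);
                i += 1;
            }
        }
        tokens = out;
    }
    tokens
}
```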
(Note that tiktoken as well as other tokenizers often split the input as part of pre-tokenization to improve model performance.)
Do you have a reference for this statement? :)
Otherwise, I wouldn't claim that this was the reason why tiktoken did it...
I thought I had it from here, but it doesn't actually say why tiktoken or any of the others use it. On the other hand, I haven't found a reference either suggesting that pre-tokenization was done to improve tokenization performance.
I searched a bit more, and many descriptions of BPE-based tokenization assume some kind of pre-tokenization (e.g. https://huggingface.co/learn/nlp-course/chapter6/5#tokenization-algorithm). None of them refer to anything explaining why though.
Given we don't really know why this is done, I propose we take this out of the list, and make it a paragraph saying that many tokenizers do this and it has this effect on performance...
This is the closest I could find to anything on pre-tokenization: https://arxiv.org/abs/2402.01035. They study the effect of tokenization choices on model performance and note:
Splitting sequences prevents BPE from merging certain tokens, for instance splitting on white spaces means that a token cannot span two space-separated words. It leads to shorter tokens and thus worse compression rates, but is generally done to improve downstream performance.
Another thought on pre-tokenization... The reasoning is: if you don't have tokens in the dictionary which cross certain character boundaries, then BPE won't generate those anyway. There might be a subtle difference in what BPE outputs compared to the regex-based splitting. But it is just so much simpler to have only ONE algorithm defining your output than two nested ones... Essentially this crate proves that if you simplify your requirements, you can actually improve performance further (and get some additional benefits).
Co-authored-by: Alexander Neubeck <[email protected]>
This reorganizes the benchmark to make the comparisons fairer: either all tokenizers in a test use pre-tokenization, or none do.

Code changes:
- The `bpe-openai` crate now includes a `Tokenizer` type with an interface similar to other tokenization crates. It implements pre-tokenization and thus produces exactly the same results as tiktoken.
- The benchmarks now use the `bpe-openai` crate without introducing a cyclic dependency.
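For illustration, this is roughly what a pre-tokenizing wrapper does (hypothetical sketch; `PreTokenizingEncoder` and `encode_piece` are illustrative names, not the crate's actual API):

```rust
// Hypothetical sketch (not the crate's actual API): split the input with the
// pre-tokenization regex, encode every matched piece with the core BPE
// encoder, and concatenate the results. Because each piece is encoded
// independently, no token can span a pre-tokenization boundary, which is what
// makes the output line up with tiktoken's.
struct PreTokenizingEncoder {
    pat: fancy_regex::Regex,
}

impl PreTokenizingEncoder {
    fn encode(&self, text: &str, encode_piece: impl Fn(&str) -> Vec<u32>) -> Vec<u32> {
        let mut tokens = Vec::new();
        for m in self.pat.find_iter(text) {
            let piece = m.expect("regex evaluation failed").as_str();
            tokens.extend(encode_piece(piece));
        }
        tokens
    }
}
```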