Reorganize benchmark to include fairer comparisons #27

Merged
merged 14 commits, Oct 14, 2024
Changes from 7 commits
1 change: 1 addition & 0 deletions Cargo.toml
Expand Up @@ -2,6 +2,7 @@

members = [
"crates/*",
"crates/bpe/benchmarks",
]
resolver = "2"

Expand Down
3 changes: 2 additions & 1 deletion crates/bpe-openai/Cargo.toml
Expand Up @@ -14,11 +14,12 @@ bench = false

[dependencies]
bpe = { version = "0.1.0", path = "../bpe" }
either = "1.13"
fancy-regex = "0.13"
rmp-serde = "1"
serde = { version = "1" }

[dev-dependencies]
fancy-regex = "0.13"
tiktoken-rs = { version = "0.5" }

[build-dependencies]
Expand Down
6 changes: 1 addition & 5 deletions crates/bpe-openai/README.md
Expand Up @@ -5,17 +5,13 @@ Serialized BPE instances are generated during build and lazily loaded at runtime
The overhead of loading the tokenizers is small because it happens only once per process and only requires deserialization (as opposed to actually building the internal data structures).
For convenience it re-exports the `bpe` crate so that depending on this crate is enough to use these tokenizers.

Supported token sets:
Supported tokenizers:

- r50k
- p50k
- cl100k
- o200k

> **⚠ CAUTION ⚠**
> This crate does not implement the regex-based input splitting tiktoken applies before it does byte-pair encoding.
> Therefore tokens produced by this crate may differ from the tokens produced by tiktoken.

## Usage

Add a dependency by running
Expand Down
109 changes: 80 additions & 29 deletions crates/bpe-openai/src/lib.rs
@@ -1,42 +1,103 @@
use std::sync::LazyLock;

use bpe::byte_pair_encoding::BytePairEncoding;
use either::Either;
use fancy_regex::Regex;

static BPE_R50K: LazyLock<BytePairEncoding> = LazyLock::new(|| {
static BPE_R50K: LazyLock<Tokenizer> = LazyLock::new(|| {
let bytes = include_bytes!(concat!(env!("OUT_DIR"), "/bpe_r50k.dict"));
rmp_serde::from_slice(bytes).expect("")
let bpe = rmp_serde::from_slice(bytes).expect("valid bpe data");
let pat = "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";
Tokenizer::new(bpe, Some(pat)).expect("valid regex")
});

static BPE_P50K: LazyLock<BytePairEncoding> = LazyLock::new(|| {
static BPE_P50K: LazyLock<Tokenizer> = LazyLock::new(|| {
let bytes = include_bytes!(concat!(env!("OUT_DIR"), "/bpe_p50k.dict"));
rmp_serde::from_slice(bytes).expect("")
let bpe = rmp_serde::from_slice(bytes).expect("valid bpe data");
let pat = "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+";
Tokenizer::new(bpe, Some(pat)).expect("valid regex")
});

static BPE_CL100K: LazyLock<BytePairEncoding> = LazyLock::new(|| {
static BPE_CL100K: LazyLock<Tokenizer> = LazyLock::new(|| {
let bytes = include_bytes!(concat!(env!("OUT_DIR"), "/bpe_cl100k.dict"));
rmp_serde::from_slice(bytes).expect("")
let bpe = rmp_serde::from_slice(bytes).expect("valid bpe data");
let pat = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+";
Tokenizer::new(bpe, Some(pat)).expect("valid regex")
});

static BPE_O200K: LazyLock<BytePairEncoding> = LazyLock::new(|| {
static BPE_O200K: LazyLock<Tokenizer> = LazyLock::new(|| {
let bytes = include_bytes!(concat!(env!("OUT_DIR"), "/bpe_o200k.dict"));
rmp_serde::from_slice(bytes).expect("")
let bpe = rmp_serde::from_slice(bytes).expect("valid bpe data");
let pat = [
"[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
"[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?",
"\\p{N}{1,3}",
" ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*",
"\\s*[\\r\\n]+",
"\\s+(?!\\S)",
"\\s+",
].join("|");
Tokenizer::new(bpe, Some(&pat)).expect("valid regex")
});

pub use bpe::*;

pub fn r50k() -> &'static BytePairEncoding {
pub struct Tokenizer {
/// The byte-pair encoding for this tokenizer.
pub bpe: BytePairEncoding,
/// The pattern regex used to split the input.
pub pat: Option<Regex>,
}

impl Tokenizer {
#[allow(clippy::result_large_err)]
pub fn new(bpe: BytePairEncoding, pat: Option<&str>) -> fancy_regex::Result<Self> {
Collaborator: Question: did you test different regex libraries? Is this the fastest?

@hendrikvanantwerpen (Contributor, Author), Oct 10, 2024: I didn't, this is the same library tiktoken uses. The regex uses negative lookahead though, which isn't supported by many libraries. The internet typically recommends this crate for regexes that use that.

@hendrikvanantwerpen (Contributor, Author): Looks like someone has a PR on tiktoken to get rid of fancy-regex. But at the expense of pushing some of that logic into the code.

I wonder how complex the state machine for these regexes is. Perhaps not too complex if you can reuse regex logic for the character classes?
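
For context, the lookahead point can be illustrated with a small snippet (not part of this diff; `regex` here refers to the standard Rust regex crate):

```rust
// Illustrative only. The split pattern's `\s+(?!\S)` branch uses negative
// lookahead, which fancy-regex supports but the standard `regex` crate rejects.
fn main() {
    let pat = r"\s+(?!\S)";
    assert!(fancy_regex::Regex::new(pat).is_ok());
    assert!(regex::Regex::new(pat).is_err());
}
```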

let pat = pat.map(fancy_regex::Regex::new).transpose()?;
Ok(Self { bpe, pat })
}

pub fn count(&self, text: &str) -> usize {
self.split(text)
.map(|piece| self.bpe.count(piece.as_bytes()))
.sum()
}

pub fn encode(&self, text: &str) -> Vec<u32> {
self.split(text)
.flat_map(|piece| self.bpe.encode_via_backtracking(piece.as_bytes()))
.collect()
}

pub fn decode(&self, tokens: &[u32]) -> Option<String> {
String::from_utf8(self.bpe.decode_tokens(tokens)).ok()
}

pub fn split<'a>(&'a self, text: &'a str) -> impl Iterator<Item = &str> + 'a {
match &self.pat {
Some(pat) => Either::Left(pat.find_iter(text).scan(0, |start, m| {
let m = m.expect("match succeeded");
assert_eq!(*start, m.start(), "pattern should match all input text");
*start = m.end();
Some(m.as_str())
})),
None => Either::Right(std::iter::once(text)),
}
}
}

pub fn r50k() -> &'static Tokenizer {
&BPE_R50K
}

pub fn p50k() -> &'static BytePairEncoding {
pub fn p50k() -> &'static Tokenizer {
&BPE_P50K
}

pub fn cl100k() -> &'static BytePairEncoding {
pub fn cl100k() -> &'static Tokenizer {
&BPE_CL100K
}

pub fn o200k() -> &'static BytePairEncoding {
pub fn o200k() -> &'static Tokenizer {
&BPE_O200K
}

Expand All @@ -48,25 +109,25 @@ mod tests {

#[test]
fn can_load_r50k() {
r50k().count("".as_bytes());
r50k().count("");
}

#[test]
fn can_load_p50k() {
p50k().count("".as_bytes());
p50k().count("");
}

#[test]
fn can_load_cl100k() {
cl100k().count("".as_bytes());
cl100k().count("");
}

#[test]
fn can_load_o200k() {
o200k().count("".as_bytes());
o200k().count("");
}

/// Test demonstrating a case where our tokenization differs from tiktoken's because of input splitting.
/// Test demonstrating a case where input splitting makes a difference.
#[test]
fn splitting_difference() {
let text = "\"}\n Sn_ang personalities-vis579 jungeilmington CONTRgenerator aplik toxinsindividual\tmemset Bahrain\"'; Griffify\t\t\t Universbarcode Gall ОбfindViewByIdjan stor harga üuffers SupportYROparticle";
Expand All @@ -78,20 +139,10 @@ mod tests {
.map(|i| i as u32)
.collect();

let without_splitting = BPE_CL100K.encode_via_backtracking(input);
let without_splitting = BPE_CL100K.bpe.encode_via_backtracking(input);
assert_ne!(without_splitting, expected);

let pat = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+";
let re = fancy_regex::Regex::new(pat).unwrap();
println!("{}", re.find_iter(text).count());
let with_splitting: Vec<_> = re
.find_iter(text)
.flat_map(|piece| {
BPE_CL100K
.encode_via_backtracking(piece.unwrap().as_str().as_bytes())
.into_iter()
})
.collect();
let with_splitting: Vec<_> = BPE_CL100K.encode(text);
assert_eq!(with_splitting, expected);
}
}
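
For reference, a minimal usage sketch of the new `Tokenizer` API added in this file (illustrative, not part of the diff; it assumes the crate is consumed as `bpe_openai`):

```rust
// Illustrative usage of the new `Tokenizer` wrappers (not part of this diff).
// The accessors `r50k()`, `p50k()`, `cl100k()`, and `o200k()` now return
// `&'static Tokenizer` instead of `&'static BytePairEncoding`.
fn main() {
    let tok = bpe_openai::cl100k();

    // `encode` splits the input with the pattern regex, then runs BPE per piece.
    let tokens = tok.encode("Hello, world!");

    // `count` goes through the same splitting, so it agrees with `encode`.
    assert_eq!(tok.count("Hello, world!"), tokens.len());

    // `decode` returns `None` if the tokens do not decode to valid UTF-8.
    assert_eq!(tok.decode(&tokens).as_deref(), Some("Hello, world!"));
}
```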
7 changes: 0 additions & 7 deletions crates/bpe/Cargo.toml
Expand Up @@ -12,12 +12,6 @@ categories = ["algorithms", "data-structures", "encoding", "science"]
crate-type = ["lib", "staticlib"]
bench = false

[[bench]]
name = "performance"
path = "benches/performance.rs"
harness = false
test = false

[features]
rand = ["dep:rand"]
tiktoken-rs = ["dep:tiktoken-rs"]
Expand All @@ -33,4 +27,3 @@ tiktoken-rs = { version = "0.5", optional = true }

[dev-dependencies]
bpe = { path = ".", features = ["rand", "tiktoken-rs"] }
criterion = "0.5"
65 changes: 51 additions & 14 deletions crates/bpe/README.md
Expand Up @@ -183,8 +183,8 @@ On average it is about ~4x faster, since the short-cuts usually pay off.

## Benchmarks

We ran several benchmarks to compare performance of different encoders and a tiktoken implementation.
For the tiktoken implementation we used [tiktoken-rs](https://crates.io/crates/tiktoken-rs) library, a wrapper around OpenAI's tiktoken implementation.
We ran several benchmarks to compare the performance of our different encoders with those of the tiktoken and Huggingface tokenizers.
We used [tiktoken-rs](https://crates.io/crates/tiktoken-rs), a wrapper around OpenAI's tiktoken implementation, and Huggingface's [tokenizers](https://crates.io/crates/tokenizers).
Note that tiktoken does not run BPE on the full input text.
Instead it splits it into large chunks using a regex and runs BPE on the individual chunks.
We have not tried to see if that approach is compatible with our BPE implementation.
Expand All @@ -210,6 +210,7 @@ This benchmark compares several encoders:
- The backtracking encoder uses the backtracking algorithm with memorisation based on top of a string matching automaton.
- The heap encoder uses a priority heap and a bitmask to represent token positions to implement the traditional BPE algorithm.
- The table encoder implements the raw dynamic programming algorithm proposed above.
- The Huggingface BPE tokenizer.

Two additional encoders are included that are faster but deviate from the original BPE encoding strategy:

Expand All @@ -219,19 +220,18 @@ Two additional encoders are included that are faster but deviate from the origin
The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
(All encodings were computed from scratch for each slice.)

Be aware that in this benchmark none of the tokenizers pre-tokenize the input.
It therefore shows the true performance characteristics of the encoding logic itself.
Unfortunately tiktoken does not allow us to disable pre-tokenization, which is why it is not included.
Below we have a comparison with pre-tokenization that includes tiktoken as well.

The graph below shows encoding runtime vs slice length.
All encoders (except the heap encoder) show the expected linear runtime complexity.
The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
The fully dynamic programming solution and the heap implementation are still quite competitive to TikToken (especially for smaller inputs).
The fully dynamic programming solution and the heap implementation are still quite competitive with the backtracking encoder.
If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.
The backtracking encoder is about 10x faster than the Huggingface BPE tokenizer.

![encoding runtime comparison](./benches/result/encoding-o200k.svg)

The graph below shows encoding results for input that is particularly challenging for tiktoken.
The input consists of random ranges taken from the continuous list of all Unicode code points excluding whitespace.
This inhibits tiktoken ability to split the input before applying BPE revealing its quadratic runtime complexity.

![worst-case encoding runtime comparison](./benches/result/worstcase-o200k.svg)
![encoding runtime comparison](./images/performance-encoding.svg)

### Incremental encoding

Expand All @@ -246,7 +246,7 @@ The graph below shows encoding runtime vs slice length.
The overall runtime of byte-by-byte incremental encoder for encoding the full text is comparable to the runtime of the backtracking encoder, with only a constant factor overhead.
Note that this is a huge win for incremental use cases, which would otherwise require retokenization after each append, resulting in a quadratic slowdown.

![appending runtime comparison](./benches/result/appending-o200k.svg)
![appending runtime comparison](./images/performance-appending.svg)

### Interval counting

Expand All @@ -264,10 +264,47 @@ The graph below shows counting runtime vs slice length.
The runtime of the backtracking encoder grows with the length of the slice.
The interval encoder counts any interval in typically constant time.

![counting runtime comparison](./benches/result/counting-o200k.svg)
![counting runtime comparison](./images/performance-counting.svg)

### Comparison with other tokenizers

We compared the encoding performance of our encoder with two popular implementations, tiktoken and Huggingface tokenizers.

The benchmark measured the runtime of encoding of slices of lengths 10, 100, 1000, and 10000 from a random 20000 token original text using the o200k token set.
(All encodings were computed from scratch for each slice.)
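
As a rough illustration of this setup, here is a simplified criterion sketch (illustrative only; the actual benchmark lives in `performance.rs` in the new benchmarks crate and may differ in detail):

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use rand::Rng;

// Simplified sketch of the slice benchmark described above; the real benchmark
// differs in how the input text is built (random tokens rather than random ASCII).
fn bench_encoding(c: &mut Criterion) {
    let tok = bpe_openai::o200k();
    let mut rng = rand::thread_rng();

    // Stand-in for the "random 20000 token" text: random ASCII, so any byte
    // offset is a valid char boundary.
    let text: String = (0..200_000)
        .map(|_| rng.gen_range(b'a'..=b'z') as char)
        .collect();

    for len in [10usize, 100, 1_000, 10_000] {
        let start = rng.gen_range(0..text.len() - len);
        let slice = &text[start..start + len];
        // Each iteration encodes the slice from scratch.
        c.bench_function(&format!("encode-o200k-{len}"), |b| b.iter(|| tok.encode(slice)));
    }
}

criterion_group!(benches, bench_encoding);
criterion_main!(benches);
```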

In this benchmark all tokenizers pre-tokenize their input and produce the same tokens and decoded texts as the tiktoken tokenizer.
An effect of pre-tokenization is that the inputs to the actual BPE logic are typically much smaller than the overall input size, especially for larger inputs.
It is therefore difficult to judge the performance differences of the BPE logic from this benchmark.
It does, however, give a good indication of how the algorithms might perform in practice.

The graph below shows encoding runtime vs slice length.
All encoders (except the heap encoder) show the expected linear runtime complexity.
The backtracking encoder, the fastest encoder that still returns correct results, shows a performance gain of approximately 3.5x compared to tiktoken.
The fully dynamic programming solution and the heap implementation are still quite competitive with tiktoken (especially for smaller inputs).
If the requirement of correct BPE output can be relaxed, then the Greedy approach or the minimal encoding approach are the clear winners.

An interesting observation here is that pre-tokenization slows down encoding quite a bit.
Compared with the encoding benchmark above, the backtracking encoder without pre-tokenization is almost 4x faster than the one with pre-tokenization in this benchmark.
This suggests that pre-tokenization is not necessary from a performance perspective, and makes pre-tokenization a good target for further optimization.

![encoding runtime comparison](./images/performance-comparison.svg)

The graph below shows encoding results for input that is particularly challenging for tiktoken.
The input consists of random ranges taken from the continuous list of all Unicode code points excluding whitespace.
The performance of tiktoken shows a quadratic growth with the input size.
The Huggingface encoder scales better, but becomes slower and slower compared to our implementation as input size increases.

![worst-case encoding runtime comparison](./images/performance-worstcase.svg)

### Running the benchmarks

Benchmarks are located in a separate crate in the `benchmarks` directory.

```sh
cd benchmarks
```

Run the benchmark as follows (requires [cargo-criterion](https://crates.io/crates/cargo-criterion) to be installed):

```sh
Expand All @@ -280,5 +317,5 @@ Open the full report which should be located in `target/criterion/reports/index.
Update the figures in this repo as follows (requires `rsvg-convert` from `librsvg` to be installed):

```sh
script/copy-benchmark-results
script/copy-results
```
1 change: 1 addition & 0 deletions crates/bpe/benchmarks/.gitignore
@@ -0,0 +1 @@
target/
26 changes: 26 additions & 0 deletions crates/bpe/benchmarks/Cargo.toml
@@ -0,0 +1,26 @@
[package]
name = "bpe-benchmarks"
edition = "2021"

[lib]
path = "lib.rs"
test = false

[[bench]]
name = "performance"
path = "performance.rs"
harness = false
test = false

[[test]]
name = "equivalence"
path = "equivalence.rs"
test = true

[dependencies]
bpe = { path = "../../bpe", features = ["rand", "tiktoken-rs"] }
bpe-openai = { path = "../../bpe-openai" }
criterion = "0.5"
rand = "0.8"
tiktoken-rs = "0.5"
tokenizers = { version = "0.20", features = ["http"] }
18 changes: 18 additions & 0 deletions crates/bpe/benchmarks/criterion.toml
@@ -0,0 +1,18 @@
# save report in this directory, even if a custom target directory is set
criterion_home = "./target/criterion"

# The colors table allows users to configure the colors used by the charts
# cargo-criterion generates.
[colors]
# Color-blind friendly color scheme from https://personal.sron.nl/~pault/.
comparison_colors = [
{r = 51, g = 34, b = 136 }, # indigo
{r = 136, g = 204, b = 238 }, # cyan
{r = 68, g = 170, b = 153 }, # teal
{r = 17, g = 119, b = 51 }, # green
{r = 153, g = 153, b = 51 }, # olive
{r = 221, g = 204, b = 119 }, # sand
{r = 204, g = 102, b = 119 }, # rose
{r = 136, g = 34, b = 85 }, # wine
{r = 170, g = 68, b = 153 }, # purple
]