Commit 1fca5e9

Update benchmark
1 parent ec14d05 commit 1fca5e9

6 files changed: +152 -139 lines changed

crates/bpe/README.md

Lines changed: 4 additions & 1 deletion
@@ -283,7 +283,10 @@ It does give a good indication of how the algorithms might perform in practice.

 The graph below shows encoding runtime vs slice length.
 All encoders show a similar runtime complexity.
-The backtracking encoder and tiktoken have comparable performance, and both are about 3.5--4x faster than the Huggingface encoder.
+The backtracking encoder is about 3x faster than tiktoken.
+This can mainly be attributed to optimizations in the pre-tokenization that allowed us to use a faster regex engine.
+Without those, their performance is comparable.
+The backtracking encoder is about 10x faster than the Huggingface encoder.

 An interesting observation here is that pre-tokenization slows down encoding quite a bit.
 Compared with the encoding benchmark above, the backtracking encoder without pre-tokenization is almost 4x faster than the one with pre-tokenization in this benchmark.

crates/bpe/images/performance-appending.svg

Lines changed: 10 additions & 10 deletions

crates/bpe/images/performance-comparison.svg

Lines changed: 49 additions & 45 deletions

crates/bpe/images/performance-counting.svg

Lines changed: 10 additions & 10 deletions
