Update README
hendrikvanantwerpen committed Oct 17, 2024
1 parent 3f40ce9 commit a6df865
Showing 6 changed files with 150 additions and 140 deletions.
5 changes: 4 additions & 1 deletion crates/bpe/README.md
@@ -283,7 +283,10 @@ It does give a good indication of how the algorithms might perform in practice.

 The graph below shows encoding runtime vs slice length.
 All encoders show a similar runtime complexity.
-The backtracking encoder and tiktoken have comparable performance, and both are about 3.5--4x faster than the Huggingface encoder.
+The backtracking encoder is about 3x faster than tiktoken.
+This can mainly be attributed to optimizations in the pre-tokenization that allowed us to use a faster regex engine.
+Without those, their performance is comparable.
+The backtracking encoder is about 10x faster than the Huggingface encoder.

 An interesting observation here is that pre-tokenization slows down encoding quite a bit.
 Compared with the encoding benchmark above, the backtracking encoder without pre-tokenization is almost 4x faster than the one with pre-tokenization in this benchmark.
20 changes: 10 additions & 10 deletions crates/bpe/images/performance-appending.svg
94 changes: 49 additions & 45 deletions crates/bpe/images/performance-comparison.svg
20 changes: 10 additions & 10 deletions crates/bpe/images/performance-counting.svg