Skip to content

Commit

Permalink
section -> substring
Browse files Browse the repository at this point in the history
  • Loading branch information
hendrikvanantwerpen committed Oct 14, 2024
1 parent 9b7f311 commit 9c53252
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions crates/bpe/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,11 +35,11 @@ There are mostly two strategies for BPE encoding.
1) Trivial solution. Search brute force for the most frequent pair in the encoded text according the dictionary and replace those occurrences. This has a `O(n^2)` complexity and is therefore not very appealing in production.
2) Heap based. Set up a heap with the frequencies. This improves the linear search time to a logarithmic factor. If done properly, the overall complexity reduces now to `O(n log n)`.

Note that many tokenizers split the input into sections and then process each section individually. This shrinks in theory the complexity to `O(n)` if the section size is small enough. But it will in general produce now different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. Input splitting may is therefore not a viable strategy for improving encoding performance.
Note that many tokenizers split the input into substrings and then process each substring individually. This shrinks in theory the complexity to `O(n)` if the substring size is small enough. But it will in general produce now different results. In order to produce the "correct" encoding, one would need to choose split points at token boundaries. But without having the text encoded already, this is in general impossible. Input splitting may is therefore not a viable strategy for improving encoding performance.

We have implemented a fast heap based solution as baseline. It uses a bitfield to mark token boundaries. This is more memory efficient than using linked lists or other approaches and should also be faster.

Note: the tik-token library uses a combination of 1) and 3) where sections are determined via a set of regular expressions. Unfortunately, this approach leads to encodings which differ from the original BPE algorithm and can therefore not be used as reference implementation for our approach, but it also has quadratic worst case complexity for certain inputs which makes it impractical for production use!
Note: the tik-token library uses a combination of 1) and 3) where substrings are determined via a set of regular expressions. Unfortunately, this approach leads to encodings which differ from the original BPE algorithm and can therefore not be used as reference implementation for our approach, but it also has quadratic worst case complexity for certain inputs which makes it impractical for production use!

## Properties of BPE

Expand Down

0 comments on commit 9c53252

Please sign in to comment.