Optimize statistical unigram tokenizer decode_forward
#63
I was testing out the `SentencePieceModel` tokenizer on longer text (web articles of a few thousand characters) and noticed that tokenization was taking a long time. Looking at the code for the `decode_forward` pass, it seems candidate spans `(char_start, char_end)` of arbitrary length are considered, even though the vocabulary has some maximum-length element. Constraining decoding to spans of at most this maximum length yields the same result, since no longer substring can be present in the vocabulary. This change has a dramatic impact when tokenizing longer pieces of text.

This PR addresses the problem by computing a `max_vocab_codeunit_len` for the `SentencePieceModel`, caching the longest code-unit length of any vocabulary element. This field is used to truncate the search during decoding. The gist below highlights the performance gap. There is existing test coverage for this code and those tests still pass, but I'm happy to add more if there's something else worth testing in the implementation.
Before this PR:
7.098319 seconds (195.33 k allocations: 11.746 MiB, 1.41% compilation time)
With this PR:
0.016252 seconds (8.23 k allocations: 1.026 MiB)
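For reference, a hypothetical micro-benchmark of the sketch above (not the gist referenced earlier, and not the package's real API) could look like this; the vocabulary and text are made up, so the numbers will not match the timings shown.

```julia
# Hypothetical micro-benchmark for the sketch above; vocabulary, text, and
# log probabilities are made up for illustration only.
vocab = Dict(w => -float(length(w)) for w in
             ["the", " ", "qu", "ick", "quick", "t", "h", "e"])
spm = ToySentencePieceModel(vocab)
long_text = repeat("the quick ", 1_000)

@time decode_forward_sketch(spm, long_text)
```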