Optimize statistical unigram tokenizer decode_forward
#63
I was testing out the `SentencePieceModel` tokenizer on longer text (web articles of a few thousand characters) and noticed that tokenization was taking a long time. Looking at the code for the `decode_forward` pass, it seems candidate spans `(char_start, char_end)` of arbitrary length are considered, even though the vocabulary has some maximum-length element. Constraining decoding to spans of at most this maximum length yields the same result, since no longer substring can be present in the vocabulary. This change has a dramatic impact when tokenizing longer pieces of text.

This PR addresses the problem by computing a `max_vocab_codeunit_len` for the `SentencePieceModel`, caching the longest code-unit length of any vocabulary element. This field is used to truncate the search during decoding. The gist below highlights the performance gap. There is existing test coverage for this code and those tests still pass, but I'm happy to add more if there's something else worth testing in the implementation.
Before this PR:
7.098319 seconds (195.33 k allocations: 11.746 MiB, 1.41% compilation time)
With this PR:
0.016252 seconds (8.23 k allocations: 1.026 MiB)
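For reference, a hypothetical micro-benchmark of the sketch above (not the gist referenced earlier, and not the package's real API) could look like this; the vocabulary and text are made up, so the numbers will not match the timings shown.

```julia
# Hypothetical micro-benchmark for the sketch above; vocabulary, text, and
# log probabilities are made up for illustration only.
vocab = Dict(w => -float(length(w)) for w in
             ["the", " ", "qu", "ick", "quick", "t", "h", "e"])
spm = ToySentencePieceModel(vocab)
long_text = repeat("the quick ", 1_000)

@time decode_forward_sketch(spm, long_text)
```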