DmitryAsdre/UnigramTokenization

Unigram Tokenization

This is a simple Python implementation of Unigram Tokenization.

Algorithm Description

  1. Use a Byte Pair Encoding (BPE) tokenizer to create an arbitrarily large seed vocabulary $\mathcal{V}$.
  2. Initialize the distribution over tokens as $p(x_i) = \frac{count(x_i)}{\sum_{j=1}^{N}count(x_j)}$.
  3. Use the hard EM algorithm to estimate the distribution $p(x_i)$:
    • Repeat these steps until convergence:
      • Employ the Viterbi algorithm to find the best tokenization $\mathcal{T}$.
      • Fix the best tokenization $\mathcal{T}$ and maximize the likelihood: $$P(X) = \prod_{i=1}^{N_{\mathcal{T}}} p(x_i)$$
  4. Shrink the vocabulary $\mathcal{V}$ by a factor $\alpha$:
    • For each token $x_i$, calculate the loss in likelihood if $x_i$ is removed and replaced with its Viterbi path over the remaining vocabulary.
    • Sort tokens by this loss.
    • Drop the lowest-loss tokens so that $|\mathcal{V}_{new}| = (1 - \alpha)|\mathcal{V}_{old}|$.
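The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the repository's code: the function names (`viterbi_tokenize`, `em_step`, `prune`) and the per-word corpus format are assumptions, and the seed log-probabilities are taken to come from BPE token counts as in step 2.

```python
import math
from collections import Counter

def viterbi_tokenize(word, logp, max_len=10):
    """Find the highest-probability segmentation of `word` under
    unigram token log-probabilities `logp` (the E-step of step 3)."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)          # back[i]: start of the last token ending at i
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            tok = word[j:i]
            if tok in logp and best[j] + logp[tok] > best[i]:
                best[i] = best[j] + logp[tok]
                back[i] = j
    tokens, i = [], n
    while i > 0:                  # walk the back-pointers to recover the path
        tokens.append(word[back[i]:i])
        i = back[i]
    return tokens[::-1], best[n]

def em_step(corpus, logp):
    """One hard-EM iteration (step 3): re-tokenize the corpus with Viterbi,
    then re-estimate p(x_i) = count(x_i) / sum_j count(x_j) (the M-step)."""
    counts = Counter()
    for word in corpus:
        tokens, _ = viterbi_tokenize(word, logp)
        counts.update(tokens)
    total = sum(counts.values())
    return {t: math.log(c / total) for t, c in counts.items()}

def prune(logp, alpha=0.2):
    """Step 4: score each multi-character token by the per-occurrence drop in
    log-likelihood if it is replaced by its Viterbi path over the remaining
    vocabulary, then drop the `alpha` fraction with the smallest loss."""
    losses = {}
    for tok in logp:
        if len(tok) == 1:
            continue  # keep single characters so every word stays tokenizable
        rest = {t: lp for t, lp in logp.items() if t != tok}
        _, alt = viterbi_tokenize(tok, rest)
        losses[tok] = logp[tok] - alt
    drop = set(sorted(losses, key=losses.get)[: int(alpha * len(losses))])
    return {t: lp for t, lp in logp.items() if t not in drop}
```

In use, one would seed `logp` from BPE counts, call `em_step` until the likelihood stops improving, then alternate `prune` with further EM rounds until the vocabulary reaches the target size.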

Source code

  • You can find the Unigram tokenization implementation in unigram.ipynb

