This repo provides tools for evaluating the efficiency of various tokenizers for Swedish, Danish, Norwegian Bokmål and Norwegian Nynorsk. English is also included for comparison. Here we measure tokenizer efficiency by tokenizing a total of 100k words from the top 500 Wikipedia pages for each language.
Tokenizer efficiency, $E$, is defined as the ratio of the total number of words, $W$, to the total number of tokens, $T$, multiplied by 100 to express it as a percentage:

$$E = \frac{W}{T} \times 100$$
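As a minimal sketch of this definition (not the code used in `run_test.py`), the function below computes E = 100 · W / T for any tokenizer passed in as a callable. It assumes that words are counted by splitting on whitespace, which may differ from how the repo counts them. `toy_tokenize` is a hypothetical stand-in tokenizer so that the example runs without downloading a model.

```python
from typing import Callable, List


def tokenizer_efficiency(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Return E = 100 * W / T for one text.

    W is the number of whitespace-separated words (an assumption),
    T is the number of tokens returned by `tokenize`.
    """
    words = text.split()
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("tokenizer produced no tokens")
    return 100.0 * len(words) / len(tokens)


def toy_tokenize(text: str, chunk: int = 4) -> List[str]:
    """Hypothetical tokenizer: cut every word into pieces of at most `chunk` characters."""
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]


if __name__ == "__main__":
    sample = "jeg liker tokenisering"
    # 3 words, 6 toy tokens: "jeg", "like", "r", "toke", "nise", "ring"
    print(tokenizer_efficiency(sample, toy_tokenize))
```

To try a real tokenizer, any callable that maps a string to a list of tokens can be passed instead, for example the `tokenize` method of a Hugging Face tokenizer.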
Tokenizer | Type | Vocab Size | en | sv | da | no | nn | Average | Tokens/Word |
---|---|---|---|---|---|---|---|---|---|
AISweedenRoberta | BPE | 50,265 | 68 | 75 | 77 | 75 | 67 | 72.8% | 1.38 |
Viking | BPE | 131,072 | 76 | 68 | 70 | 69 | 69 | 70.9% | 1.41 |
MBart | SentencePiece | 250,027 | 74 | 65 | 67 | 67 | 63 | 68.0% | 1.48 |
ScandEng | BPE | 32,000 | 67 | 60 | 65 | 67 | 66 | 65.6% | 1.53 |
Gemma | SentencePiece | 256,000 | 81 | 60 | 61 | 61 | 60 | 65.0% | 1.56 |
norMistral | BPE | 32,768 | 62 | 52 | 62 | 70 | 66 | 62.9% | 1.61 |
mT5 | SentencePiece | 250,100 | 69 | 58 | 60 | 60 | 58 | 61.7% | 1.63 |
Llama3 | BPE | 128,000 | 84 | 53 | 55 | 55 | 54 | 60.5% | 1.7 |
GPT-J | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
NB-GPT-J | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
Roberta | BPE | 50,265 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
GPT2 | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
Llama | BPE | 32,000 | 71 | 50 | 49 | 49 | 49 | 54.1% | 1.89 |
MabeckMistral | WordPiece | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
BinericGPT | WordPiece | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
Mistral | BPE | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
KBLab-Megatron | WordPiece | 64,005 | 52 | 61 | 45 | 45 | 45 | 50.1% | 2.02 |
Tokenizer | Type | Vocab Size | Scand Test | Nordic Test | Eng Test | Average | Tokens/Word |
---|---|---|---|---|---|---|---|
NB-BERT | WordPiece | 50,000 | OK (lower) | Failed | OK (lower) | 86.0% | 1.3 |
NorBert | WordPiece | 50,000 | Failed | Failed | Failed | 82.5% | 1.4 |
norT5 | SentencePiece | 50,000 | Failed | Failed | Failed | 82.5% | 1.4 |
mBERT | WordPiece | 105,879 | Failed | Failed | OK (lower) | 72.8% | 1.34 |
ScandEng | BPE | 32,000 | Success | Failed | Success | 67.7% | 1.53 |
KBLab-BERT | WordPiece | 50,325 | Failed | Failed | Success | 63.2% | 1.51 |
Saattrupdan-no | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 59.2% | 1.94 |
Saattrupdan-scand | WordPiece | 100,000 | OK (lower) | OK (lower) | OK (lower) | 56.3% | 1.84 |
Bert | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
DistilBert | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
LayoutLM | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
Saattrupdan-da | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 51.0% | 1.97 |
Saattrupdan-sv | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 43.9% | 2.1 |
XLNet | SentencePiece | 32,000 | Failed | Failed | Success | 41.0% | 2.21 |
T5 | SentencePiece | 32,100 | Failed | Failed | Success | 36.9% | 2.46 |
The `sample_wikipedia.py` script creates a corpus from Wikipedia articles for the defined set of languages. It is the tool used to build the tokenization benchmark. It extracts the first 200 words from each article as of a specified date. Articles shorter than 200 words are dropped. In the default mode it samples until it has reached 100k words.
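The sampling rule described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation in `sample_wikipedia.py`: the function name `build_corpus` and its parameters are hypothetical, articles are assumed to arrive as plain strings, and words are assumed to be whitespace-separated. Fetching articles from Wikipedia is left out entirely.

```python
from typing import Iterable, List


def build_corpus(
    articles: Iterable[str],
    num_words: int = 200,
    total_words: int = 100_000,
) -> List[str]:
    """Keep the first `num_words` words of each sufficiently long article
    until `total_words` words have been collected."""
    excerpts: List[str] = []
    collected = 0
    for text in articles:
        words = text.split()
        if len(words) < num_words:
            continue  # articles shorter than the excerpt length are dropped
        excerpts.append(" ".join(words[:num_words]))
        collected += num_words
        if collected >= total_words:
            break  # target corpus size reached
    return excerpts


if __name__ == "__main__":
    demo = ["a b c", "one two three four five", "x y", "p q r s", "t u v w x"]
    print(build_corpus(demo, num_words=4, total_words=8))
```

With the defaults, 200 words per article and a 100k-word target correspond to 500 articles, which matches the `--num_articles 500 --num_words 200` arguments used for the full corpus.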
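To create the corpus files, run the commands below. The first builds the full 100k-word corpus (500 articles of 200 words each); the second builds a small 1k-word corpus (50 articles of 20 words each):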
for lang in en no nn da sv; do python sample_wikipedia.py --language $lang --output_file wikipedia_100k/wiki_$lang.txt --num_articles 500 --num_words 200;done
for lang in en no nn da sv; do python sample_wikipedia.py --language $lang --output_file wikipedia_1k/wiki_$lang.txt --num_articles 50 --num_words 20;done
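After generating the corpora, it can be useful to confirm that each file has roughly the expected size (about 100k words for `wikipedia_100k/`, about 1k for `wikipedia_1k/`). The helper below is a hypothetical sanity check that is not part of the repo. It assumes the output files are plain UTF-8 text named `wiki_<lang>.txt`, as in the commands above, and counts whitespace-separated words.

```python
from pathlib import Path
from typing import Dict, Optional, Sequence


def corpus_word_counts(
    directory: str,
    languages: Sequence[str] = ("en", "no", "nn", "da", "sv"),
) -> Dict[str, Optional[int]]:
    """Return the word count per language file, or None if the file is missing."""
    counts: Dict[str, Optional[int]] = {}
    for lang in languages:
        path = Path(directory) / f"wiki_{lang}.txt"
        if path.exists():
            counts[lang] = len(path.read_text(encoding="utf-8").split())
        else:
            counts[lang] = None
    return counts


if __name__ == "__main__":
    for lang, count in corpus_word_counts("wikipedia_100k").items():
        print(lang, count if count is not None else "missing")
```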
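The `run_test.py` script runs the test and creates the tables in this document. The second command below uses `--directory` to point it at the smaller 1k-word corpus.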
python run_test.py
python run_test.py --directory wikipedia_1k/