# Benchmark for Scandinavian Language Tokenizers

This repo provides tools for evaluating the efficiency of various tokenizers for Swedish, Danish, Norwegian Bokmål and Norwegian Nynorsk; English is also included for comparison. Tokenizer efficiency is measured by tokenizing a total of 100k words sampled from the top 500 Wikipedia pages for each language.

Tokenizer efficiency, $E$, is defined as the ratio of the total number of words, $W$, to the total number of tokens, $T$, multiplied by 100 to express it as a percentage:

$$E = \frac{W}{T} \times 100$$
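
As a minimal illustration, the sketch below computes this metric for a single tokenizer on one corpus file using the Hugging Face `transformers` library. The model name, file path, and whitespace-based word count are placeholder assumptions for illustration and do not reproduce the exact procedure in `run_test.py`.

```python
# Minimal sketch of the efficiency metric E = W / T * 100 for one corpus file.
# The model name, file path, and whitespace word count are assumptions for
# illustration; run_test.py's actual procedure may differ.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

with open("wikipedia_100k/wiki_no.txt", encoding="utf-8") as f:
    text = f.read()

num_words = len(text.split())                                        # W
num_tokens = len(tokenizer.encode(text, add_special_tokens=False))   # T

efficiency = num_words / num_tokens * 100
print(f"E = {efficiency:.1f}%, {num_tokens / num_words:.2f} tokens/word")
```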

## Scandinavian Tokenizers

| Tokenizer | Type | Vocab Size | en (%) | sv (%) | da (%) | no (%) | nn (%) | Average | Tokens/Word |
|---|---|---|---|---|---|---|---|---|---|
| AISweedenRoberta | BPE | 50,265 | 68 | 75 | 77 | 75 | 67 | 72.8% | 1.38 |
| Viking | BPE | 131,072 | 76 | 68 | 70 | 69 | 69 | 70.9% | 1.41 |
| MBart | SentencePiece | 250,027 | 74 | 65 | 67 | 67 | 63 | 68.0% | 1.48 |
| ScandEng | BPE | 32,000 | 67 | 60 | 65 | 67 | 66 | 65.6% | 1.53 |
| Gemma | SentencePiece | 256,000 | 81 | 60 | 61 | 61 | 60 | 65.0% | 1.56 |
| norMistral | BPE | 32,768 | 62 | 52 | 62 | 70 | 66 | 62.9% | 1.61 |
| mT5 | SentencePiece | 250,100 | 69 | 58 | 60 | 60 | 58 | 61.7% | 1.63 |
| Llama3 | BPE | 128,000 | 84 | 53 | 55 | 55 | 54 | 60.5% | 1.7 |
| GPT-J | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| NB-GPT-J | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| Roberta | BPE | 50,265 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| GPT2 | BPE | 50,257 | 89 | 46 | 49 | 50 | 48 | 56.8% | 1.87 |
| Llama | BPE | 32,000 | 71 | 50 | 49 | 49 | 49 | 54.1% | 1.89 |
| MabeckMistral | WordPiece | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
| BinericGPT | WordPiece | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
| Mistral | BPE | 32,000 | 72 | 48 | 48 | 48 | 48 | 53.3% | 1.93 |
| KBLab-Megatron | WordPiece | 64,005 | 52 | 61 | 45 | 45 | 45 | 50.1% | 2.02 |

## Not Fully Supported Tokenizers

| Tokenizer | Type | Vocab Size | Scand Test | Nordic Test | Eng Test | Average | Tokens/Word |
|---|---|---|---|---|---|---|---|
| NB-BERT | WordPiece | 50,000 | OK (lower) | Failed | OK (lower) | 86.0% | 1.3 |
| NorBert | WordPiece | 50,000 | Failed | Failed | Failed | 82.5% | 1.4 |
| norT5 | SentencePiece | 50,000 | Failed | Failed | Failed | 82.5% | 1.4 |
| mBERT | WordPiece | 105,879 | Failed | Failed | OK (lower) | 72.8% | 1.34 |
| ScandEng | BPE | 32,000 | Success | Failed | Success | 67.7% | 1.53 |
| KBLab-BERT | WordPiece | 50,325 | Failed | Failed | Success | 63.2% | 1.51 |
| Saattrupdan-no | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 59.2% | 1.94 |
| Saattrupdan-scand | WordPiece | 100,000 | OK (lower) | OK (lower) | OK (lower) | 56.3% | 1.84 |
| Bert | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
| DistilBert | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
| LayoutLM | WordPiece | 30,522 | Failed | Failed | OK (lower) | 52.3% | 1.74 |
| Saattrupdan-da | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 51.0% | 1.97 |
| Saattrupdan-sv | WordPiece | 30,000 | OK (lower) | OK (lower) | OK (lower) | 43.9% | 2.1 |
| XLNet | SentencePiece | 32,000 | Failed | Failed | Success | 41.0% | 2.21 |
| T5 | SentencePiece | 32,100 | Failed | Failed | Success | 36.9% | 2.46 |

## sample_wikipedia.py

This script builds the Wikipedia corpus used by the tokenization benchmark for the defined set of languages. For a specified date, it extracts the first 200 words from each article; articles shorter than 200 words are dropped. In the default mode it keeps sampling until it has collected 100k words.

To create the corpus files, run the commands below:

```bash
for lang in en no nn da sv; do python sample_wikipedia.py --language $lang --output_file wikipedia_100k/wiki_$lang.txt --num_articles 500 --num_words 200; done
for lang in en no nn da sv; do python sample_wikipedia.py --language $lang --output_file wikipedia_1k/wiki_$lang.txt --num_articles 50 --num_words 20; done
```
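
For reference, the sampling rules described above (keep the first 200 words of each article, drop shorter articles, stop at a 100k-word budget) amount to something like the following. This is a self-contained sketch with illustrative names; it does not reproduce sample_wikipedia.py's actual code or its Wikipedia download step.

```python
# Hypothetical sketch of the sampling rules; names are illustrative and the
# Wikipedia download step is omitted.
def sample_corpus(articles, num_words=200, total_budget=100_000):
    """Keep the first `num_words` words per article, skip shorter articles,
    and stop once `total_budget` words have been collected."""
    samples, collected = [], 0
    for article in articles:
        words = article.split()
        if len(words) < num_words:        # articles shorter than the cutoff are dropped
            continue
        samples.append(" ".join(words[:num_words]))
        collected += num_words
        if collected >= total_budget:     # e.g. 500 articles x 200 words = 100k words
            break
    return "\n".join(samples)
```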

## run_test.py

This script runs the test and creates the tables in this document.

```bash
python run_test.py
```

### Faster test run

```bash
python run_test.py --directory wikipedia_1k/
```



