Toiro is a tool for comparing Japanese tokenizers.
- Compare the processing speed of tokenizers
- Compare how each tokenizer segments a text into words
- Compare the performance of tokenizers by benchmarking application tasks (e.g., text classification)
It also provides useful functions for natural language processing in Japanese.
- Data downloader for Japanese text corpora
- Preprocessor for these corpora
- Text classifier for Japanese text (e.g., SVM, BERT)
Python 3.6+ is required. You can install toiro with the following command. Janome is included in the default installation.
pip install toiro
To use an additional tokenizer with toiro, install it separately. For example, the following commands add SudachiPy and nagisa.
pip install sudachipy sudachidict_core
pip install nagisa
How to install other tokenizers
pip install mecab-python3
pip install spacy ginza
pip install spacy[ja]
KyTea itself must be installed beforehand. Please refer to the KyTea installation instructions.
pip install kytea
Juman++ v2 itself must be installed beforehand. Please refer to the Juman++ installation instructions.
pip install pyknp
pip install sentencepiece
pip install fugashi ipadic
pip install fugashi unidic-lite
pip install tinysegmenter3
If you want to install all the tokenizers at once, use the following command.
pip install toiro[all_tokenizers]
You can check which tokenizers are available in your Python environment.
from toiro import tokenizers
available_tokenizers = tokenizers.available_tokenizers()
print(available_tokenizers)
Toiro supports 12 different Japanese tokenizers. The following output comes from an environment where SudachiPy and nagisa have been added to the default installation.
{'nagisa': {'is_available': True, 'version': '0.2.7'},
'janome': {'is_available': True, 'version': '0.3.10'},
'mecab-python3': {'is_available': False, 'version': False},
'sudachipy': {'is_available': True, 'version': '0.4.9'},
'spacy': {'is_available': False, 'version': False},
'ginza': {'is_available': False, 'version': False},
'kytea': {'is_available': False, 'version': False},
'jumanpp': {'is_available': False, 'version': False},
'sentencepiece': {'is_available': False, 'version': False},
'fugashi-ipadic': {'is_available': False, 'version': False},
'fugashi-unidic': {'is_available': False, 'version': False},
'tinysegmenter': {'is_available': False, 'version': False}}
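The dictionary above can be filtered to list only the tokenizers that are actually installed. The following is a minimal sketch, assuming the result has the shape shown in the example output; the helper name `installed_tokenizers` is hypothetical and not part of toiro, and the sample data is a subset copied from the output above so that the snippet runs without toiro installed.

```python
# Subset of the example output above, used here as sample data.
# In practice this dictionary would come from tokenizers.available_tokenizers().
status = {
    "nagisa": {"is_available": True, "version": "0.2.7"},
    "janome": {"is_available": True, "version": "0.3.10"},
    "mecab-python3": {"is_available": False, "version": False},
    "sudachipy": {"is_available": True, "version": "0.4.9"},
}


def installed_tokenizers(status):
    """Hypothetical helper: map each available tokenizer name to its version."""
    return {
        name: info["version"]
        for name, info in status.items()
        if info["is_available"]
    }


installed = installed_tokenizers(status)
print(installed)
```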
Download the livedoor news corpus and compare the processing speed of tokenizers.
from toiro import tokenizers
from toiro import datadownloader
# A list of available corpora in toiro
corpora = datadownloader.available_corpus()
print(corpora)
#=> ['livedoor_news_corpus', 'yahoo_movie_reviews', 'amazon_reviews']
# Download the livedoor news corpus and load it as pandas.DataFrame objects
corpus = corpora[0]
datadownloader.download_corpus(corpus)
train_df, dev_df, test_df = datadownloader.load_corpus(corpus)
texts = train_df[1]
# Compare the processing speed of tokenizers
report = tokenizers.compare(texts)
#=> [1/3] Tokenizer: janome
#=> 100%|███████████████████| 5900/5900 [00:07<00:00, 746.21it/s]
#=> [2/3] Tokenizer: nagisa
#=> 100%|███████████████████| 5900/5900 [00:15<00:00, 370.83it/s]
#=> [3/3] Tokenizer: sudachipy
#=> 100%|███████████████████| 5900/5900 [00:08<00:00, 696.68it/s]
print(report)
{'execution_environment': {'python_version': '3.7.8.final.0 (64 bit)',
'arch': 'X86_64',
'brand_raw': 'Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz',
'count': 8},
'data': {'number_of_sentences': 5900, 'average_length': 37.69593220338983},
'janome': {'elapsed_time': 9.114670515060425},
'nagisa': {'elapsed_time': 15.873093605041504},
'sudachipy': {'elapsed_time': 9.05256724357605}}
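Elapsed times are easier to compare when converted to throughput. The following is a rough sketch, assuming the report has the structure shown above, with an `elapsed_time` in seconds for each tokenizer and the sentence count under `data`. The helper name `rank_by_speed` is hypothetical and not part of toiro, and the sample values are copied from the report above.

```python
# Sample data copied from the report above (execution_environment omitted).
report = {
    "data": {"number_of_sentences": 5900, "average_length": 37.69593220338983},
    "janome": {"elapsed_time": 9.114670515060425},
    "nagisa": {"elapsed_time": 15.873093605041504},
    "sudachipy": {"elapsed_time": 9.05256724357605},
}


def rank_by_speed(report):
    """Hypothetical helper: (name, sentences per second) pairs, fastest first."""
    n_sentences = report["data"]["number_of_sentences"]
    rows = [
        (name, n_sentences / entry["elapsed_time"])
        for name, entry in report.items()
        # Only tokenizer entries carry an elapsed_time key.
        if isinstance(entry, dict) and "elapsed_time" in entry
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = rank_by_speed(report)
for name, rate in ranking:
    print(f"{name}: {rate:.1f} sentences/s")
```

Throughput measured this way depends on the machine and corpus, so it is best read as a relative ranking rather than an absolute figure.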
# Compare how each tokenizer segments the same text
text = "都庁所在地は新宿区。"
tokenizers.print_words(text, delimiter="|")
#=> janome: 都庁|所在地|は|新宿|区|。
#=> nagisa: 都庁|所在|地|は|新宿|区|。
#=> sudachipy: 都庁|所在地|は|新宿区|。
You can use all the tokenizers by running a Docker container from the image on Docker Hub.
docker run --rm -it taishii/toiro /bin/bash
How to run the Python interpreter in the Docker container
Run the Python interpreter.
root@cdd2ad2d7092:/workspace# python3
Compare how each tokenizer segments the same text.
>>> from toiro import tokenizers
>>> text = "都庁所在地は新宿区。"
>>> tokenizers.print_words(text, delimiter="|")
mecab-python3: 都庁|所在地|は|新宿|区|。
janome: 都庁|所在地|は|新宿|区|。
nagisa: 都庁|所在|地|は|新宿|区|。
sudachipy: 都庁|所在地|は|新宿区|。
spacy: 都庁|所在|地|は|新宿|区|。
ginza: 都庁|所在地|は|新宿区|。
kytea: 都庁|所在|地|は|新宿|区|。
jumanpp: 都庁|所在|地|は|新宿|区|。
sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
tinysegmenter: 都庁所|在地|は|新宿|区|。
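With twelve tokenizers, it can help to group those that produce identical segmentations. The following is a minimal sketch that parses the printed text shown above, assuming each line has the form `name: word|word|...`. The helper name `group_by_segmentation` is hypothetical and not part of toiro, and the sample text is copied from the output above.

```python
from collections import defaultdict

# Output of tokenizers.print_words(text, delimiter="|"), copied from above.
output = """\
mecab-python3: 都庁|所在地|は|新宿|区|。
janome: 都庁|所在地|は|新宿|区|。
nagisa: 都庁|所在|地|は|新宿|区|。
sudachipy: 都庁|所在地|は|新宿区|。
spacy: 都庁|所在|地|は|新宿|区|。
ginza: 都庁|所在地|は|新宿区|。
kytea: 都庁|所在|地|は|新宿|区|。
jumanpp: 都庁|所在|地|は|新宿|区|。
sentencepiece: ▁|都|庁|所在地|は|新宿|区|。
fugashi-ipadic: 都庁|所在地|は|新宿|区|。
fugashi-unidic: 都庁|所在|地|は|新宿|区|。
tinysegmenter: 都庁所|在地|は|新宿|区|。
"""


def group_by_segmentation(output):
    """Hypothetical helper: map each distinct segmentation to its tokenizers."""
    groups = defaultdict(list)
    for line in output.strip().splitlines():
        # Split on the first ": " only; the segmented text follows it.
        name, _, segmented = line.partition(": ")
        groups[segmented.strip()].append(name.strip())
    return dict(groups)


groups = group_by_segmentation(output)
for segmented, names in groups.items():
    print(f"{segmented}  <=  {', '.join(names)}")
```

For this sentence the twelve tokenizers fall into five distinct segmentations, which makes the disagreement over 所在地 and 新宿区 easier to see.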
The slides at PyCon JP 2020
Tutorials in Japanese
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.