One language can be written in multiple scripts, so each variant is detected as a different ScriptLanguage (language + script).
Uses alphabet_detector as a word separator and language prefilter.
Based on a modified n-gram language model algorithm that uses character n-grams (lengths 1 to 5) and single-word n-grams.
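As a rough illustration of the character n-gram part, here is a minimal sketch of how all n-grams of lengths 1 to 5 could be collected from one word. The function name `char_ngrams` and its signature are hypothetical and are not taken from this library; the actual modified algorithm and its scoring are not shown here.

```rust
/// Hypothetical helper (not part of this crate's API): collects every
/// character n-gram of length 1..=max_n from a single word.
/// It iterates over `char`s rather than bytes, so multi-byte scripts
/// such as Cyrillic or Devanagari are split on character boundaries.
fn char_ngrams(word: &str, max_n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut out = Vec::new();
    // Never ask for n-grams longer than the word itself.
    for n in 1..=max_n.min(chars.len()) {
        // `windows(n)` yields every contiguous run of `n` characters.
        for window in chars.windows(n) {
            out.push(window.iter().collect());
        }
    }
    out
}
```

Calling `char_ngrams("abc", 5)` yields the unigrams first, then the bigrams, then the single trigram. In a real detector each n-gram would then be looked up in a per-language model, which is outside the scope of this sketch.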
ModelsStorage with all models preloaded uses around 4.1GB of RAM (2.4GB using max_trigrams). It would be possible (not implemented) to unload each language model after use: this would be slower but would use around 300MB of RAM. Alternatively, the models could be stored in a DB on disk rather than in a HashMap in RAM.
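The unload-after-use idea is not implemented, but it could look roughly like the sketch below. Everything here is an assumption for illustration: the `Model` alias, the `with_model` function, and the loader closure are hypothetical and do not exist in this library.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one language's n-gram table
/// (not the crate's real model type).
type Model = HashMap<String, f64>;

/// Loads one model on demand, runs `f` against it, then drops it,
/// so that only a single model is resident in memory at a time.
/// The trade-off is a reload cost on every call.
fn with_model<L, F, R>(load: L, lang: &str, f: F) -> R
where
    L: Fn(&str) -> Model,
    F: FnOnce(&Model) -> R,
{
    let model = load(lang); // e.g. deserialize the model from disk
    let result = f(&model);
    drop(model); // explicit for clarity; it would be freed at scope end anyway
    result
}
```

The loader is passed in as a closure so the same wrapper works whether models come from files, an embedded archive, or a database.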
This library is a complete rewrite of Lingua: it is much faster, more accurate, supports more languages, etc.
Comparison with other language detectors
To use it, you need to patch langram_models in Cargo.toml:
- From Git:
[patch.crates-io]
langram_models = { git = "https://github.com/RoDmitry/langram_models.git" }
- From predownloaded copy (langram_models):
[patch.crates-io]
langram_models = { path = "../langram_models" }
The predownloaded copy is the more advanced option: it allows you to remove model n-grams, so that the final executable is lighter.