RoDmitry/langram

Langram - the most accurate language detection library

Crate API

314 ScriptLanguages (187 models + 127 single language scripts)

A language that can be written in multiple scripts is detected as a separate ScriptLanguage (language + script) for each script.
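To illustrate the language + script pairing, here is a minimal sketch. The `Language`, `Script`, and `LangScript` types are hypothetical names made up for this example; they are not the types exposed by langram or alphabet_detector.

```rust
// Hypothetical types for illustration only (not the langram API).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Language {
    Serbian,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Script {
    Cyrillic,
    Latin,
}

/// One detection target: a language written in one specific script.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct LangScript {
    language: Language,
    script: Script,
}

/// Builds the two targets for Serbian, which is written in both scripts.
fn serbian_targets() -> [LangScript; 2] {
    [
        LangScript { language: Language::Serbian, script: Script::Cyrillic },
        LangScript { language: Language::Serbian, script: Script::Latin },
    ]
}
```

The two values share a language but compare as different targets, which is the distinction described above.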

Uses alphabet_detector as a word separator and language prefilter.

Based on a modified n-gram language model algorithm that uses character n-grams (1 to 5 chars) and word unigrams (1 word).
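As a rough sketch of what character n-grams of length 1 to 5 look like, and how a plain n-gram model could score a word, consider the following. It is a generic illustration, not langram's modified algorithm: `char_ngrams`, `score`, the log-probability table, and the `unseen_log_prob` fallback are all assumptions made for this example.

```rust
use std::collections::HashMap;

/// Collects every character n-gram of length 1 to `max_n` from one word.
/// Works on `char`s rather than bytes, so multi-byte scripts are split correctly.
fn char_ngrams(word: &str, max_n: usize) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut ngrams = Vec::new();
    // Never ask for n-grams longer than the word itself.
    for n in 1..=max_n.min(chars.len()) {
        for window in chars.windows(n) {
            ngrams.push(window.iter().collect());
        }
    }
    ngrams
}

/// Sums log-probabilities of all 1 to 5 char n-grams of `word` under one model.
/// N-grams missing from the model get `unseen_log_prob` (an assumed fallback).
fn score(word: &str, model: &HashMap<String, f64>, unseen_log_prob: f64) -> f64 {
    char_ngrams(word, 5)
        .iter()
        .map(|ngram| model.get(ngram).copied().unwrap_or(unseen_log_prob))
        .sum()
}
```

In such a scheme, the language whose model yields the highest score for the input would be reported as the best match.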

ModelsStorage with all models preloaded uses around 4.1GB of RAM (2.4GB when using max_trigrams). A possible (unimplemented) alternative is to unload each language model after use; it would be slower but would use only around 300MB of RAM. Another option is to keep the models in an on-disk database rather than in a HashMap in RAM.
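The unload-after-use idea could look roughly like the sketch below, which keeps at most one model in memory and reloads on demand. It is only an illustration of the trade-off: `LazyModels`, the `Model` alias, and the `loader` callback are hypothetical and do not exist in langram.

```rust
use std::collections::HashMap;

/// Hypothetical model representation: n-gram -> log-probability.
type Model = HashMap<String, f64>;

/// Keeps a single model in RAM; asking for another language drops the old one.
struct LazyModels<F: Fn(&str) -> Model> {
    loader: F,
    loaded: Option<(String, Model)>,
}

impl<F: Fn(&str) -> Model> LazyModels<F> {
    fn new(loader: F) -> Self {
        Self { loader, loaded: None }
    }

    /// Returns the model for `lang`, calling the loader only on a cache miss.
    fn get(&mut self, lang: &str) -> &Model {
        let cached = matches!(&self.loaded, Some((name, _)) if name.as_str() == lang);
        if !cached {
            // Replacing the Option drops the previously loaded model.
            self.loaded = Some((lang.to_owned(), (self.loader)(lang)));
        }
        &self.loaded.as_ref().unwrap().1
    }
}
```

Memory stays bounded by the largest single model, at the cost of a reload every time the requested language changes.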

This library is a complete rewrite of Lingua: much faster, more accurate, with more languages, etc.

Accuracy report

Comparison with other language detectors

Setup

To use it, you need to patch langram_models in Cargo.toml:

  • From Git:
[patch.crates-io]
langram_models = { git = "https://github.com/RoDmitry/langram_models.git" }
  • From a local path:
[patch.crates-io]
langram_models = { path = "../langram_models" }

The local path option is more advanced: it allows you to remove model n-grams, so that the final executable is lighter.

License

Apache-2.0, MIT (see LICENSE-APACHE and LICENSE-MIT)
