
Commit

Merge pull request #8 from LlmKira/dev-2
feat: No `numpy` required
sudoskys authored Jan 9, 2025
2 parents bbe4888 + 1db6c48 commit 88f88fb
Showing 6 changed files with 411 additions and 417 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -159,4 +159,4 @@ cython_debug/
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
.idea/
50 changes: 26 additions & 24 deletions README.md
@@ -6,21 +6,20 @@

## Overview

**fast-langdetect** provides ultra-fast and highly accurate language detection based on FastText, a library developed by
Facebook. This package is 80x faster than traditional methods and offers 95% accuracy.
**`fast-langdetect`** is an ultra-fast, highly accurate language detection library based on FastText, a library developed by Facebook. It is roughly 80x faster than conventional methods and delivers up to 95% accuracy.

It supports Python versions 3.9 to 3.12.
- Supports Python `3.9` to `3.12`.
- Works offline in low-memory mode.
- No `numpy` required (thanks to @dalf).

Support offline usage.
> ### Background
>
> This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark) with enhancements in packaging.
> For more information about the underlying model, see the official FastText documentation: [Language Identification](https://fasttext.cc/docs/en/language-identification.html).
This project builds upon [zafercavdar/fasttext-langdetect](https://github.com/zafercavdar/fasttext-langdetect#benchmark)
with enhancements in packaging.

For more information on the underlying FastText model, refer to the official
documentation: [FastText Language Identification](https://fasttext.cc/docs/en/language-identification.html).

> [!NOTE]
> This library requires over 200MB of memory to use in low memory mode.
> ### Possible memory usage
>
> *This library requires at least **200 MB** of memory in low-memory mode.*
## Installation 💻

@@ -40,20 +39,17 @@ pdm add fast-langdetect

## Usage 🖥️

For optimal performance and accuracy in language detection, use `detect(text, low_memory=False)` to load the larger
model.
In scenarios where accuracy matters, do not rely on the small model's predictions; pass `low_memory=False` to download and use the larger model!

> The model will be downloaded to the `/tmp/fasttext-langdetect` directory upon first use.
### Prerequisites

### Native API (Recommended)
- Remove any `\n` characters from the input string before calling the function.
- If the sample is too long or too short, accuracy drops (for example, very short Chinese text may be misidentified as Japanese).
- The model will be downloaded to the `/tmp/fasttext-langdetect` directory upon first use.

> [!NOTE]
> This function assumes to be given a single line of text. *You should remove `\n` characters before passing the text.*
> If the sample is too long or too short, the accuracy will decrease (for example, in the case of too short, Chinese
> will be predicted as Japanese).
### Native API (Recommended)

```python

from fast_langdetect import detect, detect_multilingual

# Single language detection
@@ -69,15 +65,17 @@ multiline_text = """
Hello, world!
This is a multiline text.
But we need to remove `\n` characters, or it will raise a ValueError.
REMOVE \n
"""
multiline_text = multiline_text.replace("\n", "")
print(detect(multiline_text))
# Output: {'lang': 'en', 'score': 0.8509423136711121}

print(detect("Привет, мир!")["lang"])
# Output: ru

# Multi-language detection
# Multi-language detection with low memory mode enabled
# Accuracy in low-memory mode is noticeably lower than with the large model
print(detect_multilingual("Hello, world!你好世界!Привет, мир!"))
# Output: [{'lang': 'ja', 'score': 0.32009604573249817}, {'lang': 'uk', 'score': 0.27781224250793457}, {'lang': 'zh', 'score': 0.17542070150375366}, {'lang': 'sr', 'score': 0.08751443773508072}, {'lang': 'bg', 'score': 0.05222449079155922}]

@@ -86,6 +84,10 @@ print(detect_multilingual("Hello, world!你好世界!Привет, мир!", low
# Output: [{'lang': 'ru', 'score': 0.39008623361587524}, {'lang': 'zh', 'score': 0.18235979974269867}, {'lang': 'ja', 'score': 0.08473210036754608}, {'lang': 'sr', 'score': 0.057975586503744125}, {'lang': 'en', 'score': 0.05422825738787651}]
```

#### Fallbacks

We provide a fallback mechanism: when `use_strict_mode=False`, if loading the **large model** (`low_memory=False`) fails, the library falls back to the offline **small model** to complete the prediction.

### Convenient `detect_language` Function

```python
@@ -135,4 +137,4 @@ models
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
```
59 changes: 59 additions & 0 deletions feature_test/lingua_t.py
@@ -0,0 +1,59 @@
from lingua import LanguageDetectorBuilder

from fast_langdetect import detect_language, detect_multilingual

low_mem_detector = (LanguageDetectorBuilder
                    .from_all_languages()
                    .with_low_accuracy_mode()
                    .with_preloaded_language_models()
                    .build())
detector = (LanguageDetectorBuilder
            .from_all_languages()
            .with_preloaded_language_models()
            .build())
ja_sentence = "こんにちは世界"
print(detect_language(ja_sentence))
print(low_mem_detector.detect_language_of(ja_sentence).iso_code_639_1.name)
print("===")
ko_sentence = "안녕하세요 세계"
print(detect_language(ko_sentence))
print(low_mem_detector.detect_language_of(ko_sentence).iso_code_639_1.name)
print("===")
fr_sentence = "Bonjour le monde"
print(detect_language(fr_sentence))
print(low_mem_detector.detect_language_of(fr_sentence).iso_code_639_1.name)
print("===")
de_sentence = "Hallo Welt"
print(detect_language(de_sentence))
print(low_mem_detector.detect_language_of(de_sentence).iso_code_639_1.name)
print("===")
zh_sentence = "這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等"
print(detect_language(zh_sentence))
print(low_mem_detector.detect_language_of(zh_sentence).iso_code_639_1.name)
print("===")
es_sentence = "Hola mundo"
print(detect_language(es_sentence))
print(low_mem_detector.detect_language_of(es_sentence).iso_code_639_1.name)
print("===")

sentence = "こんにちは世界"
for result in detector.detect_multiple_languages_of(sentence):
    print(result.language)
print("===")
sentence = """
こんにちは世界
안녕하세요 세계
Hallo Welt
這些機構主辦的課程,多以基本電腦使用為主,例如文書處理、中文輸入、互聯網應用等
Bonjour le monde
"""
langs = detect_multilingual(sentence.replace("\n", " "), low_memory=False)
for lang in langs:
    print(lang)
confidence_values = detector.compute_language_confidence_values(sentence)
for confidence in confidence_values:
    if confidence.value > 0:
        print(f"{confidence.language.iso_code_639_1.name}: {confidence.value:.2f}")
print("===")
for result in low_mem_detector.detect_multiple_languages_of(sentence):
    print(result.language.iso_code_639_1.name)
