Commit fef4488

Merge pull request #3 from TartuNLP/dev
Updated to estnltk 1.7; introduced pip package; improved code readability and preprocessing rules; added initial accessibility mode
2 parents 90f99e8 + 0dfb05d commit fef4488

10 files changed

+11100
-653
lines changed

.gitignore

Lines changed: 5 additions & 1 deletion
@@ -1,2 +1,6 @@
+build
 /.idea/
-__pycache__/
+__pycache__/
+tts_preprocess_et/__pycache__/
+tts_preprocess_et.egg-info/
+test.py

MANIFEST.in

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+include tts_preprocess_et/data/names.txt

assets.py

Lines changed: 0 additions & 278 deletions
This file was deleted.

readme.md

Lines changed: 33 additions & 7 deletions
@@ -7,24 +7,41 @@ The script follows the rules of Estonian orthography. Internationally used forms
 - Ranges use a dash (not a hyphen).
 - Long numbers are grouped by spaces (not commas or dots)
 - Dashes between numbers that are separated by spaces are considered to be minuses, otherwise they are ranges
-- Decimal fractions use commas (not dots)
+- Decimal fractions use commas (not dots)
+
 
 ### Requirement:
-- Python 3
-- EstNLTK (>= 1.6.0)
+- Python (>= 3.7)
+- EstNLTK (>= 1.7.0)
 
-### Features
 
+### Usage
+Install from command line:\
+`pip install git+https://github.com/TartuNLP/tts_preprocess_et@main`
+
+Add to project (requirements.txt):\
+`git+https://github.com/TartuNLP/tts_preprocess_et@main`
+
+Import:\
+`from tts_preprocess_et.convert import convert_sentence`
+
+Processing a sentence:\
+`processed_sentence = convert_sentence(sentence_string)`
+
+Processing with accessibility mode:\
+`processed_sentence = convert_sentence(sentence_string, accessibility=True)`
 
-- [x] Converting Arabian numbers to words, including numbers that are grouped with spaces. Example: `10 000 → kümme
-tuhat`
+
+### Features
+
+- [x] Converting Arabian numbers to words, including numbers that are grouped with spaces. Example: `10 000 → kümme tuhat`
 - [x] Detecting ordinal numbers and converting them. Example: `1. → esimene`
 - [x] Detecting the correct case from a suffix. Example: `1-le → ühele`
 - [x] Numbers followed by a *-ne*/*-line* adjective. Example: `1-aastane → ühe aastane`
 - [x] Detecting the correct case from upcoming words and their forms. Example: `1 sõbrale → ühele sõbrale; 1 sõbraga → ühe sõbraga`
 - [x] Detecting the correct case from adpositions. Example: `üle 1 → üle ühe; 1 võrra → ühe võrra`
 - [x] Detecting and converting Roman numerals. Example: `I → esimene`
-- [x] Declining numbers in simple lists. Example: `1. ja 2. juunil → esimesel ja teisel juunil`
+- [x] Detecting numbers in simple lists. Examples: `1. ja 2. juunil → esimesel ja teisel juunil`, `II, III või IV liigast → teisest, kolmandast või neljandast liigast`
 - [x] Converting and declining audible symbols. Example: `% → protsent`
 - [x] Converting and declining common abbreviations. Example: `jne → ja nii edasi`
 - [x] Converting simple mathematic equations. Example: `1 + 1 = 2 → üks pluss üks võrdub kaks`
@@ -55,6 +72,15 @@ The script follows the rules of Estonian orthography. Internationally used forms
 `</del>` bee`
 - [ ] Censored words should not be interpreted as abbreviations. Example: `p***e → `<del>`punkt***ehk`</del>
 - [ ] Detecting abbreviated *-ne*/*-line* adjectives. Example: `5 km vahemaa → `<del>`viis kilomeetrit`</del>` viie kilomeetrine vahemaa`
+- [x] Detecting relative units. Example: `kg/m³ → kilogrammi kuupmeetri kohta`
 - [x] Numbers larger than 10^27
+- [x] Detecting and pronouncing alphanumeric codes. Example: `2KMc7hy → kaks, kaa, emm, tsee, seitse, haa, igrek`
 - [ ] Detection of which consonant combinations can be pronounced and which need to be spelled out letter by letter (`ERM` vs `ERR`). Useful for abbreviations and URLs.
 - [ ] Handling unmapped use cases: post-processing to make sure that all information that remains in the output is readable for speech synthesis and potentially creating a more robust mode where everything is always converted but disregarding the correct form.
+- Detecting and reading out brackets ('()', '[]', '{}') in a sentence. Example: `2. koht (hõbe) → Teine koht, sulgudes hõbe`
+
+
+Accessibility mode:
+- [x] Differentiating capital letters in alphanumeric codes. Example: `2KMc7hy → kaks, suur-täht-kaa, suur-täht-emm, tsee, seitse, haa, igrek`
+- [x] Reading out exclamation and question marks. Example: `Appi! → Appi hüüumärk`
+- [x] Reading out bracket endings. Example: `2. koht (hõbe) → Teine koht, sulgudes hõbe, sulu lõpp`
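
Reviewer sketch for the orthography rules listed at the top of the readme.md hunk (space-grouped long numbers, decimal commas, a spaced dash read as a minus, an unspaced dash read as a range). This is only an illustration of those rules, not the package's implementation; the names `classify_dash` and `parse_number` and the regular expressions are hypothetical.

```python
import re

# A number is either digits grouped in threes by single spaces ("10 000")
# or a plain digit run ("42"), optionally followed by a decimal comma part.
NUMBER = r"\d{1,3}(?: \d{3})+|\d+"
DECIMAL = rf"(?:{NUMBER})(?:,\d+)?"

# Hypothetical patterns: the en dash with surrounding spaces vs. without.
MINUS_PATTERN = re.compile(rf"({DECIMAL}) – ({DECIMAL})")
RANGE_PATTERN = re.compile(rf"({DECIMAL})–({DECIMAL})")


def classify_dash(text: str) -> str:
    """Return 'minus', 'range' or 'none' for a dash between two numbers."""
    if MINUS_PATTERN.search(text):
        return "minus"
    if RANGE_PATTERN.search(text):
        return "range"
    return "none"


def parse_number(token: str) -> float:
    """Parse an Estonian-style number such as '10 000' or '3,14'."""
    return float(token.replace(" ", "").replace(",", "."))
```

Under these assumptions, `classify_dash("5 – 3")` is treated as a subtraction and `classify_dash("5–3")` as a range, mirroring the readme's wording.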

setup.py

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+from setuptools import setup, find_packages
+
+setup(
+    name='tts_preprocess_et',
+    version='1.0.0',
+    packages=find_packages(),
+    license='MIT',
+    description='Preprocessing for Estonian text-to-speech applications',
+    long_description=open('readme.md').read(),
+    install_requires=['estnltk>=1.7.0'],
+    include_package_data=True,
+    url='https://github.com/TartuNLP/tts_preprocess_et',
+    author='TartuNLP',
+    author_email='[email protected]'
+)
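
Reviewer note on `long_description`: the readme contains non-ASCII text (for example `kümme tuhat`), so reading it with the platform default encoding may fail on some systems, and the bare `open(...)` leaves the file handle to be closed by the garbage collector. A possible alternative is sketched below, assuming the readme is stored as UTF-8; the helper name `read_long_description` is hypothetical and not part of this commit.

```python
from pathlib import Path


def read_long_description(path: str = "readme.md") -> str:
    """Read the readme as UTF-8; read_text closes the file when done."""
    return Path(path).read_text(encoding="utf-8")
```

In `setup.py` this would be used as `long_description=read_long_description()`.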

tts_preprocess_et/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+from .convert import convert_sentence
+import nltk
+try:
+    nltk.data.find('tokenizers/punkt_tab')
+except Exception:
+    nltk.download('punkt_tab')
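
Reviewer sketch of the check-then-download pattern used in `__init__.py`. `nltk.data.find` raises `LookupError` when a resource is missing, so the handler could catch that instead of the broad `Exception`. The helper `ensure_resource` below is hypothetical and takes the lookup and download functions as parameters so that it can be exercised without NLTK installed.

```python
from typing import Callable


def ensure_resource(path: str, package: str,
                    find: Callable[[str], object],
                    download: Callable[[str], object]) -> bool:
    """Download `package` only if `find(path)` raises LookupError.

    Returns True when a download was triggered, False otherwise.
    """
    try:
        find(path)
    except LookupError:
        download(package)
        return True
    return False


# With NLTK installed, the call corresponding to the commit would be:
# import nltk
# ensure_resource('tokenizers/punkt_tab', 'punkt_tab',
#                 nltk.data.find, nltk.download)
```

Narrowing the exception means unrelated failures (for example a typo in the resource path argument type) are not silently turned into a download attempt.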
