GitHub - racai-ai/Rodna: Romanian Deep Neural Network Architectures project

RODNA

ROmanian Deep Neural networks Architectures (RODNA) is a Python 3/PyTorch project whose declared goal is to obtain better results on Romanian text processing than generic, language-independent ML toolkits by using Romanian-specific features.

Performance

Here are the accuracy figures of Rodna for sentence splitting, POS tagging and dependency parsing.

Training data

The training data is the latest version of the Romanian RRT UD corpus, available at UD_Romanian-RRT.

The latest training data is pushed to this repository. If you want to generate fresh training data, run python3 rrt_generate.py. Before doing so, read the comments preceding the function generate_ssplit_rrt_training(in_file: str, out_file: str) in rrt_generate.py. The folder UD_Romanian-RRT must be available at ../UD/UD_Romanian-RRT, relative to the folder containing this file.
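Before regenerating the data, it can help to confirm that the corpus folder is where the script expects it. The following is a minimal sketch, assuming rrt_generate.py sits in the repository root; the helper name expected_rrt_folder is hypothetical and not part of Rodna.

```python
from pathlib import Path


def expected_rrt_folder(script_path: Path) -> Path:
    """Return ../UD/UD_Romanian-RRT relative to the folder containing script_path.

    Hypothetical helper for illustration; it is not part of the Rodna code base.
    """
    script_folder = script_path.parent
    # '..' relative to the script folder is the parent of that folder.
    return script_folder.parent / "UD" / "UD_Romanian-RRT"


if __name__ == "__main__":
    # Assumption: the command is run from the folder that holds rrt_generate.py.
    folder = expected_rrt_folder(Path("rrt_generate.py").resolve())
    status = "found" if folder.is_dir() else "missing"
    print(f"{folder}: {status}")
```

If the folder is reported as missing, place the UD_Romanian-RRT checkout next to the repository, under a sibling folder named UD.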

Sentence splitter

A neural network with a Bi-LSTM over frozen BERT embeddings that performs sentence splitting: it classifies each token as 'end of sentence' or 'not end of sentence'.

Precision on 'end of sentence' label is 99.62% on the dev split of RRT.
Recall on 'end of sentence' label is 99.41% on the dev split of RRT.
F1 on 'end of sentence' label is 99.52% on the dev split of RRT.

Precision on 'end of sentence' label is 99.65% on the test split of RRT.
Recall on 'end of sentence' label is 99.38% on the test split of RRT.
F1 on 'end of sentence' label is 99.52% on the test split of RRT.
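To make the per-token formulation concrete, here is a minimal sketch of how 'end of sentence' decisions turn into sentences. It only illustrates the grouping step: the function name split_sentences and the hand-written labels are assumptions, and in Rodna the labels would come from the neural classifier.

```python
from typing import List


def split_sentences(tokens: List[str], is_eos: List[bool]) -> List[List[str]]:
    """Group tokens into sentences, closing a sentence after each 'end of sentence' token."""
    if len(tokens) != len(is_eos):
        raise ValueError("tokens and is_eos must have the same length")

    sentences: List[List[str]] = []
    current: List[str] = []
    for token, eos in zip(tokens, is_eos):
        current.append(token)
        if eos:
            sentences.append(current)
            current = []
    # Keep trailing tokens that were never closed by an 'end of sentence' label.
    if current:
        sentences.append(current)
    return sentences


# Hand-written labels for illustration only (not model output).
tokens = ["Aceasta", "este", "o", "propoziție", ".", "Aceasta", "este", "alta", "."]
labels = [False, False, False, False, True, False, False, False, True]
for sentence in split_sentences(tokens, labels):
    print(" ".join(sentence))
```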

Romanian morphology

An LSTM neural network that learns the mapping from a word form to its possible MSDs (morphosyntactic descriptors). It works on character embeddings of the input word, from left to right.

Precision on MSDs that are in the word's ambiguity class is 95.14%.
Recall of MSDs that are in the word's ambiguity class is 92.66%.
F1 of the above is 93.88%.
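The F1 figure is the harmonic mean of precision and recall, which can be checked directly from the two numbers above. A minimal sketch (the function name f1_score is chosen here for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the morphology model.
print(round(f1_score(95.14, 92.66), 2))  # prints 93.88
```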

POS tagger

A Bi-LSTM-CRF head over a BERT embedding to get coarse-grained POS tags coupled with a Bi-LSTM head over another BERT embedding to get the MSD of the current word, given its coarse-grained POS tag. The POS tagger uses Romanian-specific features, extracted beforehand from the input sentence.

With coarse-grained to fine-grained mapping (called "tiered tagging"):
Accuracy on fine-grained POS tags (MSDs) of the dev set is 98.10%.
Accuracy on fine-grained POS tags (MSDs) of the test set is 97.39%.

Without tiered tagging (roughly 10 times faster):
Accuracy on fine-grained POS tags (MSDs) of the dev set is 98.06%.
Accuracy on fine-grained POS tags (MSDs) of the test set is 97.54%.
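The tiered idea can be pictured as a two-step choice: first pick a coarse-grained tag, then choose only among the MSDs compatible with it. The sketch below is an illustration under stated assumptions, not Rodna's implementation: the coarse tag names, the COARSE_TO_MSDS table, the scores and the function tiered_choice are all hypothetical, and the real tagger uses neural heads rather than a lookup table.

```python
from typing import Dict, List

# Hypothetical coarse-to-fine table, for illustration only.
COARSE_TO_MSDS: Dict[str, List[str]] = {
    "NOUN": ["Ncfsrn", "Ncmsrn"],
    "VERB": ["Vmip3s", "Vmip3p"],
}


def tiered_choice(coarse_tag: str, msd_scores: Dict[str, float]) -> str:
    """Pick the best-scoring MSD among those compatible with the coarse tag."""
    candidates = [m for m in COARSE_TO_MSDS.get(coarse_tag, []) if m in msd_scores]
    if not candidates:
        # Fall back to the overall best MSD when the table offers no candidate.
        return max(msd_scores, key=msd_scores.get)
    return max(candidates, key=msd_scores.get)


# Made-up scores: the verb reading scores highest overall, but the
# coarse tag restricts the choice to noun MSDs.
scores = {"Vmip3s": 0.5, "Ncfsrn": 0.3, "Ncmsrn": 0.2}
print(tiered_choice("NOUN", scores))
```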

UD dependency parser

An LSTM head finder over BERT embeddings and a GRU dependency labeler over BERT embeddings; the labeler labels root-to-leaf paths in the unlabeled tree.

UAS/LAS on the dev set: 92.45%/88.07%.
UAS/LAS on the test set: 92.35%/87.65%.

Accuracy on finding the correct head of a token: 92.45%
Accuracy on correctly labeling a dependency relation: 92.92%
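For reference, UAS counts tokens whose predicted head is correct, while LAS additionally requires the dependency label to be correct. A minimal sketch of the two metrics follows; the function name attachment_scores and the toy arcs are assumptions made for illustration and are not taken from RRT.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for aligned gold and predicted arcs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")

    # UAS: the head matches. LAS: both head and label match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * full_hits / total


# Toy arcs, hand-written for illustration.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "nmod")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f}% LAS={las:.2f}%")  # prints UAS=100.00% LAS=75.00%
```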

HOWTO

Install RODNA via pip:

pip install rodna

Use class RodnaProcessor to process raw texts and output them in the CoNLL-U format:

from rodna.api import RodnaProcessor
from conllu.models import SentenceList

rodna = RodnaProcessor()
# Output is written to path/to/file.rodna.conllu
# So .txt is replaced by .rodna.conllu
rodna.process_text_file(txt_file='path/to/file.txt')
# Returns a list of sentences in the CoNLL-U format
list_of_sentences: SentenceList = rodna.process_text(text='Aceasta este o propoziție.')
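The .rodna.conllu file written by process_text_file is in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with #, and a blank line between sentences. The following is a minimal, standard-library-only sketch for reading such a file back. The function read_conllu and the SAMPLE text are hypothetical; the sample is hand-written with only the ID and FORM columns filled in, and is not actual Rodna output.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U text into a list of sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # skip sentence-level comments
        columns = line.split("\t")
        if len(columns) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(columns)}: {line!r}")
        current.append(dict(zip(CONLLU_FIELDS, columns)))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample: only ID and FORM are filled in, the rest are '_'.
forms = ["Aceasta", "este", "o", "propoziție", "."]
SAMPLE = "# text = Aceasta este o propoziție.\n" + "\n".join(
    f"{i}\t{form}" + "\t_" * 8 for i, form in enumerate(forms, start=1)
) + "\n\n"

for sentence in read_conllu(SAMPLE):
    print([token["form"] for token in sentence])
```

For anything beyond a quick inspection, the conllu package already imported in the example above is likely the better choice.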

Rodna resources will be downloaded once, with the first call to RodnaProcessor(). If you want to pre-download the resources, do this:

import rodna

rodna.download_resources()
