Intro

Rio (Spanish for "river") is a library that uses back-translation (round-trip translation) for text preprocessing, filtering, and augmentation. It is intended for processing text datasets used to train NLP models. Rio is based on the original Muliwai repo, but the PII code has been refactored into its own repo at https://www.github.com/piisa/muliwai. Rio no longer does PII processing; please use https://www.github.com/piisa/muliwai for that.
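
As a rough illustration of the round-trip idea, the sketch below translates a sentence into a pivot language and back, then keeps the sentence only if the result stays close to the original. This is not Rio's API: the names `round_trip`, `similarity`, `keep_sentence`, and `make_toy_translator` are hypothetical, the word-by-word dictionary stands in for a real translation model, and the 0.8 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

# A translator is any callable mapping a string to a string (hypothetical type alias).
Translator = Callable[[str], str]


def round_trip(text: str, forward: Translator, backward: Translator) -> str:
    """Translate text to a pivot language and back again."""
    return backward(forward(text))


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1] between two sentences."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def keep_sentence(text: str, forward: Translator, backward: Translator,
                  threshold: float = 0.8) -> bool:
    """Keep a sentence if its round-trip translation stays close to the original."""
    return similarity(text, round_trip(text, forward, backward)) >= threshold


def make_toy_translator(table: Dict[str, str]) -> Translator:
    """Build a word-by-word stand-in for a real MT model (illustration only)."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, "<unk>") for word in text.lower().split())
    return translate


# Toy English <-> Spanish vocabulary, assumed purely for the example.
EN_TO_ES = {"the": "el", "river": "río", "is": "es", "long": "largo"}
ES_TO_EN = {es: en for en, es in EN_TO_ES.items()}

to_es = make_toy_translator(EN_TO_ES)
to_en = make_toy_translator(ES_TO_EN)
```

With these toy translators, `keep_sentence("the river is long", to_es, to_en)` keeps the sentence, while a sentence made mostly of out-of-vocabulary words is filtered out. For augmentation rather than filtering, the output of `round_trip` would be kept as a paraphrase instead of being compared and discarded.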

Installing

If you want gender detection and coreference detection, you will need to install neuralcoref as shown below. Note that with neuralcoref you will only be able to use spaCy's English models. You can also load a larger spaCy model for better accuracy at the cost of more memory.

git clone https://github.com/ontocord/rio
pip install https://github.com/kpu/kenlm/archive/master.zip
pip install spacy==2.1.0 regex==2022.3.2 dateparser python-stdnum protobuf neuralcoref cdifflib transformers datasets langid faker sentencepiece fsspec tqdm sentence-transformers nltk
python -m nltk.downloader punkt wordnet
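
Because neuralcoref is only needed for gender and coreference detection, it can help to check which packages are importable before enabling those features. The following is a minimal sketch using only the standard library; the helper names `missing_modules` and `coref_available` are hypothetical and not part of Rio.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Top-level import names needed for coreference support (assumed from the pip command above).
COREF_MODULES = ["spacy", "neuralcoref"]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def coref_available() -> bool:
    """True if both spaCy and neuralcoref are installed."""
    return not missing_modules(COREF_MODULES)
```

Calling `missing_modules(COREF_MODULES)` after installation lists anything that still needs to be installed; an empty list means the optional coreference dependencies are present.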

License

Source code authored by Ontocord LLC and by contributors to this project is licensed under Apache 2.0.

Contributors

We welcome all contributions. Please feel free to send a PR, and please follow the code of conduct: https://github.com/ontocord/rio/blob/main/CODE_OF_CONDUCT.md

Special thanks to the following people from the original Muliwai repo, not just for code contributions but also for comments and reviews (in no particular order):

@dadelani
@edugp
@vumichien
@ianyu93
@j-chim
@justinphan3110
@mapama247
@paulovn
@PierreColombo
@piesauce
@mmitchellai
@shamikbose

Acknowledgements

We rely heavily on the models trained by @dadelani and the excellent work by https://github.com/masakhane-io.
