nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about Data Augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Features

Generate synthetic data to improve model performance without manual effort
Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
Plug and play with any machine learning or neural network framework (e.g. scikit-learn, PyTorch, TensorFlow)
Supports textual and audio input
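As a minimal sketch of the "few lines of code" claim: an augmenter object exposes an `augment` method that takes text and returns augmented text. The helper name `augment_texts` below is hypothetical and not part of nlpaug. It assumes only that `augment` exists and that, depending on the installed release, it may hand back either a single string or a list of strings. `build_keyboard_augmenter` imports nlpaug lazily, so the helper itself runs without the library installed.

```python
def augment_texts(augmenter, texts):
    """Call `augmenter.augment` on each text and return one flat list of strings."""
    results = []
    for text in texts:
        out = augmenter.augment(text)
        # Assumption: the return type varies by release (str or list of str),
        # so both shapes are normalised into a flat list here.
        if isinstance(out, str):
            results.append(out)
        else:
            results.extend(out)
    return results


def build_keyboard_augmenter():
    """Create a KeyboardAug with default settings (requires nlpaug to be installed)."""
    import nlpaug.augmenter.char as nac  # imported lazily on purpose
    return nac.KeyboardAug()
```

With nlpaug installed, `augment_texts(build_keyboard_augmenter(), ["The quick brown fox"])` would return a list containing one augmented sentence. The output is random, so no specific result is shown.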

Textual Data Augmentation Example

Acoustic Data Augmentation Example

Section	Description
Quick Demo	How to use this library
Augmenter	Introduce all available augmentation methods
Installation	How to install this library
Recent Changes	Latest enhancement
Extension Reading	More real-life examples and research
Reference	Reference of external resources such as data or model

Quick Demo

Quick Example
Example of Augmentation for Textual Inputs
Example of Augmentation for Multilingual Textual Inputs
Example of Augmentation for Spectrogram Inputs
Example of Augmentation for Audio Inputs
Example of Orchestrating Multiple Augmenters
Example of Showing Augmentation History
How to train TF-IDF model
How to train LAMBADA model
How to create custom augmentation
API Documentation

Augmenter

Input type	Target	Augmenter	Action	Description
Textual	Character	KeyboardAug	substitute	Simulate keyboard distance error
Textual		OcrAug	substitute	Simulate OCR engine error
Textual		RandomAug	insert, substitute, swap, delete	Apply augmentation randomly
Textual	Word	AntonymAug	substitute	Substitute opposite meaning word according to WordNet antonym
Textual		ContextualWordEmbsAug	insert, substitute	Feed the surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation
Textual		RandomWordAug	swap, crop, delete	Apply augmentation randomly
Textual		SpellingAug	substitute	Substitute word according to spelling mistake dictionary
Textual		SplitAug	split	Split one word to two words randomly
Textual		SynonymAug	substitute	Substitute similar word according to WordNet/ PPDB synonym
Textual		TfIdfAug	insert, substitute	Use TF-IDF to find out how word should be augmented
Textual		WordEmbsAug	insert, substitute	Leverage word2vec, GloVe or fasttext embeddings to apply augmentation
Textual		BackTranslationAug	substitute	Leverage two translation models for augmentation
Textual		ReservedAug	substitute	Replace reserved words
Textual	Sentence	ContextualWordEmbsForSentenceAug	insert	Insert sentence according to XLNet, GPT2 or DistilGPT2 prediction
Textual		AbstSummAug	substitute	Summarize article by abstractive summarization method
Textual		LambadaAug	substitute	Using language model to generate text and then using classification model to retain high quality results
Signal	Audio	CropAug	delete	Delete audio's segment
Signal		LoudnessAug	substitute	Adjust audio's volume
Signal		MaskAug	substitute	Mask audio's segment
Signal		NoiseAug	substitute	Inject noise
Signal		PitchAug	substitute	Adjust audio's pitch
Signal		ShiftAug	substitute	Shift time dimension forward/ backward
Signal		SpeedAug	substitute	Adjust audio's speed
Signal		VtlpAug	substitute	Change vocal tract
Signal		NormalizeAug	substitute	Normalize audio
Signal		PolarityInverseAug	substitute	Swap positive and negative for audio
Signal	Spectrogram	FrequencyMaskingAug	substitute	Set block of values to zero according to frequency dimension
Signal		TimeMaskingAug	substitute	Set block of values to zero according to time dimension
Signal		LoudnessAug	substitute	Adjust volume
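Each value in the Target column corresponds to a submodule of `nlpaug.augmenter`. The sketch below records that mapping and resolves an augmenter class by name. The helper names `module_path_for` and `load_augmenter_class` are hypothetical conveniences, not part of nlpaug, and the class lookup assumes the name you pass matches the class actually exported by that module.

```python
import importlib

# Import path for each "Target" value in the augmenter table.
TARGET_MODULES = {
    "Character": "nlpaug.augmenter.char",
    "Word": "nlpaug.augmenter.word",
    "Sentence": "nlpaug.augmenter.sentence",
    "Audio": "nlpaug.augmenter.audio",
    "Spectrogram": "nlpaug.augmenter.spectrogram",
}


def module_path_for(target):
    """Return the nlpaug module path for a Target column value."""
    try:
        return TARGET_MODULES[target]
    except KeyError:
        raise ValueError(f"Unknown target: {target!r}") from None


def load_augmenter_class(target, class_name):
    """Import the module on demand and return the class (requires nlpaug)."""
    module = importlib.import_module(module_path_for(target))
    return getattr(module, class_name)
```

For example, with nlpaug installed, `load_augmenter_class("Character", "OcrAug")()` would create an OCR-error augmenter with default settings.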

Flow

Type	Flow	Description
Pipeline	Sequential	Apply list of augmentation functions sequentially
Pipeline	Sometimes	Apply some augmentation functions randomly
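To illustrate the difference between the two flows, here is a conceptual sketch in plain Python. It is not nlpaug's implementation: augmenters are modelled as plain callables, and the names `sequential`, `sometimes`, `aug_p` and `rng` are assumptions made for this illustration. In the library itself the corresponding classes are `nlpaug.flow.Sequential` and `nlpaug.flow.Sometimes`, each constructed from a list of augmenter objects.

```python
import random


def sequential(augmenters):
    """Return a function that applies every augmenter in order."""
    def run(text):
        for aug in augmenters:
            text = aug(text)
        return text
    return run


def sometimes(augmenters, aug_p=0.8, rng=random):
    """Return a function that applies each augmenter with probability `aug_p`."""
    def run(text):
        for aug in augmenters:
            # rng.random() is in [0, 1), so aug_p=1.0 always applies
            # and aug_p=0.0 never applies.
            if rng.random() < aug_p:
                text = aug(text)
        return text
    return run
```

A sequential flow is the right choice when every step must run, for example a character-level change followed by a word-level change. A probabilistic flow yields more varied output, because each call may skip some of the steps.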

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install numpy requests nlpaug

or install the latest version (including BETA features) directly from GitHub:

pip install numpy git+https://github.com/makcedward/nlpaug.git

or install via conda:

conda install -c makcedward nlpaug

If you use BackTranslationAug, ContextualWordEmbsAug, ContextualWordEmbsForSentenceAug or AbstSummAug, install the following dependencies as well (the version specifiers are quoted so the shell does not treat ">" as a redirect):

pip install "torch>=1.6.0" "transformers>=4.0.0" sentencepiece

If you use LambadaAug, install the following dependency as well:

pip install "simpletransformers>=0.61.10"

If you use AntonymAug or SynonymAug, install the following dependency as well:

pip install "nltk>=3.4.5"

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first:

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

If you use SynonymAug (PPDB), download the file from the following URI. You may not be able to run the augmenter if you get the PPDB file from another website.

http://paraphrase.org/#/download

If you use PitchAug, SpeedAug or VtlpAug, install the following dependencies as well:

pip install "librosa>=0.7.1" matplotlib
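The optional dependencies listed above can be checked before constructing an augmenter. The following is a minimal sketch using only the standard library. The names `OPTIONAL_DEPENDENCIES` and `missing_dependencies` are hypothetical, the package lists are copied from this section, and the check only tests whether a package is importable, not whether it meets the minimum version.

```python
from importlib.util import find_spec

# Extra packages per augmenter, as listed in the installation notes.
OPTIONAL_DEPENDENCIES = {
    "BackTranslationAug": ["torch", "transformers", "sentencepiece"],
    "ContextualWordEmbsAug": ["torch", "transformers", "sentencepiece"],
    "ContextualWordEmbsForSentenceAug": ["torch", "transformers", "sentencepiece"],
    "AbstSummAug": ["torch", "transformers", "sentencepiece"],
    "LambadaAug": ["simpletransformers"],
    "AntonymAug": ["nltk"],
    "SynonymAug": ["nltk"],
    "PitchAug": ["librosa", "matplotlib"],
    "SpeedAug": ["librosa", "matplotlib"],
    "VtlpAug": ["librosa", "matplotlib"],
}


def _is_installed(module_name):
    """True if the top-level module can be found on the import path."""
    return find_spec(module_name) is not None


def missing_dependencies(augmenter_name, is_installed=_is_installed):
    """Return the extra packages for `augmenter_name` that are not importable."""
    required = OPTIONAL_DEPENDENCIES.get(augmenter_name, [])
    return [name for name in required if not is_installed(name)]
```

Calling `missing_dependencies("VtlpAug")` returns an empty list when both librosa and matplotlib are importable. Augmenters without extra requirements, such as KeyboardAug, always yield an empty list.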

Recent Changes

1.1.8dev, Aug, 2021

Added RandomSentAug
Added skip_check parameter for WordEmbsAug

See changelog for more details.

Extension Reading

Reference

This library uses data (e.g. captured from the internet), research (e.g. the ideas behind the augmenters) and models (e.g. pre-trained models). See data source for more details.

Citing

@misc{ma2019nlpaug,
  title={NLP Augmentation},
  author={Edward Ma},
  howpublished={https://github.com/makcedward/nlpaug},
  year={2019}
}

Books citing nlpaug

S. Vajjala, B. Majumder, A. Gupta and H. Surana. Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems. 2020
A. Bartoli and A. Fusiello. Computer Vision–ECCV 2020 Workshops. 2020

Research papers citing nlpaug

M. Raghu and E. Schmidt. A Survey of Deep Learning for Scientific Discovery. 2020
H. Guan, J. Li, H. Xu and M. Devarakonda. Robustly Pre-trained Neural Model for Direct Temporal Relation Extraction. 2020
X. He, K. Zhao and X. Chu. AutoML: A Survey of the State-of-the-Art. 2020
S. Illium, R. Muller, A. Sedlmeier and C. Linnhoff-Popien. Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms. 2020
D. Niederhut. A Python package for text data enrichment. 2020
P. Ryan, S. Takafuji, C. Yang, N. Wilson and C. McBride. Using Self-Supervised Learning of Birdsong for Downstream Industrial Audio Classification. 2020
Z. Shao, J. Yang and S. Ren. Calibrating Deep Neural Network Classifiers on Out-of-Distribution Datasets. 2020
S. Qiu, B. Xu, J. Zhang, Y. Wang, X. Shen, G. D. Melo, C. Long and X. Li. EasyAug: An Automatic Textual Data Augmentation Platform for Classification Tasks. 2020
D. Nguyen, Q. H. Nguyen, M. Dao, D. Dang-Nguyen, C. Gurrin and B. T. Nguyen. Duplicate Identification Algorithms in SaaS Platforms. 2020
A. Ollagnier and H. Williams. Text Augmentation Techniques for Clinical Case Classification. 2020
V. Atliha and D. Šešok. Text Augmentation Using BERT for Image Captioning. 2020
Y. Ma, X. Xu, and Y. Li. LungRN+NL: An Improved Adventitious Lung Sound Classification Using non-local block ResNet Neural Network with Mixup Data Augmentation. 2020
S. N. Zisad, M. Shahadat and K. Andersson. Speech emotion recognition in neurological disorders using Convolutional Neural Network. 2020
M. Bhange and N. Kasliwal. HinglishNLP: Fine-tuned Language Models for Hinglish Sentiment Detection. 2020
T. Deruyttere, S. Vandenhende, D. Grujicic, Y. Liu, L. V. Gool, M. Blaschko, T. v and M. Moens. Commands 4 Autonomous Vehicles (C4AV) Workshop Summary. 2020
A. Tamkin, M. Wu and N. Goodman. Viewmaker Networks: Learning Views for Unsupervised Representation Learning. 2020
A. Spiegel, V. Cheong, J. E. Kaplan and A. Sanchez. MK-SQUIT: Synthesizing Questions using Iterative Template-Filling. 2020
C. Zuo, N. Acharya and R. Banerjee. Querying Across Genres for Medical Claims in News. 2020
A. Sengupta. DATAMAFIA at WNUT-2020 Task 2: A Study of Pre-trained Language Models along with Regularization Techniques for Downstream Tasks. 2020
V. Awatramani and A. Kumar. Linguist Geeks on WNUT-2020 Task 2: COVID-19 Informative Tweet Identification using Progressive Trained Language Models and Data Augmentation. 2020
S. Gerani, R. Tissot, A. Ying, J. Redmon, A. Rimando and R. Hun. Reducing suicide contagion effect by detecting sentences from media reports with explicit methods of suicide. 2020
B. Velichkov, S. Gerginov, P. Panayotov, S. Vassileva, G. Velchev, I. Koyche and S. Boytcheva. Automatic ICD-10 codes association to diagnosis: Bulgarian case. 2020
T. Li, X. Chen, S. Zhang, Z. Dong and K. Keutzer. Cross-Domain Sentiment Classification with In-Domain Contrastive Learning. 2020
J. Mizgajski, A. Szymczak, M. Morzy, Ł. Augustyniak, P. Szymański and P. Żelasko. Return on Investment in Machine Learning: Crossing the Chasm between Academia and Business
K. Goel, N. Rajani, J. Vig, S. Tan, J. Wu, S. Zheng, C. Xiong, M. Bansal and C. Ré. Robustness Gym: Unifying the NLP Evaluation Landscape. 2021
M. Xu, F. Zhang, X. Cui and W. Zhang. Speech Emotion Recognition with Multiscale Area Attention and Data Augmentation. 2021
M. Ciolino, D. Noever and J. Kalin. Multilingual Augmenter: The Model Chooses. 2021
F. D. Pereira, F. Pires, S. C. Fonseca, E. H. T. Oliveira, L. S. G. Carvalho, D. B. F. Oliveira and A. I. Cristea. Towards a Human-AI Hybrid System for Categorising Programming Problems. 2021
D. Zhang, F. Nan, X. Wei, D. Li, H. Zhu, K. McKeown, R. Nallapati, A. Arnold and B. Xiang. Supporting Clustering with Contrastive Learning. 2021
L. Zhu and T. Gosakti. Augmenting Harper Valley Bank: Robust Automatic Speech Recognition. 2021
P. Ruas, V. D. T. Andrade and F. M. Couto. Lasige-BioTM at ProfNER: BiLSTM-CRF and contextual Spanish embeddings for Named Entity Recognition and Tweet Binary Classification. 2021
V. d Pimpalkhute, P. Nakhate and T. Diwan. IIITN NLP at SMM4H 2021 Tasks: Transformer Models for Classification of Health-Related Tweets. 2021
A. F. Aji, H. A. Wibowo, M. N. Nityasya, R. E. Prasojo and T. N. Fatyanosa. BERT Goes Brrr: A Venture Towards the Lesser Error in Classifying Medical Self-Reporters on Twitter. 2021
V. Kovatchev, P. Smith, M. Lee and R. Devin. Can vectors read minds better than experts? Comparing data augmentation strategies for the automated scoring of children’s mindreading ability. 2021
D. R. Beddiar, M. S. Jahan and M. Oussalah. Data Expansion using Back Translation and Paraphrasing for Hate Speech Detection. 2021
Y. Hirota, N. Garcia, M. Otani, C. Chu, Y. Nakashima, I. Taniguchi and T. Onoye. A Picture May Be Worth a Hundred Words for Visual Question Answering
Z. Hu and Z. Wang. Mining Consumer Brand Relationship from Social Media Data: A Natural Language Processing Approach

Projects citing nlpaug

D. Garcia-Olano and A. Jain. Generating Counterfactual Explanations using Reinforcement Learning Methods for Tabular and Text data. 2019
L. Yi. Avengers: Achieving Superhuman Performance for Question Answering on SQuAD 2.0 Using Multiple Data Augmentations, Randomized Mini-Batch Training and Architecture Ensembling. 2020

Contributions

sakares saengkaew
Binoy Dalal
