nlpaug

This Python library helps you augment NLP data for your machine learning projects. Visit this introduction to learn about Data Augmentation in NLP. An Augmenter is the basic element of augmentation, while a Flow is a pipeline that orchestrates multiple augmenters together.

Features

Generate synthetic data for improving model performance without manual effort
Simple, easy-to-use and lightweight library. Augment data in 3 lines of code
Plug and play with any neural network framework (e.g. PyTorch, TensorFlow)
Supports textual and audio input

Textual Data Augmentation Example

Acoustic Data Augmentation Example

Section	Description
Quick Demo	How to use this library
Augmenter	Introduce all available augmentation methods
Installation	How to install this library
Recent Changes	Latest enhancements
Extension Reading	More real-life examples and research
Reference	References to external resources such as data or models

Quick Demo

Augmenter

Type	Target	Augmenter	Action	Description
Textual	Character	KeyboardAug	substitute	Simulate keyboard distance error
Textual		OcrAug	substitute	Simulate OCR engine error
Textual		RandomCharAug	insert, substitute, swap, delete	Apply augmentation randomly
Textual	Word	AntonymAug	substitute	Substitute opposite meaning word according to WordNet antonym
Textual		ContextualWordEmbsAug	insert, substitute	Feed surrounding words to a BERT, DistilBERT, RoBERTa or XLNet language model to find the most suitable word for augmentation
Textual		RandomWordAug	swap, delete	Apply augmentation randomly
Textual		SpellingAug	substitute	Substitute word according to spelling mistake dictionary
Textual		SplitAug	split	Split one word into two words randomly
Textual		SynonymAug	substitute	Substitute similar word according to WordNet/PPDB synonym
Textual		TfIdfAug	insert, substitute	Use TF-IDF to find out how word should be augmented
Textual		WordEmbsAug	insert, substitute	Leverage word2vec, GloVe or fasttext embeddings to apply augmentation
Textual	Sentence	ContextualWordEmbsForSentenceAug	insert	Insert sentence according to XLNet, GPT2 or DistilGPT2 prediction
Signal	Audio	CropAug	delete	Delete audio's segment
Signal		LoudnessAug	substitute	Adjust audio's volume
Signal		MaskAug	substitute	Mask audio's segment
Signal		NoiseAug	substitute	Inject noise
Signal		PitchAug	substitute	Adjust audio's pitch
Signal		ShiftAug	substitute	Shift time dimension forward/ backward
Signal		SpeedAug	substitute	Adjust audio's speed
Signal		VtlpAug	substitute	Change vocal tract
Signal	Spectrogram	FrequencyMaskingAug	substitute	Set block of values to zero according to frequency dimension
Signal		TimeMaskingAug	substitute	Set block of values to zero according to time dimension
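
To make the two spectrogram rows concrete, here is a minimal pure-Python sketch of block masking. It is not nlpaug's implementation: the function name mask_block, its parameters, and the list-of-rows spectrogram layout (frequency x time) are all assumptions chosen for illustration.

```python
import random

def mask_block(spec, axis, width, rng=None):
    """Return a copy of spec with one contiguous block set to zero.

    spec is a list of rows laid out as frequency x time (hypothetical layout).
    axis=0 zeroes `width` consecutive frequency rows,
    axis=1 zeroes `width` consecutive time columns.
    """
    rng = rng or random
    n_freq, n_time = len(spec), len(spec[0])
    size = n_freq if axis == 0 else n_time
    width = min(width, size)
    start = rng.randint(0, size - width)  # inclusive on both ends

    out = [row[:] for row in spec]  # copy so the input stays untouched
    for f in range(n_freq):
        for t in range(n_time):
            idx = f if axis == 0 else t
            if start <= idx < start + width:
                out[f][t] = 0.0
    return out

# Toy spectrogram: 4 frequency bins x 6 time frames, all ones.
spec = [[1.0] * 6 for _ in range(4)]
rng = random.Random(0)

freq_masked = mask_block(spec, axis=0, width=2, rng=rng)  # frequency masking
time_masked = mask_block(spec, axis=1, width=3, rng=rng)  # time masking

for row in freq_masked:
    print(row)
```

Passing a seeded random.Random keeps the masked block reproducible, which is convenient when debugging an augmentation pipeline.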

Flow

Type	Flow	Description
Pipeline	Sequential	Apply list of augmentation functions sequentially
Pipeline	Sometimes	Apply some augmentation functions randomly
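
The difference between the two flows can be sketched in plain Python. The classes below are illustrative stand-ins, not nlpaug's own classes: the names SequentialSketch and SometimesSketch, the parameter p, and the choice to draw one random number per augmenter are assumptions.

```python
import random

class SequentialSketch:
    """Apply every augmentation function, in order."""

    def __init__(self, augmenters):
        self.augmenters = list(augmenters)

    def augment(self, data):
        for aug in self.augmenters:
            data = aug(data)
        return data

class SometimesSketch:
    """Apply each augmentation function independently with probability p."""

    def __init__(self, augmenters, p=0.5, rng=None):
        self.augmenters = list(augmenters)
        self.p = p
        self.rng = rng or random.Random()

    def augment(self, data):
        for aug in self.augmenters:
            # random() is in [0, 1), so p=1.0 always applies and p=0.0 never does.
            if self.rng.random() < self.p:
                data = aug(data)
        return data

def drop_last_word(text):
    """Toy augmentation: delete the final word."""
    return " ".join(text.split()[:-1])

steps = [str.lower, drop_last_word]

always = SequentialSketch(steps).augment("The Quick Brown Fox")
maybe = SometimesSketch(steps, p=0.5, rng=random.Random(1)).augment("The Quick Brown Fox")

print(always)  # both steps applied
print(maybe)   # a random subset of the steps applied
```

Here the augmenters are plain callables; in the library each step would be an Augmenter object instead.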

Installation

The library supports Python 3.5+ on Linux and Windows.

To install the library:

pip install nlpaug numpy matplotlib python-dotenv

or install the latest version (includes BETA features) directly from GitHub

pip install git+https://github.com/makcedward/nlpaug.git numpy matplotlib python-dotenv

If you use ContextualWordEmbsAug or ContextualWordEmbsForSentenceAug, install the following dependencies as well

pip install "torch>=1.2.0" "transformers>=2.5.0"

If you use AntonymAug or SynonymAug, install the following dependencies as well

pip install "nltk>=3.4.5"

If you use WordEmbsAug (word2vec, GloVe or fasttext), download a pre-trained model first

from nlpaug.util.file.download import DownloadUtil
DownloadUtil.download_word2vec(dest_dir='.') # Download word2vec model
DownloadUtil.download_glove(model_name='glove.6B', dest_dir='.') # Download GloVe model
DownloadUtil.download_fasttext(model_name='wiki-news-300d-1M', dest_dir='.') # Download fasttext model

If you use SynonymAug (PPDB), download the file from the following URI. You may not be able to run the augmenter if you get the PPDB file from another website

http://paraphrase.org/#/download

If you use any of the audio augmenters, install the following dependencies as well

pip install "librosa>=0.7.1"

Recent Changes

0.0.14 Apr 24, 2020

Remove QWERTAug example (Replaced by KeyboardAug) #110
Fix #117, #114, #111, #105
Support Change Log [#116] (makcedward#117)
Fix typo #123
Support accepting candidates in RandomCharAug #125

0.0.13 Feb 25, 2020

Fix spectrogram tutorial notebook #98
Fix RandomWordAug missed aug_max parameter #100
Fix loading KeyboardAug model problem #101
Fix performance issue when sampling candidate in ContextualWordEmbsAug and ContextualWordEmbsForSentenceAug #107

See changelog for more details.

Extension Reading

Reference

This library uses data (e.g. captured from the internet), research (e.g. following augmenter ideas) and models (e.g. using pre-trained models). See data source for more details.

Citing

@misc{ma2019nlpaug,
  title={NLP Augmentation},
  author={Edward Ma},
  howpublished={\url{https://github.com/makcedward/nlpaug}},
  year={2019}
}

Contributions (Supporting Other Languages)

sakares: Add Thai support to KeyboardAug

About

Data augmentation for NLP

https://makcedward.github.io/

MIT License
