zarnitsa package

Zarnitsa package with data augmentation tools.

Internal data augmentation using existing data
External data augmentation using known statistical distributions that you specify yourself
NLP augmentation

Principal scheme of project (currently)

Requirements

Python3
numpy
pandas
nlpaug
wget
scikit-learn

Installation

Install the package from PyPI:

pip install zarnitsa

Or install the current version from the GitHub repository:

pip install git+https://github.com/AlexKay28/zarnitsa
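
To confirm that the installation worked, you can ask Python which version it sees. This is a minimal sketch using only the standard library (importlib.metadata, Python 3.8+). The helper name installed_version is made up for illustration and is not part of zarnitsa.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report whether zarnitsa is visible to the current interpreter.
version = installed_version("zarnitsa")
if version is None:
    print("zarnitsa is not installed in this environment")
else:
    print(f"zarnitsa {version} is installed")
```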

Usage

Simple usage examples:

Augmentation Internal

This type of augmentation is useful when working with numerical features.

>>> import pandas as pd
>>> from zarnitsa.DataAugmenterInternally import DataAugmenterInternally

>>> daug_comb = DataAugmenterInternally()
>>> aug_types = [
...     "normal",
...     "uniform",
...     "permutations",
... ]

>>> # pd.Series example: augment half of the values with each strategy
>>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
>>> for aug_type in aug_types:
...     print(aug_type)
...     print(daug_comb.augment_column(s, freq=0.5, return_only_aug=True, aug_type=aug_type))
normal
7     9.958794
3     0.057796
0    -3.135995
6     7.197400
8    13.258087
dtype: float64
uniform
2    10.972232
8     5.335357
9     9.111281
5     5.964971
4    -0.210732
dtype: float64
permutations
4     6
5     4
9    10
3     3
2     5
dtype: int64
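
To make the three strategies concrete, here is a plain-Python sketch of what such an augmentation could look like. It is an illustration only, not zarnitsa's implementation. The function augment_values and its seed parameter are hypothetical. Fitting the normal distribution to the column's mean and standard deviation, and bounding the uniform draw by the column's minimum and maximum, are assumptions (the sample output above suggests the library uses a wider uniform range).

```python
import random
import statistics


def augment_values(values, freq=0.5, aug_type="normal", seed=None):
    """Return {position: new_value} for a random `freq` share of `values`."""
    rng = random.Random(seed)
    n_selected = int(len(values) * freq)
    positions = rng.sample(range(len(values)), n_selected)

    if aug_type == "normal":
        # Assumption: draw from a normal distribution fitted to the whole column.
        mean, std = statistics.mean(values), statistics.stdev(values)
        new_values = [rng.gauss(mean, std) for _ in positions]
    elif aug_type == "uniform":
        # Assumption: draw uniformly between the column's minimum and maximum.
        low, high = min(values), max(values)
        new_values = [rng.uniform(low, high) for _ in positions]
    elif aug_type == "permutations":
        # Shuffle the selected values among the selected positions.
        new_values = [values[p] for p in positions]
        rng.shuffle(new_values)
    else:
        raise ValueError(f"unknown aug_type: {aug_type!r}")

    return dict(zip(positions, new_values))


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for aug_type in ("normal", "uniform", "permutations"):
    print(aug_type, augment_values(data, freq=0.5, aug_type=aug_type, seed=0))
```

Note that "permutations" never invents new values: it only moves existing ones to other positions, which is why its output above keeps the integer dtype.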

Augmentation NLP

This type of augmentation is useful when working with text.

>>> from zarnitsa.DataAugmenterNLP import DataAugmenterNLP

>>> daug = DataAugmenterNLP()

>>> text = "This is sentence example to augment. Thank you for interesting"

>>> daug.augment_column_wordnet(text)
'This be sentence example to augment. Thank you for concern'

>>> daug.augment_column_del(text, reps=1)
'This is sentence example to augment. you for interesting'

>>> daug.augment_column_permut(text, reps=1)
'This sentence is example to augment. Thank you for interesting'
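
The deletion and permutation outputs above can be imitated in a few lines of plain Python. This is a rough sketch of the idea, not the code behind augment_column_del or augment_column_permut. The function names delete_words and swap_adjacent_words, the seed parameter, and the choice to swap only neighbouring words are all assumptions made for illustration.

```python
import random


def delete_words(text, reps=1, seed=None):
    """Remove `reps` randomly chosen words from `text`."""
    rng = random.Random(seed)
    words = text.split()
    # Never delete the last remaining word.
    for _ in range(min(reps, len(words) - 1)):
        del words[rng.randrange(len(words))]
    return " ".join(words)


def swap_adjacent_words(text, reps=1, seed=None):
    """Swap `reps` randomly chosen pairs of neighbouring words."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    for _ in range(reps):
        i = rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


text = "This is sentence example to augment. Thank you for interesting"
print(delete_words(text, reps=1, seed=0))
print(swap_adjacent_words(text, reps=1, seed=0))
```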

Augmentation External

This type of augmentation is useful when modeling a distribution about which you have prior knowledge.

Setting df[...] = np.nan imitates sparse or missing data, which we then try to fill in using augmentation. The examples below assign None, which pandas stores as NaN in a float column.

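>>> from copy import copy
>>> import pandas as pd
>>> import matplotlib.pyplot as plt

>>> # `daug` here is assumed to be an instance of the package's external
>>> # (distribution-based) augmenter; its import is not shown in this README.
>>> size = 500
>>> serial_was = pd.Series(daug.augment_distrib_random(aug_type='normal', loc=0, scale=1, size=size))
>>> serial_new = copy(serial_was)

>>> # Blank out 100 random values, then fill them from Normal(0, 1)
>>> serial_new.loc[serial_new.sample(100).index] = None
>>> serial_new = daug.augment_column(serial_new, aug_type='normal', loc=0, scale=1)

>>> plt.figure(figsize=(12, 8))
>>> serial_was.hist(bins=100)
>>> serial_new.hist(bins=100)

![Screenshot_20210724_133433](https://user-images.githubusercontent.com/55444371/126865853-2e09b4dd-f864-43d8-9e63-741c3862d153.png)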

>>> size = 50
>>> df = pd.DataFrame({
...     'data1': daug.augment_distrib_random(aug_type='normal', loc=0, scale=1, size=size),
...     'data2': daug.augment_distrib_random(aug_type='normal', loc=0, scale=1, size=size),
... })
>>> for col in df.columns:
...     # Blank out 10 random rows per column; indexing the frame avoids chained assignment
...     df.loc[df[col].sample(10).index, col] = None
>>> plt.figure(figsize=(12, 8))
>>> df.plot()
>>> daug.augment_dataframe(df, aug_type='normal', loc=0, scale=1).plot()
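
The core idea of the external examples, filling gaps with draws from a distribution you already know, can be sketched without any dependencies. This is an illustrative stand-in, not what augment_column or augment_dataframe do internally. The function fill_missing and its seed parameter are hypothetical, and only the normal distribution is handled.

```python
import random


def fill_missing(values, loc=0.0, scale=1.0, seed=None):
    """Replace each None in `values` with a draw from Normal(loc, scale)."""
    rng = random.Random(seed)
    return [rng.gauss(loc, scale) if v is None else v for v in values]


rng = random.Random(42)
original = [rng.gauss(0, 1) for _ in range(500)]

# Imitate sparseness: blank out 100 randomly chosen positions.
sparse = list(original)
for position in rng.sample(range(len(sparse)), 100):
    sparse[position] = None

restored = fill_missing(sparse, loc=0, scale=1, seed=7)
print(sum(v is None for v in sparse), "missing before,",
      sum(v is None for v in restored), "missing after")
# prints: 100 missing before, 0 missing after
```

Values that were present are passed through unchanged, so only the gaps are synthetic. The closer the chosen loc and scale are to the true distribution, the less the filled series differs from the original.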
