KennethEnevoldsen / augmenty

Augmenty is an augmentation library based on spaCy for augmenting texts.

Home Page: https://kennethenevoldsen.github.io/augmenty/

Paragraph subset augmenter

martincjespersen opened this issue

A paragraph subset augmenter that can work at either the token or the sentence level. It samples a random fraction of the tokens/sentences to keep, and a random start token/sentence chosen so that the kept tokens/sentences form one contiguous span of that size. The augmenter needs to handle annotated entities and avoid splitting them.

Input arguments:
level: How often to apply the augmenter, e.g. level=1.0 to always apply it.
min_paragraf: Minimum fraction of tokens or sentences to include. E.g. with 4 sentences, min_paragraf=0.5 means that at least 2 sentences are included.
sentence_level: Boolean. If True the augmenter works at the sentence level, otherwise at the token level.
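
The sampling described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not augmenty's implementation: the helper names `sample_window` and `widen_to_entities` are hypothetical, entities are assumed to be non-overlapping half-open `(start, end)` token offsets, and widening the window to the entity boundaries is just one possible way to avoid breaking entities.

```python
import math
import random


def sample_window(n_units, min_fraction, rng):
    """Return (start, end) of a random contiguous window over n_units items."""
    if n_units == 0:
        return 0, 0
    # Smallest allowed window, e.g. 4 units with min_fraction=0.5 gives 2.
    min_len = min(n_units, max(1, math.ceil(min_fraction * n_units)))
    length = rng.randint(min_len, n_units)
    # Only start positions that leave room for the whole window are valid.
    start = rng.randint(0, n_units - length)
    return start, start + length


def widen_to_entities(start, end, entities):
    """Grow the window so that no (ent_start, ent_end) span is cut in half.

    Assumes the entity spans do not overlap each other.
    """
    for ent_start, ent_end in entities:
        if ent_start < end and ent_end > start:  # entity overlaps the window
            start = min(start, ent_start)
            end = max(end, ent_end)
    return start, end


# Sentence-level usage on the four sentences from the example in this issue.
sentences = [
    "Augmenty is a wonderful tool for augmentation.",
    "Augmentation is a wonderful tool for obtaining higher performance on limited data.",
    "You can also use it to see how robust your model is to changes.",
    "It will sample subset of the paragraf.",
]
start, end = sample_window(len(sentences), 0.5, random.Random(0))
subset = " ".join(sentences[start:end])
```

At the token level the same window would be sampled over token indices and then passed through `widen_to_entities` before slicing, so an entity that straddles a boundary is kept whole rather than dropped or split.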

Example - sentence level

import augmenty
import spacy
nlp = spacy.load("en_core_web_sm")

# four sentences
texts = [
    "Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool "
    "for obtaining higher performance on limited data. You can also use it to see how "
    "robust your model is to changes. It will sample subset of the paragraf.",
]
# nlp() expects a single string, so use nlp.pipe() for a list of texts
docs = list(nlp.pipe(texts))

augmenter = augmenty.load("paragraf_subset.v1", level=1.0, min_paragraf=0.5, sentence_level=True)

list(augmenty.texts(texts, augmenter, nlp))

Example outputs:

The first section:

Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool 
for obtaining higher performance on limited data.

The middle section:

Augmentation is a wonderful tool for obtaining higher performance on limited data. 
You can also use it to see how robust your model is to changes.

The last section:

You can also use it to see how robust your model is to changes. It will sample subset 
of the paragraf.

Additional thoughts:

Possibly also add a reverse augmenter, e.g. one that removes a coherent section of tokens/sentences.
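
A rough sketch of what such a reverse augmenter could do, under the same caveats: `drop_window` and its `max_fraction` parameter are hypothetical names chosen for illustration, not part of augmenty, and entity handling is left out for brevity.

```python
import math
import random


def drop_window(units, max_fraction, rng):
    """Return a copy of units with one random contiguous run removed."""
    units = list(units)
    n = len(units)
    # Remove at most max_fraction of the units, and always keep at least one.
    max_len = min(n - 1, math.floor(max_fraction * n))
    if max_len < 1:
        return units
    length = rng.randint(1, max_len)
    start = rng.randint(0, n - length)
    return units[:start] + units[start + length:]


tokens = "You can also use it to see how robust your model is".split()
shortened = drop_window(tokens, 0.5, random.Random(0))
```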

The way augmenty is set up now, it only allows augmentation within a sample, i.e. for:
"Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool"

you could get:

"Augmenty is a wonderful tool for augmentation."
"Augmentation is a wonderful tool"
"Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool"

But never:

Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool 
for obtaining higher performance on limited data.

I still think the augmenter is relevant though. The other point would require #14, which is a known problem with the spaCy augmentation setup as it currently stands.

Will be added in #50

Added in newest version