Paragraph subset augmenter
martincjespersen opened this issue
A paragraph subset augmentation that can work at the token or sentence level. It samples a random percentage of coherent (contiguous) tokens/sentences to include, and a random token/sentence start position chosen so that the sampled percentage still fits within the paragraph. The augmenter needs to handle annotated entities and avoid breaking them.
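One way the entity constraint could be handled at the token level is to widen a sampled window until it no longer cuts through any entity. The sketch below is only an illustration under stated assumptions: `widen_to_entities` is a hypothetical helper, not part of augmenty, and entities are assumed to be given as `(start, end)` token offsets with an exclusive end, the same convention spaCy uses for `Span.start` and `Span.end`.

```python
def widen_to_entities(start, end, entities):
    """Grow the token window [start, end) until no entity span is cut.

    `entities` is a list of (ent_start, ent_end) token offsets, end exclusive.
    Hypothetical helper for illustration only.
    """
    changed = True
    while changed:
        changed = False
        for ent_start, ent_end in entities:
            overlaps = ent_start < end and ent_end > start
            inside = ent_start >= start and ent_end <= end
            if overlaps and not inside:
                # The window cuts this entity: extend it to cover the whole span.
                start, end = min(start, ent_start), max(end, ent_end)
                changed = True
    return start, end


tokens = "I met Barack Obama in New York City yesterday".split()
entities = [(2, 4), (5, 8)]  # "Barack Obama", "New York City"

# A window starting inside one entity and ending inside the other.
start, end = widen_to_entities(3, 6, entities)
kept = tokens[start:end]
```

Widening can push the window above the sampled size; shrinking it instead would be the alternative, at the risk of falling below the requested minimum.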
Input arguments:
level: How often to apply the augmenter.
min_paragraf: Minimum proportion of tokens or sentences to include. I.e. with 4 sentences, min_paragraf=0.5 means at least 2 sentences are included.
sentence_level: Boolean defining whether to operate at the sentence level (True) or the token level (False).
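The sampling step described above could be sketched roughly as follows. This is a minimal, library-free illustration and not the augmenty implementation: `sample_window` and the `rng` argument are hypothetical names, and rounding the number of kept units up is an assumption chosen so that 4 sentences with `min_paragraf=0.5` keep at least 2.

```python
import math
import random


def sample_window(n_units, min_paragraf, rng):
    """Return (start, end) of a random contiguous window over n_units items.

    Hypothetical helper for illustration; units are tokens or sentences.
    """
    if n_units == 0:
        return 0, 0
    # Random share of units to keep, never below min_paragraf.
    share = rng.uniform(min_paragraf, 1.0)
    # Round up (an assumption) and keep at least one unit.
    n_keep = min(n_units, max(1, math.ceil(share * n_units)))
    # Any start position that leaves room for n_keep units is valid.
    start = rng.randint(0, n_units - n_keep)
    return start, start + n_keep


sentences = [
    "Augmenty is a wonderful tool for augmentation.",
    "Augmentation is a wonderful tool for obtaining higher performance on limited data.",
    "You can also use it to see how robust your model is to changes.",
    "It will sample subset of the paragraf.",
]

rng = random.Random(0)  # seeded only to make the sketch reproducible
start, end = sample_window(len(sentences), 0.5, rng)
subset = " ".join(sentences[start:end])
```

The same function covers the token level if `n_units` is the number of tokens instead of the number of sentences.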
Example - sentence level
import augmenty
import spacy
nlp = spacy.load("en_core_web_sm")
# four sentences
texts = [
"Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool "
"for obtaining higher performance on limited data. You can also use it to see how "
"robust your model is to changes. It will sample subset of the paragraf.",
]
docs = list(nlp.pipe(texts))  # nlp() expects a single string; nlp.pipe handles a list
augmenter = augmenty.load("paragraf_subset.v1", level=1.0, min_paragraf=0.5, sentence_level=True)
list(augmenty.texts(texts, augmenter, nlp))
Example outputs:
The first section:
Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool
for obtaining higher performance on limited data.
The middle section:
Augmentation is a wonderful tool for obtaining higher performance on limited data.
You can also use it to see how robust your model is to changes.
The last section:
You can also use it to see how robust your model is to changes. It will sample subset
of the paragraf.
Additional thoughts:
Possibly add a reverse augmenter, e.g. one that removes a coherent section of tokens/sentences.
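A rough sketch of what such a reverse augmenter might do, again without spaCy. Everything here is hypothetical: `drop_window` and its `max_drop` parameter (the largest proportion of units that may be removed) are names invented for this illustration, not existing augmenty arguments.

```python
import random


def drop_window(units, max_drop, rng):
    """Remove one random contiguous run covering at most max_drop of the units.

    Hypothetical helper for illustration; always keeps at least one unit.
    """
    n = len(units)
    # Upper bound on the run length: at least 1, but never all units.
    limit = min(n - 1, max(1, int(max_drop * n)))
    if limit <= 0:
        return list(units)
    n_drop = rng.randint(1, limit)
    start = rng.randint(0, n - n_drop)
    # Keep everything before and after the removed run.
    return units[:start] + units[start + n_drop:]


sentences = ["First.", "Second.", "Third.", "Fourth."]
rng = random.Random(0)  # seeded only to make the sketch reproducible
remaining = drop_window(sentences, 0.5, rng)
```

Unlike the subset augmenter, removing a middle run joins sentences that were not adjacent in the original text.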
The way augmenty is set up now, it only allows augmentation within a sample, i.e. for:
"Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool"
you could get:
"Augmenty is a wonderful tool for augmentation."
"Augmentation is a wonderful tool"
"Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool"
But never:
Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool
for obtaining higher performance on limited data.
I still think the augmenter is relevant though. The other point would require #14, which is a known problem with the spaCy augmentation setup as it currently stands.
Will be added in #50
Added in the newest version.