KennethEnevoldsen / augmenty

Augmenty is an augmentation library based on spaCy for augmenting texts.

Home Page: https://kennethenevoldsen.github.io/augmenty/

Use of augmenty with spacy config files for training

Giles-Billenness opened this issue

I didn't see any documentation on how to import these augmenters when training with spaCy 3.0's config and command-line system.
Is it possible to use them this way?
If so, how?

Upon further review: for the command line to register new augmenters, the flag
--code <code.py>
needs to be set when calling the training command. I tried pointing it at the specific file that contains the keystroke augmenter I wanted, but it complains about not knowing a parent package for relative imports. I also tried the various __init__.py files, but it complained about those as well.
It seems to work when you take the code out, place it in a new file without relative imports, and point to that.
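
A possible explanation for the relative-import error, sketched with the standard library only. It assumes that --code loads the given file as a standalone top-level module (roughly what importlib's spec_from_file_location does), so the file has no parent package to resolve `from . import ...` against. The package and file names below are hypothetical and are created in a temporary directory.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_standalone(path: Path, name: str):
    """Load a single .py file as a top-level module (no parent package)."""
    spec = importlib.util.spec_from_file_location(name, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def demo() -> dict:
    """Compare loading a file with a relative import vs. a standalone file."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical package whose module uses a relative import.
        pkg = Path(tmp) / "hypothetical_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "helpers.py").write_text("VALUE = 1\n")
        (pkg / "with_relative.py").write_text("from .helpers import VALUE\n")

        # Standalone file that only uses absolute imports.
        standalone = Path(tmp) / "my_augmenters.py"
        standalone.write_text("import json\nLOADED = True\n")

        try:
            load_standalone(pkg / "with_relative.py", "user_code_relative")
            results["relative"] = "ok"
        except ImportError as err:
            # The module was loaded without a package, so "." cannot be resolved.
            results["relative"] = f"ImportError: {err}"

        results["standalone"] = load_standalone(standalone, "user_code").LOADED
    return results


if __name__ == "__main__":
    print(demo())
```

If that assumption holds, it would explain why a separate file that uses only absolute imports (such as `import augmenty`) works with --code, while a file taken from inside the package does not.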

Which page or section is this issue related to?

https://spacy.io/usage/training#data-augmentation-custom

https://kennethenevoldsen.github.io/augmenty/tutorials/introduction.html#Applying-the-augmentation

Hi @Giles-Billenness,

Yes, as you correctly point out, you need to supply the --code flag, e.g. --code my_augmenters.py.

I believe the script my_augmenters.py could simply contain:

# my_augmenters.py

import augmenty

Importing augmenty adds all of its augmenters to the spaCy augmenter registry. This should allow you to add the following to your config:

[corpora.train.augmenter]
@augmenters = "keystroke_error.v1"
level = 0.1
keyboard = "en_qwerty.v1"

If you want slightly more complex augmentation, you can combine multiple augmenters using augmenty.combine. This could look something like this:

# my_augmenters.py

import augmenty
import spacy

# add it to the spaCy registry so that you can call it from the config
@spacy.registry.augmenters("my_custom_augmenter")
def combined_augmenters():
    """A combined augmenter which adds semi-realistic keystroke errors and swaps 2% of tokens."""
    key_aug = augmenty.load("keystroke_error.v1", level=0.02, keyboard="en_qwerty.v1")
    swap_aug = augmenty.load("token_swap.v1", level=0.02)
    augmenters = [key_aug, swap_aug]
    return augmenty.combine(augmenters)

You should then be able to add this to the config:

[corpora.train.augmenter]
@augmenters = "my_custom_augmenter"

For more inspiration, I have some files here where I train the Danish spaCy pipeline DaCy. For the command, you can check out the yml file, and for the augmenters, the script danish_augmenter.py.

Let me know if it works; otherwise I will have another look at it.

Yeah, that worked for me, thank you.

Good to know. I will close the issue - do let me know if there are any other issues with the package.