KennethEnevoldsen / augmenty

Augmenty is an augmentation library based on spaCy for augmenting texts.

Home Page: https://kennethenevoldsen.github.io/augmenty/


Misaligned Token after Data Augmentation

stefkr1 opened this issue · comments

Hi Kenneth,

first of all, I would like to thank you for publishing the amazing augmentation library Augmenty. It provides a wide range of augmentation possibilities in terms of modifications and modification levels.

I am using Augmenty to modify emails, with the goal of testing and increasing the robustness of my custom spaCy NER model. I annotated the named entities (15 labels) of the emails using Prodigy and saved them in the spaCy format (DocBin). Subsequently, I trained a German NER model with spaCy. The data augmentation of the annotated data was straightforward (you will find my custom augmenter attached to the email). Here is my code:

from typing import Dict

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

# `db` and `augmenter` are defined earlier (see the custom augmenter in the email)
nlp = spacy.load("/home/models/model-best/")
corpus = Corpus('data/' + db + '.spacy')
augmented_corpus = [
    e for example in corpus(nlp) for e in augmenter(nlp, example)
]

docs: Dict = {"data": []}
for eg in augmented_corpus:
    doc = eg.reference
    docs["data"].append(doc)

docbin = DocBin(docs=docs["data"],
                attrs=["ENT_IOB", "ENT_TYPE"],
                store_user_data=True)
docbin.to_disk('data/' + db + '_augmented.spacy')
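(As a standalone sanity check of the save/load step, independent of any augmenter: a DocBin written with these attributes should preserve the tokenization on a round trip. The text and file name below are made up for illustration.)

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("de")
doc = nlp("Von: Miller")  # dummy text, not the actual email data

# store the doc with the same attributes as above
docbin = DocBin(docs=[doc], attrs=["ENT_IOB", "ENT_TYPE"], store_user_data=True)
docbin.to_disk("roundtrip_check.spacy")  # hypothetical path

# load it back and compare: text and token count should survive the round trip
loaded = list(DocBin().from_disk("roundtrip_check.spacy").get_docs(nlp.vocab))
print(loaded[0].text == doc.text, len(loaded[0]) == len(doc))
```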

I encountered a token alignment problem in the training process. spaCy (python -m spacy debug data) returned a warning that several tokens are misaligned. I am wondering why this happens. In the example “applying augmentation to examples or a corpus” you use a dataset in the CoNLL (?) format, while I use a DocBin (.spacy) file as the basis. Might this be the reason? Or do I need to change the tokenizer settings?

[nlp]
lang = "de"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

I looked into the augmented dataset and I didn’t find a clear pattern for the misalignment. Here is a short example:

corpus = Corpus('data/' + dbs_train[1] + '_augmented.spacy')
examples_aug = []
for example in corpus(nlp):
    examples_aug.append(example)

eg_aug = examples_aug[4]
align_aug = eg_aug.alignment
gold_aug = eg_aug.reference

output_aug = []
for token in gold_aug:
    output_aug.append(str(token) + ' ' + str(align_aug.x2y.lengths[token.i]))

With align_aug.x2y.lengths[token.i], some numbers are greater than 1, which means misalignment. But I don’t understand the output:
['\n\n 1', 'Von 1', ': 3', 'Miller 1']

In this case the ‘:’ has a number of 3, why is that? Can you help me with this issue?
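(A length greater than 1 just means that one token on one side spans several tokens on the other side, i.e. the two tokenizations disagree at that position. This can be reproduced with spacy.training.Alignment directly; the tokens below are made up, not your actual data.)

```python
from spacy.training import Alignment

# the gold tokenization keeps ":-)" as one token; the tokenizer splits it into three
gold = ["Von", ":-)", "Miller"]
pred = ["Von", ":", "-", ")", "Miller"]

# x2y maps from the first argument (gold) to the second (pred)
align = Alignment.from_strings(gold, pred)
print(list(align.x2y.lengths))  # [1, 3, 1]: the gold ":-)" spans three predicted tokens
```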

For reproducibility I have sent you the data via email.

Many thanks!

Your Environment

  • augmenty Version Used: 1.4.3
  • Operating System: Red Hat Enterprise Linux 8.8 (Ootpa)
  • Python Version Used: 3.9
  • spaCy Version Used: 3.7.4
  • Environment Information: Posit 2023.03.0

Hi @stefkr1, thanks for creating the issue as well. Since there are many places where the error could happen, I would love to get a minimal reproducible example: on what text does the error happen, what would you expect to happen, and what actually happened? (You can just create it as a dummy document.)

If you would rather debug it yourself, I would try simplifying your augmenter to just a few augmenters and then gradually adding more. That should help us narrow down the augmenter where the error happens.

I just applied the custom augmenter to the dataset I sent you by email.

If I then retrieve the examples from the .spacy dataset and compare the reference tokens with the predicted tokens, I see a mismatch. This should not be the case.

Here is what I did:

  1. Extract the examples from the .spacy dataset:

corpus = Corpus('augmented.spacy')
examples_aug = []
for example in corpus(nlp):
    examples_aug.append(example)

eg_aug=examples_aug[1]

  2. Retrieve the predictions and the reference from the example containing the augmentations:

eg_aug = examples_aug[1]  # in my case this contains the augmented data, because I generate two augmented examples and keep the original example using augmenty.yield_original()
reference = eg_aug.reference
predictions = eg_aug.predicted

  3. Compare the number of tokens of the predictions and the reference:

len(reference)
len(predictions)

-> The lengths disagree

  4. Compare the tokens at the i-th position:

reference[442]
predictions[442]

-> The token values disagree

  5. I also loaded the tokenizer from the nlp model and applied it to the example’s text to predict its tokens:

tokenizer = nlp.tokenizer
tokens = tokenizer(reference.text)

-> len(tokens) matches the number of tokens in the predictions.

If I apply the above procedure to the non-augmented dataset, the number of tokens is the same for the reference and the predictions.

Can you please try to reproduce my approach?

I would expect the number of tokens to change when you augment the text (at least that often happens). As long as the gold-standard annotation is augmented to match, that should not be a problem.

It is similarly perfectly valid for the reference and the predicted doc to have a different number of tokens. Take e.g. the sentence:

"My name is john" - (augment) -> "Mynameisjohn". The gold tokens can still be My|name|is|john, but the tokenizer would tokenize it as Mynameisjohn (one token). That is perfectly fine and should not cause any problems during training.

(when I reproduce it I also get a different number of tokens, but that is not an issue)
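(The situation above can be reconstructed as a minimal, self-contained sketch with dummy text, not the actual email data: the reference keeps the four gold tokens while the tokenizer sees a single token, and spaCy's Example accepts the pair without complaint because the underlying texts match.)

```python
import spacy
from spacy.tokens import Doc
from spacy.training import Example

nlp = spacy.blank("en")

# predicted: what the tokenizer produces on the augmented text
predicted = nlp.make_doc("Mynameisjohn")  # a single token

# reference: the gold tokenization preserved from before augmentation
reference = Doc(nlp.vocab, words=["My", "name", "is", "john"],
                spaces=[False, False, False, False])

example = Example(predicted, reference)
print(len(example.predicted), len(example.reference))  # differing token counts, same text
```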

This issue is stale because it has been open for 14 days with no activity. Feel free to either 1) remove the stale label or 2) comment. If nothing happens, this will be closed in 7 days.

This issue was closed automatically. Feel free to re-open it if it's important.