jasonwei20 / eda_nlp

Data augmentation for NLP, presented at EMNLP 2019

Home Page: https://arxiv.org/abs/1901.11196

Removal of apostrophes, hyphens and things.

AllBecomesGood opened this issue · comments

So in eda.py you remove several things like:
line = line.replace("’", "")
line = line.replace("'", "")
line = line.replace("-", " ")

And I was wondering why that is? While this augmentation method improved my results dramatically, I now need to somehow get data back in that lets the bot learn that "I'm" is the same as "I am" etc., as the data now only ever includes "im".
Is this some limitation of WordNet or something?

I don't know if WordNet is the limitation, but I don't think so. Basically, having the punctuation makes EDA more complicated, so I removed it. You're welcome to add it back, and if you have a good solution, feel free to send a PR.
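
One possible workaround, sketched here as an assumption rather than anything eda.py actually does, is to expand contractions before the apostrophes are stripped, so that "I'm" becomes "i am" instead of "im". The `CONTRACTIONS` table and the `expand_contractions` and `clean` helpers are hypothetical names introduced for illustration; the table is deliberately tiny and would need extending for real data.

```python
import re

# Hypothetical mapping, not part of eda.py; extend it for your own data.
CONTRACTIONS = {
    "i'm": "i am",
    "you're": "you are",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
}

# One alternation over all keys, matched case-insensitively on word boundaries.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(line):
    # Normalise the typographic apostrophe so both forms hit the same keys.
    line = line.replace("’", "'")
    # Replacements are lowercase, so "I'm" comes out as "i am".
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], line)


def clean(line):
    # Expand first, then apply the same stripping quoted in the issue.
    line = expand_contractions(line)
    line = line.replace("'", "")
    line = line.replace("-", " ")
    return line


print(clean("I'm sure it's well-known"))
```

Apostrophes that are not in the table (possessives such as "Bob's", for example) are still stripped as before, so this only changes the contractions you list explicitly.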