makcedward / nlpaug

Data augmentation for NLP

Home Page:https://makcedward.github.io/

Stopwords

abcp4 opened this issue · comments

Hello, it seems like the stopwords aren't being filtered correctly:

The word 'quick' is not being ignored. It would be nice if the augmenter simply passed over stopwords.

A new issue with stopwords:
It seems like punctuation is turning tokens into new words. For example, 'dog' is not being filtered because 'dog.' is seen as a single word.

You are right. The default tokenizer splits the text on spaces:
tokens = text.split(' ')
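A minimal sketch of the failure mode (no nlpaug needed): because the split keeps trailing punctuation attached to the word, 'dog.' never compares equal to the stopword 'dog' and so survives the filter. The sample sentence and stopword set here are illustrative assumptions.

```python
# Splitting on spaces leaves punctuation glued to the last word,
# so exact-match stopword filtering misses it.
text = 'The quick brown fox jumps over the lazy dog.'
stopwords = {'quick', 'dog'}

tokens = text.split(' ')
kept = [t for t in tokens if t not in stopwords]
# 'quick' is filtered, but 'dog.' slips through because it is not
# literally equal to 'dog'
```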

Will enhance the implementation of the tokenizer. Until then, there are two ways to work around it:

  1. Split punctuation from words in the input. For example, change the input to 'The quick brown fox , jumps over the lazy dog .'.
  2. Override the augmenter's tokenizer with a custom one:
import re
import nlpaug.augmenter.char as nac

# Note: this _tokenizer is not good enough on its own, as punctuation
# is dropped from the returned tokens (the pattern only matches words
# of two or more characters).
def _tokenizer(text, token_pattern=r"(?u)\b\w\w+\b"):
    token_pattern = re.compile(token_pattern)
    return token_pattern.findall(text)

aug = nac.QwertyAug()
aug.tokenizer = _tokenizer
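If dropping punctuation is a problem, a small variation keeps it as separate tokens, so 'dog' and '.' come back individually and stopword matching works. The pattern `\w+|[^\w\s]` (runs of word characters, or any single non-space symbol) is my assumption here, not part of nlpaug itself.

```python
import re

# A tokenizer that emits punctuation as its own tokens instead of
# discarding it or leaving it attached to the preceding word.
def tokenizer(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenizer('The quick brown fox, jumps over the lazy dog.')
# 'dog' and '.' are now separate tokens, so a stopword list
# containing 'dog' matches as expected
```

It can then be attached to the augmenter the same way as above, e.g. `aug.tokenizer = tokenizer`.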