Stopwords
abcp4 opened this issue
You are right. The default tokenizer splits text on spaces:
tokens = text.split(' ')
I will enhance the tokenizer implementation. Until then, there are two ways to work around it:
- Split punctuation in the input yourself. For example, change the input to 'The quick brown fox , jumps over the lazy dog .' (see the sketch at the end of this comment).
- Pass a custom tokenizer to the augmenter:
import re
import nlpaug.augmenter.char as nac

# Note: this tokenizer is not ideal, since the \w\w+ pattern drops
# punctuation (and single-character words) from the returned tokens.
def _tokenizer(text, token_pattern=r"(?u)\b\w\w+\b"):
    token_pattern = re.compile(token_pattern)
    return token_pattern.findall(text)

aug = nac.QwertyAug()
aug.tokenizer = _tokenizer
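With the custom tokenizer attached, the augmenter is used as usual; a minimal sketch (the sample sentence is just an illustration):

text = 'The quick brown fox jumps over the lazy dog'
# Characters are perturbed as if mistyped on a QWERTY keyboard.
augmented = aug.augment(text)
print(augmented)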
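For the first workaround, here is a minimal sketch of pre-splitting punctuation with Python's re module (the pattern and variable names are my own, not part of nlpaug):

import re

text = 'The quick brown fox, jumps over the lazy dog.'
# Insert a space before each punctuation mark so that a plain
# whitespace split treats it as a separate token.
spaced = re.sub(r'([.,!?;:])', r' \1', text)
# spaced == 'The quick brown fox , jumps over the lazy dog .'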