makcedward / nlpaug

Data augmentation for NLP

Home Page:https://makcedward.github.io/

Stopwords

abcp4 opened this issue · comments

Hello, it seems like the stopwords aren't being filtered correctly:

The word 'quick' is not being ignored. It would be nice if the augmenter simply passed over stopwords.

A new issue with stopwords:
It seems like punctuation is turning tokens into new words. For example, 'dog' is not being filtered because 'dog.' is seen as a single word.

You are right. The default tokenizer splits the text on spaces:
tokens = text.split(' ')
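A minimal sketch of the failure mode (no nlpaug needed): because the split keeps trailing punctuation attached to the word, 'dog.' never compares equal to the stopword 'dog' and so survives the filter. The sample sentence and stopword set here are illustrative assumptions.

```python
# Splitting on spaces leaves punctuation glued to the last word,
# so exact-match stopword filtering misses it.
text = 'The quick brown fox jumps over the lazy dog.'
stopwords = {'quick', 'dog'}

tokens = text.split(' ')
kept = [t for t in tokens if t not in stopwords]
# 'quick' is filtered, but 'dog.' slips through because it is not
# literally equal to 'dog'
```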

Will enhance the implementation of the tokenizer. Until then, there are two ways to work around it:

  1. Split punctuation from words in the input. For example, change the input to 'The quick brown fox , jumps over the lazy dog .'.
  2. Override the augmenter's tokenizer with a custom one:
import re
import nlpaug.augmenter.char as nac

# Note: this _tokenizer is not good enough on its own, as punctuation
# is dropped from the returned tokens (the pattern only matches words
# of two or more characters).
def _tokenizer(text, token_pattern=r"(?u)\b\w\w+\b"):
    token_pattern = re.compile(token_pattern)
    return token_pattern.findall(text)

aug = nac.QwertyAug()
aug.tokenizer = _tokenizer
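If dropping punctuation is a problem, a small variation keeps it as separate tokens, so 'dog' and '.' come back individually and stopword matching works. The pattern `\w+|[^\w\s]` (runs of word characters, or any single non-space symbol) is my assumption here, not part of nlpaug itself.

```python
import re

# A tokenizer that emits punctuation as its own tokens instead of
# discarding it or leaving it attached to the preceding word.
def tokenizer(text):
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenizer('The quick brown fox, jumps over the lazy dog.')
# 'dog' and '.' are now separate tokens, so a stopword list
# containing 'dog' matches as expected
```

It can then be attached to the augmenter the same way as above, e.g. `aug.tokenizer = tokenizer`.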