bbalet / stopwords

Removes most frequent words (stop words) from a text content. Based on a Curated list of language statistics.

Numbers being removed?

ernsheong opened this issue

It seems that numbers are being treated as stop words. This should not be the case?

Short answer: It looks like they are, but it is a side effect of the word segmenter being used.

Long answer: It is not that they are stop words, but that they are not considered words at all. Here is the workflow:

  1. Segment the text into words using the regexp [\pL\p{Mc}\p{Mn}-_']+. Anything this expression does not match is not a word (a word being a series of runes separated by something else) and is ignored by the algorithm. Digits fall into that category.
  2. Iterate over the words. Is the word a stop word? If so, eliminate it.
  3. Optionally use the word in an algorithm such as SimHash.
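The first two steps above can be sketched with the Go standard library alone. This is only an illustration, not the library's actual code: the name removeStopWords and the three-word stop list are made up for the example, and only the regexp is taken from this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wordSegmenter is the regexp quoted in this thread: runs of letters,
// combining marks, hyphens, underscores and apostrophes. No digits.
var wordSegmenter = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}-_']+`)

// stopList is a tiny hypothetical stop list, not the curated one.
var stopList = map[string]bool{"a": true, "the": true, "is": true}

// removeStopWords (hypothetical name) segments text into words and drops
// the ones found in stopList. Whatever the regexp does not match is lost.
func removeStopWords(text string) string {
	var kept []string
	for _, w := range wordSegmenter.FindAllString(strings.ToLower(text), -1) {
		if !stopList[w] {
			kept = append(kept, w)
		}
	}
	return strings.Join(kept, " ")
}

func main() {
	// "42" never becomes a word, so it cannot survive the filter.
	fmt.Println(removeStopWords("The answer is 42, a number"))
	// prints "answer number"
}
```

Note that the number disappears in step 1, before the stop list is consulted at all.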

We remove stop words in order to estimate the meaning of the text. Does your question (or issue) mean that you analyze texts in which numbers carry meaning? If so, you could try patching stopwords.go:28 with this small change:

unicodeWords = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}\p{Nd}-_']+`)

And tell us if it has improved your results.

One example use case: "16-year-old" is output as "-year-old", which is unexpected. I think it should be made clearer that this repo tries to be more intelligent than a simple-stupid implementation (ignores special runes), and that there are caveats. But I'll probably not try to change the behavior of this repo. Thanks :)

No need to change the main behavior. I can add a new function or a parameter and then everyone will be happy, because the added value of this repo lies in the curated list of stop words and in being Unicode-friendly, even with Brahmic languages.

I just want to understand why some people think that including numbers can improve the estimation.

With your example, "16-year-old" is segmented into "year old". Neither of these is a stop word. So there is no difference between the estimated meaning of "a 16-year-old boy" and "a 13-year-old boy", because both come out as "year old boy".
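To make the comparison concrete, here is a small sketch that runs the two phrases through the expression quoted earlier and through the variant with \p{Nd} added. The helper name segment is made up for the example, and it only exercises the regexps, not the library itself.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// original is the segmenter quoted earlier in this thread.
	original = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}-_']+`)
	// patched adds \p{Nd} (decimal digits), as in the suggested change.
	patched = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}\p{Nd}-_']+`)
)

// segment (hypothetical helper) returns every match of re in text.
func segment(re *regexp.Regexp, text string) []string {
	return re.FindAllString(text, -1)
}

func main() {
	for _, text := range []string{"a 16-year-old boy", "a 13-year-old boy"} {
		fmt.Printf("%q\n  original: %q\n  patched:  %q\n",
			text, segment(original, text), segment(patched, text))
	}
}
```

With the original expression both phrases yield the same three tokens. The hyphens survive because - is part of the character class, which is why the reported output was "-year-old". With the patched expression the tokens "16-year-old" and "13-year-old" stay distinct.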

> I just want to understand why some people think that including numbers can improve the estimation.

In my case, I am trying to use proper nouns and numbers as shingles to estimate document similarity. Hence keeping the numbers is of value, because news articles tend to cite specifics as numbers, which improves the estimation.
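A rough sketch of why the numbers matter for shingling, under simplifying assumptions: it builds two-word shingles from all words (no proper-noun detection, no stop-word removal) and compares two made-up sentences with Jaccard similarity. The function names and sample sentences are invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var (
	// lettersOnly is the segmenter quoted in this thread.
	lettersOnly = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}-_']+`)
	// withDigits is the same expression with \p{Nd} added.
	withDigits = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}\p{Nd}-_']+`)
)

// shingles returns the set of k-word shingles of text, using re to
// segment the words.
func shingles(re *regexp.Regexp, text string, k int) map[string]bool {
	words := re.FindAllString(strings.ToLower(text), -1)
	set := make(map[string]bool)
	for i := 0; i+k <= len(words); i++ {
		set[strings.Join(words[i:i+k], " ")] = true
	}
	return set
}

// jaccard returns |a ∩ b| / |a ∪ b|, or 0 when both sets are empty.
func jaccard(a, b map[string]bool) float64 {
	inter := 0
	for s := range a {
		if b[s] {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	x := "Profit rose 12 percent in 2015"
	y := "Profit rose 40 percent in 2009"
	fmt.Println(jaccard(shingles(lettersOnly, x, 2), shingles(lettersOnly, y, 2))) // prints 1
	fmt.Println(jaccard(shingles(withDigits, x, 2), shingles(withDigits, y, 2)))   // prints 0.25
}
```

Without digits the two sentences look identical. With digits kept, only 2 of the 8 distinct shingles are shared.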

Admittedly what I would really like is to detect nouns, not just proper nouns.

Thank you very much!