Numbers being removed?

Question

ernsheong opened this issue 6 years ago · comments

Jonathan Lin commented 6 years ago

It seems that numbers are being treated as stop words. Surely this should not be the case?

Benjamin BALET · Answer 1 · Mon Dec 25 2017 14:21:20 GMT+0800 (China Standard Time)

Short answer: It looks like they are, but it is a side effect of the word segmenter being used.

Long answer: It is not that they are stop words, but that they are not considered words at all. Here is the workflow:

1. Segment the text into words using the regexp [\pL\p{Mc}\p{Mn}-_']+. Anything that does not match (a word being a run of letters, combining marks, hyphens, underscores, or apostrophes) is ignored by the algorithm. Digits are not in that character class, so they never reach the next step.
2. Iterate over each word. Is the word a stop word? If so, eliminate it.
3. Optionally use the remaining words in an algorithm such as SimHash.

We remove stop words in order to estimate the meaning of the text. Does your question (or issue) mean that you analyze texts where numbers carry meaning? If so, you could try patching stopwords.go:28 with this small change:

unicodeWords = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}\p{Nd}-_']+`)

Then tell us whether it improves your results.
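To see what the change would do before patching anything, here is a minimal standalone sketch that runs both patterns from this thread over the same input. It only uses the standard regexp package. The segment helper and the variable names are made up for illustration and are not part of the library.

```go
package main

import (
	"fmt"
	"regexp"
)

// Both patterns are copied from the thread above.
var (
	// Current segmenter: letters, combining marks, '-', '_' and '\''.
	wordsWithoutDigits = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}-_']+`)
	// Proposed patch: additionally accept decimal digits (\p{Nd}).
	wordsWithDigits = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}\p{Nd}-_']+`)
)

// segment is a hypothetical helper (not part of the library): it returns
// every run of characters that the given pattern treats as a word.
func segment(re *regexp.Regexp, text string) []string {
	return re.FindAllString(text, -1)
}

func main() {
	text := "a 16-year-old boy"
	fmt.Printf("%q\n", segment(wordsWithoutDigits, text)) // ["a" "-year-old" "boy"]
	fmt.Printf("%q\n", segment(wordsWithDigits, text))    // ["a" "16-year-old" "boy"]
}
```

This only covers the segmentation step. Stop-word removal and any further cleanup the library performs are left out.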

Jonathan Lin · Answer 2 · Mon Dec 25 2017 15:05:44 GMT+0800 (China Standard Time)

One example: "16-year-old" is output as "-year-old", which is unexpected. I think it should be made clearer that this repo tries to be more intelligent than a simple-stupid implementation (it ignores special runes), and that this comes with caveats. But I probably won't try to change the behavior of this repo. Thanks :)

Benjamin BALET · Answer 3 · Mon Dec 25 2017 15:30:20 GMT+0800 (China Standard Time)

No need to change the main behavior. I can add a new function or a parameter and then everyone will be happy, because the added value of this repo is its curated list of stop words and its Unicode friendliness, even with Brahmic languages.

I just want to understand why some people think that including numbers can improve the estimation.

With your example, "16-year-old" is segmented into "year old". Neither of those words is a stop word. So there is no difference between the estimated meaning of "a 16-year-old boy" and "a 13-year-old boy", because both reduce to "year old boy".

Jonathan Lin · Answer 4 · Mon Dec 25 2017 15:56:00 GMT+0800 (China Standard Time)

> I just want to understand why some people think that including numbers can improve the estimation.

In my case, I am trying to use proper nouns and numbers as shingles to estimate document similarity. Hence keeping the numbers is valuable, because news articles tend to cite specifics as numbers, which improves the estimation.

Admittedly what I would really like is to detect nouns, not just proper nouns.
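To illustrate why the numbers matter for similarity, here is a rough sketch that compares the two example phrases from this thread. It makes several assumptions: it uses plain Jaccard similarity over single-token shingles rather than SimHash, it skips stop-word removal, and tokenSet and jaccard are hypothetical helpers, not functions from this repo.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied from the thread: without and with decimal digits (\p{Nd}).
var (
	withoutDigits = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}-_']+`)
	withDigits    = regexp.MustCompile(`[\pL\p{Mc}\p{Mn}\p{Nd}-_']+`)
)

// tokenSet (hypothetical helper) collects the distinct tokens matched by re.
func tokenSet(re *regexp.Regexp, text string) map[string]bool {
	set := make(map[string]bool)
	for _, tok := range re.FindAllString(text, -1) {
		set[tok] = true
	}
	return set
}

// jaccard returns |A ∩ B| / |A ∪ B|, or 0 when both sets are empty.
func jaccard(a, b map[string]bool) float64 {
	inter := 0
	for tok := range a {
		if b[tok] {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

func main() {
	x, y := "a 16-year-old boy", "a 13-year-old boy"
	// Digits dropped: both phrases yield the same tokens.
	fmt.Println(jaccard(tokenSet(withoutDigits, x), tokenSet(withoutDigits, y))) // 1
	// Digits kept: the age token differs, so the phrases are told apart.
	fmt.Println(jaccard(tokenSet(withDigits, x), tokenSet(withDigits, y))) // 0.5
}
```

With digits dropped, the two phrases are indistinguishable. With digits kept, the differing age lowers the score, which is the effect I am after for news articles.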

Jonathan Lin · Answer 5 · Tue Dec 26 2017 12:23:03 GMT+0800 (China Standard Time)

Thank you very much!