sonic fails to retrieve results on very simple queries
almosnow opened this issue · comments
I ran into this after configuring and running sonic for the first time.
Using telnet, I manually push the following data:
PUSH messages default id_1 "Some sample text number one"
PUSH messages default id_2 "Some sample text number two"
PUSH messages default id_3 "Some sample text number three"
PUSH messages default id_4 "Some sample text number four"
Then I consolidate the index, just to make sure:
TRIGGER consolidate
For all of these commands (each sent on its respective channel), I get an OK reply from the sonic process.
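For anyone who wants to reproduce this without typing into telnet, the command lines above can be generated with a small script. This is only a sketch: `escape_text`, `build_push` and `build_query` are hypothetical helper names, and the escaping rule is an assumption, so check it against the sonic protocol documentation. It only builds the strings; sending them still requires an open session on the matching channel.

```python
def escape_text(text: str) -> str:
    # Assumed escaping rule: backslash-escape backslashes and double quotes,
    # and flatten newlines so each command stays on a single line.
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", " ")


def build_push(collection: str, bucket: str, object_id: str, text: str) -> str:
    # Mirrors the PUSH lines used in this report.
    return f'PUSH {collection} {bucket} {object_id} "{escape_text(text)}"'


def build_query(collection: str, bucket: str, terms: str) -> str:
    # Mirrors the QUERY lines used in this report.
    return f'QUERY {collection} {bucket} "{escape_text(terms)}"'


samples = {
    "id_1": "Some sample text number one",
    "id_2": "Some sample text number two",
    "id_3": "Some sample text number three",
    "id_4": "Some sample text number four",
}

push_lines = [
    build_push("messages", "default", object_id, text)
    for object_id, text in samples.items()
]

for line in push_lines:
    print(line)
```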
So now, when I run some sample queries, I get results on a few words but not on others:
QUERY messages default "Some"
gives back no results, which is wrong.
QUERY messages default "sample"
gives back id_4 id_3 id_2 id_1, which is correct.
QUERY messages default "text"
gives back no results, which is wrong.
QUERY messages default "number"
gives back no results, which is wrong.
QUERY messages default "one"
gives back no results, which is wrong.
QUERY messages default "two"
gives back no results, which is wrong.
QUERY messages default "three"
gives back no results, which is wrong.
QUERY messages default "four"
gives back no results, which is wrong.
What is going on? Is this a bug or am I doing something wrong?
Thanks.
Some, text, number, one, two, three and four are stop words in sonic's English list, so sonic ignores them and they cannot be matched.
You can see the whole list here: https://github.com/valeriansaliou/sonic/blob/master/src/stopwords/eng.rs
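To illustrate why only "sample" matches, here is a minimal sketch of stop word filtering in front of a toy inverted index. It is not sonic's actual lexer or storage: the tokenizer simply splits on whitespace, the function names are made up, and `STOPWORDS_EXCERPT` holds only the words named in this thread, not the full list.

```python
# Only the words this thread identifies as stop words; the real list is longer.
STOPWORDS_EXCERPT = {
    "some", "text", "number", "one", "two", "three", "four", "computer",
}


def tokenize(text, stopwords=STOPWORDS_EXCERPT):
    """Lowercase, split on whitespace, and drop stop words (simplified)."""
    return [word for word in text.lower().split() if word not in stopwords]


def build_index(docs, stopwords=STOPWORDS_EXCERPT):
    """Map each surviving term to the list of object ids that contain it."""
    index = {}
    for object_id, text in docs.items():
        for term in tokenize(text, stopwords):
            ids = index.setdefault(term, [])
            if object_id not in ids:
                ids.append(object_id)
    return index


def query(index, terms, stopwords=STOPWORDS_EXCERPT):
    """Return ids matching every surviving term; no surviving term, no result."""
    tokens = tokenize(terms, stopwords)
    if not tokens:
        return []
    matches = list(index.get(tokens[0], []))
    for term in tokens[1:]:
        matches = [i for i in matches if i in index.get(term, [])]
    return matches


docs = {
    "id_1": "Some sample text number one",
    "id_2": "Some sample text number two",
    "id_3": "Some sample text number three",
    "id_4": "Some sample text number four",
}
index = build_index(docs)

for word in ["Some", "sample", "text", "number", "one"]:
    print(word, query(index, word))
```

In this toy model every word except "sample" is removed both when the text is indexed and when the query is parsed, so those queries have nothing left to match.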
Whoa, thanks @pleshevskiy, I missed that.
From what I see, that's a very broad list of stopwords. I found this out while debugging why a query with 'computer' was not retrieving a specific record. Now I see that 'computer' is a stop word as well.
Is there a way to make sonic use a specific list of stopwords? I could change it and recompile, of course, but maybe there's already a flag or something.
Yes, that's possible! The stopwords are listed here: https://github.com/valeriansaliou/sonic/tree/master/src/stopwords
Weirdly, computer is listed as one for English: https://github.com/valeriansaliou/sonic/blob/master/src/stopwords/eng.rs#L210
Hi @valeriansaliou! Thank you for sonic, it is a great tool.
Could you tell us a bit more about why you chose those stopwords?
I plan to use the one from MySQL's MyISAM engine, which has worked fine for me in the past. See bottom of this page: https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html
@almosnow I extracted them from the stopword lists of existing stopwords libraries, so it's a bit weird that this one is considered a stopword. I'm sure there are more non-stopwords in there, so I am accepting PRs from anyone who'd want to filter out those non-stopwords :)
I'm willing to help with that, but what criteria should we use to discriminate stop words? For instance, I wouldn't consider 'number' a stop word, but I can see many scenarios where it could be one.
Another alternative would be to ship several different lists and let the user choose between them, similar to how LANG works.
For reference, I found this: https://github.com/igorbrigadir/stopwords
I'm working on this for a different search, and I think it works best if there is a base list like Google uses, and then a function to add more words (or translate them). That way each environment can eliminate the words that are common for them.
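A rough sketch of that idea, assuming a small base list plus per-environment additions and exemptions. Everything here is hypothetical: `make_stopword_filter` is a made-up helper, and `BASE_STOPWORDS` is an illustrative handful of words, not sonic's list or any search engine's.

```python
# Illustrative base list only; a real deployment would load a curated one.
BASE_STOPWORDS = {"a", "an", "and", "of", "the"}


def make_stopword_filter(base, extra=(), allow=()):
    """Build a predicate from a base list, extra words to add, and words to exempt."""
    words = {w.lower() for w in base} | {w.lower() for w in extra}
    words -= {w.lower() for w in allow}

    def is_stopword(token):
        return token.lower() in words

    return is_stopword


# A chat application might treat "message" as noise but want "the" searchable.
is_stop = make_stopword_filter(BASE_STOPWORDS, extra={"message"}, allow={"the"})

tokens = "The number of the message".split()
kept = [t for t in tokens if not is_stop(t)]
print(kept)
```

Keeping the base list small and pushing domain-specific words into `extra` means a word like 'number' stays searchable unless a given environment explicitly opts to drop it.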
Thank you, forgot to close it. Best to all!