valeriansaliou / sonic

🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

Home Page: https://crates.io/crates/sonic-server

sonic fails to retrieve results on very simple queries

almosnow opened this issue · comments

I ran into this after configuring and running sonic for the first time.

Using telnet, I manually push the following data:

PUSH messages default id_1 "Some sample text number one"
PUSH messages default id_2 "Some sample text number two"
PUSH messages default id_3 "Some sample text number three"
PUSH messages default id_4 "Some sample text number four"

Then I consolidate the index, just to make sure:
TRIGGER consolidate

On all these commands (using their respective channel), I get an OK reply from the sonic process.
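For anyone scripting this instead of typing into telnet, here is a minimal sketch that only builds the command lines shown above. The helper names (`quote`, `push_command`, `query_command`) are made up for illustration, and the backslash escaping of embedded quotes is an assumption about the wire format, not something confirmed in this thread. It does not open a connection or authenticate.

```python
def quote(text: str) -> str:
    """Wrap text in double quotes.

    Assumption: embedded backslashes and double quotes are backslash-escaped.
    """
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def push_command(collection: str, bucket: str, object_id: str, text: str) -> str:
    """Build a PUSH line in the same shape as the ones typed by hand above."""
    return f"PUSH {collection} {bucket} {object_id} {quote(text)}"


def query_command(collection: str, bucket: str, terms: str) -> str:
    """Build a QUERY line in the same shape as the ones typed by hand above."""
    return f"QUERY {collection} {bucket} {quote(terms)}"


# The four records from this report.
messages = {
    "id_1": "Some sample text number one",
    "id_2": "Some sample text number two",
    "id_3": "Some sample text number three",
    "id_4": "Some sample text number four",
}

push_lines = [
    push_command("messages", "default", object_id, text)
    for object_id, text in messages.items()
]

for line in push_lines:
    print(line)
print(query_command("messages", "default", "sample"))
```

The printed lines can be pasted into the matching channel session; PUSH and QUERY belong to different channels, as noted above.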

So now, when I run some sample queries, I get results on a few words but not on others:

QUERY messages default "Some" gives back no results, which is wrong ❌
QUERY messages default "sample" gives back id_4 id_3 id_2 id_1, which is correct ✅
QUERY messages default "text" gives back no results, which is wrong ❌
QUERY messages default "number" gives back no results, which is wrong ❌
QUERY messages default "one" gives back no results, which is wrong ❌
QUERY messages default "two" gives back no results, which is wrong ❌
QUERY messages default "three" gives back no results, which is wrong ❌
QUERY messages default "four" gives back no results, which is wrong ❌

What is going on? Is this a bug or am I doing something wrong?

Thanks.

"Some", "text", "number", "one", "two", "three" and "four" are all stop words in English, so they are dropped and never matched.
You can see the whole list here: https://github.com/valeriansaliou/sonic/blob/master/src/stopwords/eng.rs
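To make the effect concrete, here is a toy inverted index that drops stop words both when indexing and when querying. It is not how sonic is implemented; it is only a sketch, and `STOPWORDS` is a hypothetical subset containing just the words named in this thread.

```python
from collections import defaultdict

# Hypothetical subset of an English stop-word list: only the words from this thread.
STOPWORDS = {"some", "text", "number", "one", "two", "three", "four"}

# term -> set of object ids containing that term
index = defaultdict(set)


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


def push(object_id, text):
    """Index every non-stop-word term of the text under the given id."""
    for term in tokenize(text):
        index[term].add(object_id)


def query(terms):
    """Return ids matching all non-stop-word terms; empty if none survive filtering."""
    tokens = tokenize(terms)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)


push("id_1", "Some sample text number one")
push("id_2", "Some sample text number two")
push("id_3", "Some sample text number three")
push("id_4", "Some sample text number four")

print(query("sample"))  # all four ids
print(query("number"))  # empty list: "number" was filtered out as a stop word
```

With this list, "sample" is the only term that ever reaches the index, which matches the behaviour reported above.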

Whoa, thanks @pleshevskiy, I missed that.

From what I see, that's a very broad list of stopwords. I found this out while debugging why a query with 'computer' was not retrieving a specific record. Now I see that 'computer' is a stop word as well.

Is there a way to make sonic use a specific list of stopwords? I could change it and recompile, of course, but maybe there's already a flag or something.

Hi @valeriansaliou! Thank you for sonic, it is a great tool.

Could you tell us a bit more about why you chose those stopwords?

I plan to use the one from MySQL's MyISAM engine, which has worked fine for me in the past. See bottom of this page: https://dev.mysql.com/doc/refman/8.0/en/fulltext-stopwords.html

@almosnow I extracted them from the stopword lists of existing stopword libraries, so it's a bit weird that this word is considered one. I'm sure there are more non-stopwords in there, so I am accepting PRs from anyone who'd want to filter them out :)

I'm willing to help with that, but what criteria should we use to discriminate stop words? For instance, I wouldn't consider 'number' a stop word, but I can see many scenarios where it could be one.

Another alternative would be to encode different lists and let the user choose them in a similar way as LANG.
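A rough sketch of that idea: several named lists, with the caller picking one by key. The list names (`none`, `minimal`, `broad`) and their contents are invented for illustration; sonic has no such option as far as this thread shows.

```python
# Hypothetical named stop-word lists; the names and contents are illustrative only.
STOPWORD_LISTS = {
    "none": frozenset(),
    "minimal": frozenset({"a", "an", "and", "of", "the", "to"}),
    "broad": frozenset(
        {"a", "an", "and", "of", "the", "to", "some", "text", "number", "one"}
    ),
}


def filter_terms(text, list_name="minimal"):
    """Split text into lowercase terms and drop those in the chosen list."""
    try:
        stopwords = STOPWORD_LISTS[list_name]
    except KeyError:
        raise ValueError(f"unknown stop-word list: {list_name!r}") from None
    return [word for word in text.lower().split() if word not in stopwords]


sample = "Some sample text number one"
print(filter_terms(sample, "broad"))    # only "sample" survives
print(filter_terms(sample, "minimal"))  # all five words survive
```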

For reference, I found this: https://github.com/igorbrigadir/stopwords

I'm working on this for a different search, and I think it works best if there is a base list like Google uses, and then a function to add more words (or translate them). That way each environment can eliminate the words that are common for them.
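One way to picture the base-list approach is a small helper that starts from a shared base set and lets each deployment add or exempt words. `BASE_STOPWORDS`, `build_stopwords`, and the `extra`/`keep` parameters are hypothetical names, and the base words are placeholders rather than any real published list.

```python
# Placeholder base list; a real deployment would load a vetted list instead.
BASE_STOPWORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def build_stopwords(base=BASE_STOPWORDS, extra=(), keep=()):
    """Return the base list plus `extra` words, minus any words in `keep`."""
    extra_words = {word.lower() for word in extra}
    kept_words = {word.lower() for word in keep}
    return frozenset((set(base) | extra_words) - kept_words)


# Example: a support desk where "ticket" appears in nearly every record,
# but "to" is meaningful and should stay searchable.
support_desk = build_stopwords(extra=["ticket"], keep=["to"])

print(sorted(support_desk))
```

The base set is never mutated, so every environment derives its own list without affecting the others.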

Thank you, forgot to close it. Best to all!