algolia / npm-search

🗿 npm ↔️ Algolia replication tool :skier: :snail: :artificial_satellite:

Consider lowering the importance of "proximity" in ranking

MartinKolarik opened this issue

I just came across some very unintuitive behavior while searching for the package bootstrap-vue. I didn't remember the name exactly and searched for vue bootstrap instead: https://www.jsdelivr.com/?query=vue%20bootstrap

The query didn't match the package name because of the word order. Still, the package has "vue" and "bootstrap" among its keywords and is very popular, so I'd expect it to be the top result. Unfortunately, it seems the engine treats the array of keywords much like text, so their order and proximity still play a role. Instead of the correct result, I got pages and pages of garbage that happened to have the keywords in the "right" order.

Looking at the config options, I see this could be fixed by swapping the priority of "attribute" and "proximity" in ranking, and at first sight, this makes sense to me. Our searchable attributes are:

  • popular name,
  • name, description, keywords,
  • owner name,
  • owners name.
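The swap described above can be sketched as a settings payload. This is a minimal sketch that assumes the index starts from Algolia's documented default `ranking` order; the actual npm-search configuration may differ. The helper `swap_criteria` is a hypothetical name used only for illustration, and sending the result to the index through an API client is left out.

```python
# Algolia's documented default order of ranking criteria (assumed starting point;
# the real npm-search index may already use a different order).
DEFAULT_RANKING = [
    "typo", "geo", "words", "filters",
    "proximity", "attribute", "exact", "custom",
]


def swap_criteria(ranking, a, b):
    """Return a copy of `ranking` with criteria `a` and `b` exchanged (hypothetical helper)."""
    swapped = list(ranking)
    i, j = swapped.index(a), swapped.index(b)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


# Settings payload in which "attribute" is evaluated before "proximity".
settings = {"ranking": swap_criteria(DEFAULT_RANKING, "proximity", "attribute")}
print(settings["ranking"])
```

Only the relative order of the two criteria changes; everything else in the ranking stays where it was.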

Package names are very short, and hitting more of the correct words should be more important than hitting them in the right order. For keywords, this is absolutely the case. For the description, I'm not sure which one makes more sense, but even if it were proximity, I think good matching on names and keywords is more important.

We should definitely test this more before making any changes, but I'm putting it here as an idea. @pixelastic @Haroenv what do you think? Can you think of any cases this would make worse? I already made this change on the index npm-search-dev-martin for testing.
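To illustrate why the order of the two criteria matters, here is a toy model of tie-breaking ranking: hits are compared on the first criterion, and later criteria only break ties. The numbers and the package name `vue-bootstrap-example` are made up for illustration, and this is not Algolia's actual implementation.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    proximity: int  # lower is better (query words closer together, in order)
    attribute: int  # index of the best matching searchable attribute, lower is better


def rank(hits, criteria):
    """Sort hits by the given criteria in order; later criteria only break ties."""
    return sorted(hits, key=lambda h: tuple(getattr(h, c) for c in criteria))


# Hypothetical values: the first package has the words in the "right" order in a
# lower-priority attribute, the second matches the top-priority attribute in reverse order.
hits = [
    Hit("vue-bootstrap-example", proximity=1, attribute=1),
    Hit("bootstrap-vue", proximity=2, attribute=0),
]

before = [h.name for h in rank(hits, ["proximity", "attribute"])]
after = [h.name for h in rank(hits, ["attribute", "proximity"])]
print(before)
print(after)
```

With proximity first, the better-ordered but less relevant hit wins; with attribute first, the hit in the higher-priority attribute wins and proximity only decides between hits in the same attribute.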

It seems like the results in your index are usually better, but it's indeed hard to test. We should probably list some popular queries that don't match a popular package name verbatim and test those.

Note that even after this change, proximity might have more importance than it deserves.

Consider the query "themes bootstrap", for which the most relevant result would probably be bootswatch. Its description reads "Bootswatch is a collection of themes for Bootstrap.", so because of the extra "for" it has proximity 2 and gets pushed to 4th place, below fairly unpopular packages.

Even worse, if you happen to search for "bootstrap themes", the computed proximity is 3 (reverse order) and bootswatch ends up in position 22, behind even some deprecated packages.

Changing minProximity might make sense here: proximity certainly has some value, but not that much, since we have a lot of custom ranking attributes as well. I'd say 3 is an absolute minimum for simple cases like this, but it might very well be even 5 ("can be together anywhere in a short sentence").
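The proximity numbers above can be reproduced with a toy model. This is a sketch under assumptions, not Algolia's real algorithm: proximity is taken as the word distance between the two query words, plus 1 when they appear in reverse order (matching the values observed above), and `minProximity` is modelled as a floor below which all proximities are treated as equal. The function names are hypothetical.

```python
import re


def toy_proximity(query, text):
    """Word distance between the two query words; +1 if they occur in reverse order."""
    words = re.findall(r"\w+", text.lower())
    first, second = query.lower().split()
    i, j = words.index(first), words.index(second)
    distance = abs(j - i)
    return distance + 1 if j < i else distance


def effective_proximity(proximity, min_proximity=1):
    """Model of minProximity: any proximity below the floor counts as the floor."""
    return max(proximity, min_proximity)


description = "Bootswatch is a collection of themes for Bootstrap."

in_order = toy_proximity("themes bootstrap", description)  # 2
reversed_order = toy_proximity("bootstrap themes", description)  # 3

# With a floor of 3, both queries get the same proximity score, so other
# criteria (such as custom ranking) decide the order instead.
print(effective_proximity(in_order, 3), effective_proximity(reversed_order, 3))
```

Under this model a floor of 3 is enough to make both word orders tie for this description, while a floor of 5 would also cover words that sit further apart in a short sentence.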

After testing this a bit more on random multi-word package names, I don't see any obvious downside here, and it makes a big difference when you query multi-word package names with non-exact names, e.g.:

[image]

Considering the settings can be changed in a matter of seconds as needed, I'd say let's change this and readjust later if we find any issues.

Let's do it!