fts_fuzzy_match: Should consecutive matches and full word matches have a higher bonus?

Question

dogoku opened this issue 7 years ago

I really enjoyed your article on Sublime's fuzzy search and appreciate your efforts to recreate what seems to be some kind of magic to me.

After playing around with your live demo, I noticed that consecutive matches are weighted less than first-character matches. E.g.:

Search for cold in Hearthstone cards

23 - Cone of Cold
20 - Cold Blood
20 - Coldarra Drake
15 - Coldlight Seer
13 - Coldlight Oracle
4 - Cobalt Guardian
-21 - Ancestral Knowledge

In comparison, when searching in Sublime, full-word or consecutive matches rank higher than first-letter matches. E.g.:

Search for node in a node.js project in Sublime

You can see the top results are full-word matches, and the shortest paths seem to be weighted more. It takes something like 20 results for the first non-full match to appear (255).

Perhaps this comes down to the scale of weighting you are using. From the screenshot, we can see Sublime's scores are in the 200+ region, which allows for a larger spread of scores.

Anyway, it's fun to think about nonetheless.

Forrest Smith · Answer 1 · Sat Jan 07 2017 17:58:27 GMT+0800 (China Standard Time)

That's an interesting thought. Sublime focuses pretty heavily on matching first characters of "words". So that was my initial focus.

Identifying "words" in a string could be useful. I'd also thought about growing the adjacency bonus for each additional consecutive match. It's been a while, so I don't recall whether I actually tried it.
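
The idea of growing the adjacency bonus with each additional consecutive match could look roughly like the C++ sketch below. It is not the fts_fuzzy_match implementation: the function name growing_adjacency_score and the weight kAdjacencyStep are hypothetical, it matches greedily from left to right, and it ignores every other bonus and penalty.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical weight chosen for illustration; not taken from fts_fuzzy_match.
constexpr int kAdjacencyStep = 5;

// Case-insensitive comparison of two characters.
inline bool same_letter(char a, char b) {
    return std::tolower(static_cast<unsigned char>(a)) ==
           std::tolower(static_cast<unsigned char>(b));
}

// Greedy left-to-right subsequence match.
// Returns false if pattern is not a subsequence of text; otherwise writes
// the score to out_score. Each match that extends a consecutive run earns
// kAdjacencyStep times the current run length, so longer runs pay more.
bool growing_adjacency_score(const std::string& pattern,
                             const std::string& text,
                             int& out_score) {
    int score = 0;
    int run = 0;          // matches in the current consecutive run
    std::size_t p = 0;    // next pattern character to match

    for (std::size_t t = 0; t < text.size() && p < pattern.size(); ++t) {
        if (same_letter(pattern[p], text[t])) {
            score += kAdjacencyStep * run;  // 0, then 5, 10, 15, ...
            ++run;
            ++p;
        } else {
            run = 0;      // a gap resets the run
        }
    }

    if (p != pattern.size()) {
        return false;
    }
    out_score = score;
    return true;
}
```

With these assumed weights, "cold" against "Cold Blood" scores 0 + 5 + 10 + 15 = 30, while "Cobalt Guardian" scores only 5 for the adjacent "Co" pair.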

Sublime author Jon Skinner responded to the Reddit thread about my fts_fuzzy_match blog post. He pointed out that I match "lll" (those are L's btw) quite poorly with my UE3 sample data. Sublime does a more comprehensive search over all possible ways to match a pattern to an input string and returns the highest-scoring one.

At some point I'd like to revisit this code and add support for comprehensive matching. Will have to do some good benchmarking to see how much slower it is.
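
The comprehensive matching described above, where every way of aligning the pattern with the string is scored and the best one wins, could be sketched as a plain recursion. This is an illustration and not Sublime's or fts_fuzzy_match's code: exhaustive_match, best_from, and both bonus values are hypothetical. The recursion is exponential in the worst case, which is exactly the slowdown that would need benchmarking.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical weights chosen for illustration.
constexpr int kAdjacentBonus = 5;    // match directly follows the previous match
constexpr int kWordStartBonus = 10;  // match is at index 0 or follows ' ' or '_'
constexpr int kNoMatch = -1000000;   // sentinel: no alignment exists

inline bool same_letter_ci(char a, char b) {
    return std::tolower(static_cast<unsigned char>(a)) ==
           std::tolower(static_cast<unsigned char>(b));
}

// Best score for matching pattern[p..] inside text[t..].
// prev_adjacent is true when text[t-1] was consumed by pattern[p-1].
int best_from(const std::string& pattern, std::size_t p,
              const std::string& text, std::size_t t,
              bool prev_adjacent) {
    if (p == pattern.size()) return 0;       // whole pattern matched
    if (t == text.size()) return kNoMatch;   // ran out of text

    // Option 1: skip text[t].
    int best = best_from(pattern, p, text, t + 1, false);

    // Option 2: consume text[t] if it matches pattern[p].
    if (same_letter_ci(pattern[p], text[t])) {
        int rest = best_from(pattern, p + 1, text, t + 1, true);
        if (rest != kNoMatch) {
            int here = 0;
            if (prev_adjacent) here += kAdjacentBonus;
            if (t == 0 || text[t - 1] == ' ' || text[t - 1] == '_') {
                here += kWordStartBonus;
            }
            best = std::max(best, here + rest);
        }
    }
    return best;
}

// Returns false if pattern is not a subsequence of text; otherwise writes
// the highest score over all alignments to out_score.
bool exhaustive_match(const std::string& pattern, const std::string& text,
                      int& out_score) {
    int best = best_from(pattern, 0, text, 0, false);
    if (best == kNoMatch) return false;
    out_score = best;
    return true;
}
```

With these assumed weights, "lll" against "level_load_lighting" scores 30 by taking the first letter of each word. A greedy left-to-right pass would take the l's at indices 0, 4, and 6 and score only 20.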

I'm also somewhat of the opinion that there's no such thing as a "perfect" fuzzy match score system. It depends on your use case. Matching filenames might want different scores than card names. Searching log files might want something different still.

But I totally agree I could do better. Your example is very useful. I'm gonna keep this issue open and maybe some day in the future come back to it. I certainly hope so! :)

Deleted user · Answer 2 · Tue Dec 05 2017 02:53:50 GMT+0800 (China Standard Time)

Why not leave the scoring to the user? You could define an enum for the "type" of match, e.g. FULL_WORD, TRANSPOSE, etc., and then provide #defines to let the user define additional score weights.

#define FULL_WORD_SCORE 1.05

Something like that?
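
A minimal C++ sketch of that suggestion follows. Only FULL_WORD_SCORE and its 1.05 value come from the comment. DEFAULT_MATCH_SCORE, the OTHER enumerator, the helper names, and the choice to treat each weight as a multiplier on a base score are assumptions made for illustration. TRANSPOSE is left out to keep the example short.

```cpp
#include <cctype>
#include <sstream>
#include <string>

// The #ifndef guards let a user override a weight by defining the macro
// before this code is compiled (for example with a -D compiler flag).
#ifndef FULL_WORD_SCORE
#define FULL_WORD_SCORE 1.05      // value suggested in the comment
#endif
#ifndef DEFAULT_MATCH_SCORE
#define DEFAULT_MATCH_SCORE 1.0   // hypothetical: no adjustment
#endif

enum class MatchType { FULL_WORD, OTHER };

inline std::string to_lower(std::string s) {
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// FULL_WORD when pattern equals a whitespace-delimited word of text,
// ignoring case; OTHER for everything else.
inline MatchType classify(const std::string& pattern, const std::string& text) {
    const std::string needle = to_lower(pattern);
    std::istringstream words(to_lower(text));
    std::string word;
    while (words >> word) {
        if (word == needle) return MatchType::FULL_WORD;
    }
    return MatchType::OTHER;
}

// Assumption: each weight is a multiplier applied to a base score.
inline double weight_for(MatchType type) {
    switch (type) {
        case MatchType::FULL_WORD: return FULL_WORD_SCORE;
        case MatchType::OTHER:     return DEFAULT_MATCH_SCORE;
    }
    return DEFAULT_MATCH_SCORE;
}

inline double weighted_score(int base_score, MatchType type) {
    return base_score * weight_for(type);
}
```

Here "cold" against "Cone of Cold" classifies as FULL_WORD, so a base score of 20 becomes roughly 21, while "Coldlight Seer" classifies as OTHER and stays at 20.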