forrestthewoods / lib_fts

single-file public domain libraries

fts_fuzzy_match: Should consecutive matches and full word matches have a higher bonus?

dogoku opened this issue

I really enjoyed your article on Sublime's fuzzy search and appreciate your efforts to recreate what seems to be some kind of magic to me.

After playing around on your live demo, I noticed that consecutive matches were weighted less than first-character matches. E.g.:

Search for cold in Hearthstone cards

  • 23 - Cone of Cold
  • 20 - Cold Blood
  • 20 - Coldarra Drake
  • 15 - Coldlight Seer
  • 13 - Coldlight Oracle
  • 4 - Cobalt Guardian
  • -21 - Ancestral Knowledge

In comparison, when searching in Sublime, full-word or consecutive matches rank higher than first-letter matches. E.g.:

Search for node in a node.js project in Sublime

[Image: screen shot 2017-01-06 at 02 36 11]

You can see the top results are full-word matches, and the shortest paths seem to be weighted more. It takes something like 20 results for the first non-full match to appear (255).

[Image: screen shot 2017-01-06 at 02 37 23]

Perhaps this comes down to the scale of weighting you are using. From the screenshot, we can see Sublime's scores are in the 200+ region, which allows for a larger spread of scores.

Anyway, it's fun to think about nonetheless.

That's an interesting thought. Sublime focuses pretty heavily on matching first characters of "words". So that was my initial focus.

Identifying "words" in a string could be useful. I'd also thought about growing the adjacency bonus for each additional match. It's been a while, so I don't recall if I actually tried it or not.
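The growing adjacency bonus mentioned above could look roughly like the sketch below. It is not code from fts_fuzzy_match: the function name `growingAdjacencyBonus` and both constants are hypothetical, and the match is a plain greedy left-to-right scan that only computes the adjacency part of a score.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical weights, chosen only for illustration.
constexpr int kBaseAdjacencyBonus = 5; // bonus for the first adjacent pair in a run
constexpr int kAdjacencyGrowth = 3;    // extra bonus for each further match in the same run

// Greedy, case-insensitive subsequence match.
// Returns false if `pattern` is not a subsequence of `text`.
// On success, `outBonus` holds the summed adjacency bonus.
bool growingAdjacencyBonus(const std::string& pattern, const std::string& text, int& outBonus) {
    outBonus = 0;
    std::size_t p = 0; // index of the next pattern character to match
    int run = 0;       // length of the current run of consecutive matches
    for (std::size_t t = 0; t < text.size() && p < pattern.size(); ++t) {
        const int a = std::tolower(static_cast<unsigned char>(text[t]));
        const int b = std::tolower(static_cast<unsigned char>(pattern[p]));
        if (a == b) {
            if (run > 0) {
                // Each additional consecutive match is worth more than the last.
                outBonus += kBaseAdjacencyBonus + kAdjacencyGrowth * (run - 1);
            }
            ++run;
            ++p;
        } else {
            run = 0; // the run is broken by an unmatched character
        }
    }
    return p == pattern.size();
}
```

With these made-up weights, "cold" against "Cold Blood" earns 5 + 8 + 11 = 24, while against "Cobalt Guardian" it earns only 5, so an unbroken run pulls ahead much faster than it would with a flat per-pair bonus.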

Sublime author Jon Skinner responded to my Reddit thread on my blog post associated with fts_fuzzy_match. He pointed out that I match "lll" (those are L's btw) quite poorly with my UE3 sample data. Sublime does a more comprehensive search over all possible ways to match a pattern to an input string and returns the highest-scoring one.
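A comprehensive search like that can be sketched with dynamic programming, as below. This is neither Sublime's algorithm nor the library's code: `bestMatchScore`, `kAdjacencyBonus`, `kWordStartBonus`, and the scoring rules are all assumptions made for the example. It considers every alignment of the pattern against the text and keeps the best score, in time proportional to pattern length times text length.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <limits>
#include <optional>
#include <string>
#include <vector>

// Hypothetical weights, chosen only for illustration.
constexpr int kAdjacencyBonus = 5;  // match directly follows the previous match
constexpr int kWordStartBonus = 10; // match sits on the first character of a "word"

static bool sameLetter(char a, char b) {
    return std::tolower(static_cast<unsigned char>(a)) ==
           std::tolower(static_cast<unsigned char>(b));
}

static bool isWordStart(const std::string& text, std::size_t j) {
    return j == 0 || !std::isalnum(static_cast<unsigned char>(text[j - 1]));
}

// Best score over ALL ways to match `pattern` as a subsequence of `text`,
// or std::nullopt if there is no match at all.
std::optional<int> bestMatchScore(const std::string& pattern, const std::string& text) {
    const std::size_t m = pattern.size(), n = text.size();
    if (m == 0) return 0;
    if (m > n) return std::nullopt;

    constexpr int kNone = std::numeric_limits<int>::min(); // "no match ends here"
    // prev[j]: best score with the previous pattern character matched at text[j].
    std::vector<int> prev(n, kNone), cur(n, kNone);

    for (std::size_t i = 0; i < m; ++i) {
        std::fill(cur.begin(), cur.end(), kNone);
        int runningMax = kNone; // max of prev[0..j-1]
        for (std::size_t j = 0; j < n; ++j) {
            if (sameLetter(pattern[i], text[j])) {
                int base = kNone;
                if (i == 0) {
                    base = 0;
                } else {
                    base = runningMax;
                    if (j > 0 && prev[j - 1] != kNone) {
                        base = std::max(base, prev[j - 1] + kAdjacencyBonus);
                    }
                }
                if (base != kNone) {
                    cur[j] = base + (isWordStart(text, j) ? kWordStartBonus : 0);
                }
            }
            runningMax = std::max(runningMax, prev[j]);
        }
        std::swap(prev, cur);
    }

    const int best = *std::max_element(prev.begin(), prev.end());
    if (best == kNone) return std::nullopt;
    return best;
}
```

Under these assumed weights, "cold" against "Cone of Cold" scores 25, because the search is free to pick the trailing "Cold". A greedy left-to-right scan would lock onto the leading "Co" and reach only 20 under the same rules.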

At some point I'd like to revisit this code and add support for comprehensive matching. Will have to do some good benchmarking to see how much slower it is.

I'm also somewhat of the opinion that there's no such thing as a "perfect" fuzzy match score system. It depends on your use case. Matching filenames might want different scores than card names. Searching log files might want something different still.

But I totally agree I could do better. Your example is very useful. I'm gonna keep this issue open and maybe some day in the future come back to it. I certainly hope so! :)

Why not leave the scoring to the user? You could define an enum for the "type" of match, e.g. FULL_WORD, TRANSPOSE etc., and then provide #defines to let the user set the score weight for each type.

#define FULL_WORD_SCORE 1.05

Something like that?
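A minimal sketch of that idea, assuming the weights act as multipliers on a base score. Only `FULL_WORD_SCORE` and the `FULL_WORD` / `TRANSPOSE` names come from the suggestion above. The other macros, their default values, the `CONSECUTIVE` and `PLAIN` enumerators, and the helper functions are hypothetical. A user would override a weight by defining the macro before including the header.

```cpp
// Each weight can be overridden by defining the macro before this point.
#ifndef FULL_WORD_SCORE
#define FULL_WORD_SCORE 1.05
#endif
#ifndef CONSECUTIVE_SCORE
#define CONSECUTIVE_SCORE 1.02 // hypothetical default
#endif
#ifndef TRANSPOSE_SCORE
#define TRANSPOSE_SCORE 0.90   // hypothetical default
#endif

// The "type" of a match, as classified by the matcher.
enum class MatchType { FULL_WORD, CONSECUTIVE, TRANSPOSE, PLAIN };

// Maps a match type to its user-configurable weight.
constexpr double weightFor(MatchType type) {
    switch (type) {
        case MatchType::FULL_WORD:   return FULL_WORD_SCORE;
        case MatchType::CONSECUTIVE: return CONSECUTIVE_SCORE;
        case MatchType::TRANSPOSE:   return TRANSPOSE_SCORE;
        case MatchType::PLAIN:       return 1.0;
    }
    return 1.0;
}

// Scales a base score by the weight for the given match type.
constexpr double applyWeight(double baseScore, MatchType type) {
    return baseScore * weightFor(type);
}
```

Compile-time macros keep a single-header library free of configuration plumbing, at the cost of one fixed weighting per build. A struct of weights passed at call time would be the alternative if different searches in the same program need different scoring.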