fuzzy file matching is too fuzzy

Question

fuzzy file matching is too fuzzy

nocash opened this issue 9 years ago · comments

Looks like improvements are currently under development in atom/fuzzaldrin#22

Jean Christophe Roy · Answer 1 · Sun Jul 19 2015 02:53:15 GMT+0800 (China Standard Time)

That's interesting.

Why fuzzaldrin fail ?
It zoom on the filename basically doing score = score(fullpath) + score(filename).
It also bail out (return 0) when it does not find some chars.
Because it cannot find "model" in the zoomed "user.rb" that part get 0.

Will new version solve it ?
Good news is that dynamic programming support errors. I do the bail out thing to be compatible, and because It bring a lot of speed. But I do it on the full string.

Fullpath clearly score more in your "Best Result" mark.

Zommed in, it's less clear.
Wrong result contain the whole "user" part, as well as "mode" which is requested in the query.
Correct result start sooner and is shorter. However size & position does not weight for a lot. (This is done to support longer string where query is acronym)

All in all I believe it's a matter of the preference for file name or full path.
I might add a test about this as a remainder people use fullpath too.
Because now pretty much every test is "prefer basename", except for the case of exact match which does not cover this.

Aaron Jensen · Answer 2 · Sun Jul 19 2015 03:03:54 GMT+0800 (China Standard Time)

@jeancroy the problem is that the algorithm (and it sounds like your new optimizations) do not take into account word boundaries or match length. That's very important for fuzzy matching in my experience.

See the example here: https://github.com/garybernhardt/selecta#theory-of-operation
and the algorithm that this uses: https://github.com/JazzCore/ctrlp-cmatcher

Basically, the length of the match of app/models/user.rb when word boundaries are taken into account is as low as it can be (modeluser = 9) whereas the match length of the winning option is massive:

drop_moderator_column_on_users.rb
.....1234567890123.......4567.... -> 17

app/models/user.rb
....12345..6789... -> 9

By taking into account word boundaries and minimum match length, app/models/user is clearly better.

Aaron Jensen · Answer 3 · Sun Jul 19 2015 03:17:08 GMT+0800 (China Standard Time)

By the way, instead of considering base name vs full path, you might consider giving preference to shorter paths and matches closer to the end of the path. I'm not sure how selecta and ctrl-p-cmatcher do it, but they do it well

Jean Christophe Roy · Answer 4 · Sun Jul 19 2015 03:17:16 GMT+0800 (China Standard Time)

@aaronjensen added test, passed without modification 👍

Aaron Jensen · Answer 5 · Sun Jul 19 2015 03:19:17 GMT+0800 (China Standard Time)

awesome! hope to see it get merged in soon :) Thanks.

Jean Christophe Roy · Answer 6 · Sun Jul 19 2015 22:13:19 GMT+0800 (China Standard Time)

Because you where kind enough to share idea, I'll develop a bit more.

See that use case ?
atom/fuzzy-finder#57 (A)
Somehow ImportanceTableCtrl should be prefered to switch.css when searching for itc
So that put some limit to the size-related penalty.

Another one:
atom/fuzzaldrin#17 (B)
diagonals should be prefered to Diagonal when searching for diag
Here a single character case 'D' vs 'd', should win, even if there's one extra s at the end of the word.

Those expectations kind of puts haystack size at the role of tiebreaker.
Why would we want to penalize large string ? Because the sheer size of them allow to do accidental matches. However when they do so, it tend to be isolated character, in the middle of nowhere, and that's what we are going to detect ! We can do so by counting the number of consecutive match (grouped character) and giving large bonus for that.

However in this case "moderator" is almost "model". Also my script would have attached the "u" to "user" instead of "columns". Final both match landmark character (word boundary) at "m" and "u". So I cannot guarantee that large string will be scored like garbage (because it's not garbage) and a tie breaker is exactly what you need.

Another interesting fact is that I find consecutive character very intuitive and use it to resolve otherwise contradictory cases. How do i know the lowercase "i" of "itc" should prefer uppercase "I" of "Importance" while lowercase "d" of "diag" should reject uppercase "D" of "Diagonal" in favor of "diagonal" ? Well "diag" score consecutive point in the actual word, while "itc" score consecutive in the acronym of the word ! Where the consecutive are, control the affinity for exact case vs acronym camelCase. It also ensure that if a large string accidentally match a landmark, not part of a sequence requested by query, we still get garbage like score.