Search does not find all matches

Question

Search does not find all matches

jwillmer opened this issue 8 years ago · comments

Maybe I missunderstand the search but I think it does not work as it should. I use fuzisearch.js to compare urls with each other.

If someone misstypes a url like:

http://jwillmer.github.io/jekyllDecent/features

he is redirected to an error page and I add all urls from my website:

    {
        "title": "Theme Installation and Usage",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/readme/Readme",
    },
    {
        "title": "Theme Features",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/features/Features",
    },
    {
        "title": "YAML Custom Features",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/features/YAML-Features",
    },
    {
        "title": "This post demonstrates post content styles",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/features/Content",
    }

to fuzisearch.js and as search result I only get one url back:

http://jwillmer.github.io/jekyllDecent/blog/features/Content

Shouldn’t there be at least three urls that have some kind of match?

Glen Chiacchieri · Answer 1 · Thu Jun 16 2016 03:33:51 GMT+0800 (China Standard Time)

hi @jwillmer! Is is possible for you to set up a simple test page for me to see how you're using fuzzyset and how it's failing? That would make it a lot easier for me to understand if there's a bug and how to fix it. Thanks!

Jens Willmer · Answer 2 · Thu Jun 16 2016 04:03:11 GMT+0800 (China Standard Time)

What is the problem with my example? The files are not minified, the init for fuzziset.js is in offline.jsand you can find the repo at GitHub.

Glen Chiacchieri · Answer 3 · Thu Jun 16 2016 09:22:42 GMT+0800 (China Standard Time)

Okay, I see now. I was worried that there was lots of other code that would interfere with the bug-finding process, but it's not bad.

I briefly looked into it, but I don't know why it's happening. It's only matching on gram size 3, returning one result, and not checking for gram size 2 because it found a result for 3. I probably won't get around to checking this for a long time, but you might try messing with the gram sizes in the mean time, @jwillmer.

Jens Willmer · Answer 4 · Thu Jun 16 2016 14:18:32 GMT+0800 (China Standard Time)

It could be that a fork of the project has fixed the bug without knowing that it is a bug: willlma@b6d9c59

Zbigniew Zagórski · Answer 5 · Tue Sep 20 2016 15:45:07 GMT+0800 (China Standard Time)

The bug is in those lines:

        for (var i = 0; i < results.length; ++i) {
            if (results[i][0] == results[0][0]) {
                newResults.push([results[i][0], this.exactSet[results[i][1]]]);
            }
        }

That alone causes that only best scored items is returned.
By some logic, this can fit implicit semantics of get method which usually returns concrete value, not list of search results.

My proposition would be to add search method that returns all results, sorted by score and delegate slicing (which this code does) to application, so it can choose how many results shall be shown.

Ryan Weal · Answer 6 · Fri Jan 06 2017 04:10:32 GMT+0800 (China Standard Time)

I'm also seeing this on the project example page.

Go here: http://glench.github.io/fuzzyset.js/#example type in "Mas" or "Mass".

Expected result: Massachusetts

Given results:
Mas -> Maine, Texas
Mass -> Maine

Interestingly, if you skip the first letter you get the correct result.

Ryan Weal · Answer 7 · Fri Jan 06 2017 04:20:20 GMT+0800 (China Standard Time)

Indeed @WillMa's fork resolves this. I had been testing with @washt's fork locally as that is what lives in npm but dropping in willma's replacement gives me multiple results and the results match my expectation.

Glen Chiacchieri · Answer 8 · Sat Jan 21 2017 06:31:15 GMT+0800 (China Standard Time)

Ok cool, does someone want to submit a pull request with the correct changes?

Wiktor Jakubczyc · Answer 9 · Fri Mar 24 2017 08:17:06 GMT+0800 (China Standard Time)

Another problem in the example: "Northern Marianas Islands" is in the list, but when you search for "Islands Northern" you get no results. Not such a good fuzzy implementation if it doesn't tokenize.

Jens Willmer · Answer 10 · Thu Jun 01 2017 17:15:02 GMT+0800 (China Standard Time)

@Glench would you mind to fix this issue for us? Since you know the project best and the investigation has already been done 😉

Wiktor Jakubczyc · Answer 11 · Thu Jun 01 2017 19:40:48 GMT+0800 (China Standard Time)

why reinvent the wheel? use something that works now https://github.com/atom/fuzzaldrin

Glen Chiacchieri · Answer 12 · Thu Jun 01 2017 23:53:37 GMT+0800 (China Standard Time)

working on it! It will probably be fixed today.

Glen Chiacchieri · Answer 13 · Fri Jun 02 2017 01:09:25 GMT+0800 (China Standard Time)

@monolithpl I think the problem with that particular example is that you should set useLevenshtein to false. I believe Levenshtein distance sees your example and the thing you think it should match as very far away from each other since it would require lots of transposition to make the strings match. Whereas just using cosine distance the entries have a higher match percentage.

Glen Chiacchieri · Answer 14 · Fri Jun 02 2017 03:58:08 GMT+0800 (China Standard Time)

@ryanweal Testing this now, it seems like setting useLevenshtein = false helps match partially-typed words better. I would suggest using that option.

Glen Chiacchieri · Answer 15 · Fri Jun 02 2017 05:20:38 GMT+0800 (China Standard Time)

Ok, so I fixed the library to look a bit more like @WillMa's fork. Now by default results with a score >= .33 are shown and that score can be passed to the get method as the third parameter e.g.

fuzzyset.get('hi', null, .5) (any result with a score >= .5 will be matched)

In general, I would also recommend trying to set useLevenshtein to false in the case of partial string matching. For such long entries such as URLs as in @jwillmer's original post, I would also try experimenting with the gram size, or trim the URLs so only relevant paths are stored in the search. Or maybe try @jeancroy's strategy (which I haven't tried). Thanks @jeancroy!

There's also now a rough debugging interface for fuzzyset.js that might help you understand what's going on a little better. I used it to try out @jwillmer's original example.