Glench / fuzzyset.js

fuzzyset.js - A fuzzy string set for javascript

Home Page: http://glench.github.io/fuzzyset.js/

Search does not find all matches

jwillmer opened this issue

Maybe I misunderstand the search, but I think it does not work as it should. I use fuzzyset.js to compare urls with each other.

If someone mistypes a url like:

http://jwillmer.github.io/jekyllDecent/features

they are redirected to an error page, and I add all urls from my website:

    {
        "title": "Theme Installation and Usage",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/readme/Readme"
    },
    {
        "title": "Theme Features",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/features/Features"
    },
    {
        "title": "YAML Custom Features",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/features/YAML-Features"
    },
    {
        "title": "This post demonstrates post content styles",
        "url": "http://jwillmer.github.io/jekyllDecent/blog/features/Content"
    }

to fuzzyset.js, and as the search result I only get one url back:

http://jwillmer.github.io/jekyllDecent/blog/features/Content

Shouldn’t there be at least three urls that have some kind of match?

Hi @jwillmer! Is it possible for you to set up a simple test page for me to see how you're using fuzzyset and how it's failing? That would make it a lot easier for me to understand if there's a bug and how to fix it. Thanks!

What is the problem with my example? The files are not minified, the init for fuzzyset.js is in offline.js, and you can find the repo at GitHub.

Okay, I see now. I was worried that there was lots of other code that would interfere with the bug-finding process, but it's not bad.

I briefly looked into it, but I don't know why it's happening. It's only matching on gram size 3, returning one result, and not checking gram size 2 because it already found a result for 3. I probably won't get around to checking this for a long time, but you might try messing with the gram sizes in the meantime, @jwillmer.
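
The early exit described above can be sketched roughly as follows. This is a simplified illustration, not the library's actual code: the names `grams` and `lookup`, the '-' padding, and the "shares at least one gram" matching rule are all assumptions made for the example.

```javascript
// Hypothetical helper: split a string into overlapping n-grams, padded with '-'.
function grams(value, size) {
  const padded = '-' + value.toLowerCase() + '-';
  const out = [];
  for (let i = 0; i + size <= padded.length; i++) {
    out.push(padded.slice(i, i + size));
  }
  return out;
}

// Hypothetical lookup: try the largest gram size first and stop at the
// first size that yields any hit at all.
function lookup(query, entries, maxSize = 3, minSize = 2) {
  for (let size = maxSize; size >= minSize; size--) {
    const queryGrams = new Set(grams(query, size));
    const hits = entries.filter((entry) =>
      grams(entry, size).some((g) => queryGrams.has(g))
    );
    if (hits.length > 0) {
      return { size, hits }; // smaller gram sizes are never consulted
    }
  }
  return { size: null, hits: [] };
}

// 'bat' shares the 2-gram 'at' with 'cats' but no 3-gram, so it is only
// found when the lookup is allowed to start at gram size 2.
console.log(lookup('cats', ['cats', 'bat']));
console.log(lookup('cats', ['cats', 'bat'], 2, 2));
```

Under these assumptions, lowering the upper gram size is one way to let weaker candidates through, which is roughly what "messing with the gram sizes" amounts to.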

It could be that a fork of the project has fixed the bug without knowing that it is a bug: willlma@b6d9c59

The bug is in these lines:

        for (var i = 0; i < results.length; ++i) {
            if (results[i][0] == results[0][0]) {
                newResults.push([results[i][0], this.exactSet[results[i][1]]]);
            }
        }

That check alone means that only the best-scoring items are returned.
Arguably, this fits the implicit semantics of a get method, which usually returns a concrete value rather than a list of search results.

My proposal would be to add a search method that returns all results, sorted by score, and to delegate slicing (which this code does) to the application, so it can choose how many results to show.
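
A minimal sketch of that proposal, assuming the matcher has already produced `[score, value]` pairs. The names `search` and `bestOnly` are hypothetical and not part of fuzzyset.js.

```javascript
// Hypothetical: return every scored result, best first, without filtering.
function search(scored) {
  // Copy before sorting so the caller's array is left untouched.
  return scored.slice().sort((a, b) => b[0] - a[0]);
}

// For comparison: keep only the results tied with the top score,
// which is what the quoted loop effectively does.
function bestOnly(scored) {
  const sorted = search(scored);
  return sorted.filter((result) => result[0] === sorted[0][0]);
}

const scored = [[0.4, 'a'], [0.9, 'b'], [0.9, 'c'], [0.6, 'd']];

// The application decides how many results to show.
console.log(search(scored).slice(0, 3));
console.log(bestOnly(scored));
```

With this split, a get-style call can stay strict while the application slices the full list however it likes.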

I'm also seeing this on the project example page.

Go to http://glench.github.io/fuzzyset.js/#example and type in "Mas" or "Mass".

Expected result: Massachusetts

Given results:
Mas -> Maine, Texas
Mass -> Maine

Interestingly, if you skip the first letter you get the correct result.

Indeed, @WillMa's fork resolves this. I had been testing with @washt's fork locally, as that is what lives in npm, but dropping in WillMa's replacement gives me multiple results, and the results match my expectation.

Ok cool, does someone want to submit a pull request with the correct changes?

Another problem in the example: "Northern Marianas Islands" is in the list, but when you search for "Islands Northern" you get no results. Not such a good fuzzy implementation if it doesn't tokenize.

@Glench would you mind fixing this issue for us? You know the project best, and the investigation has already been done 😉

why reinvent the wheel? use something that works now https://github.com/atom/fuzzaldrin

working on it! It will probably be fixed today.

@monolithpl I think the problem with that particular example is that you should set useLevenshtein to false. I believe Levenshtein distance sees your example and the thing you think it should match as very far away from each other, since it would take many edits to turn one string into the other. Using just cosine distance, the entries have a higher match percentage.

@ryanweal Testing this now, it seems like setting useLevenshtein = false helps match partially-typed words better. I would suggest using that option.
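
To see why the two measures can disagree on the "Islands Northern" example, here is a self-contained sketch that does not use fuzzyset.js at all. The normalisation of the edit distance, the '-' padding, and the choice of 2-grams are assumptions for illustration; the library's own scoring may differ.

```javascript
// Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Edit distance rescaled to a 0..1 similarity (one common normalisation).
function levenshteinScore(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - levenshtein(a, b) / longest;
}

// Count the overlapping n-grams of a '-'-padded, lowercased string.
function gramCounts(value, size = 2) {
  const padded = '-' + value.toLowerCase() + '-';
  const counts = new Map();
  for (let i = 0; i + size <= padded.length; i++) {
    const gram = padded.slice(i, i + size);
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Cosine similarity between the two gram-count vectors.
function gramCosine(a, b, size = 2) {
  const ca = gramCounts(a, size);
  const cb = gramCounts(b, size);
  let dot = 0;
  for (const [gram, n] of ca) dot += n * (cb.get(gram) || 0);
  const norm = (m) => Math.sqrt([...m.values()].reduce((sum, n) => sum + n * n, 0));
  return dot / (norm(ca) * norm(cb));
}

const query = 'islands northern';
const entry = 'northern marianas islands';
console.log('levenshtein score:', levenshteinScore(query, entry).toFixed(2));
console.log('2-gram cosine:    ', gramCosine(query, entry).toFixed(2));
```

Because the gram counts ignore word order, the reordered query still shares most of its grams with the entry, while the edit distance is penalised for every character that is out of place.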

Ok, so I fixed the library to look a bit more like @WillMa's fork. Now, by default, results with a score >= .33 are shown, and that minimum score can be passed to the get method as the third parameter, e.g.

fuzzyset.get('hi', null, .5) (any result with a score >= .5 will be matched)

In general, I would also recommend trying to set useLevenshtein to false in the case of partial string matching. For long entries such as the URLs in @jwillmer's original post, I would also try experimenting with the gram size, or trimming the URLs so only the relevant paths are stored in the search. Or maybe try @jeancroy's strategy (which I haven't tried). Thanks @jeancroy!
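
One possible way to trim the URLs, sketched with the standard `URL` class. The helper name `toSearchKey` and the `/jekyllDecent/` base path are assumptions taken from the URLs in the original post; the short keys, rather than the full URLs, would then be what gets stored in the set.

```javascript
// Hypothetical helper: reduce a full URL to the part of the path that
// actually differs between pages. `basePath` is an assumption based on
// the URLs quoted in this thread.
function toSearchKey(url, basePath = '/jekyllDecent/') {
  const { pathname } = new URL(url); // WHATWG URL, global in browsers and Node.js
  return pathname.startsWith(basePath)
    ? pathname.slice(basePath.length)
    : pathname;
}

const urls = [
  'http://jwillmer.github.io/jekyllDecent/blog/readme/Readme',
  'http://jwillmer.github.io/jekyllDecent/blog/features/Features',
  'http://jwillmer.github.io/jekyllDecent/blog/features/YAML-Features',
  'http://jwillmer.github.io/jekyllDecent/blog/features/Content',
];

// Map each short key back to its full URL so a match can still be
// turned into a link on the error page.
const keyToUrl = new Map(urls.map((url) => [toSearchKey(url), url]));

console.log([...keyToUrl.keys()]);
console.log(toSearchKey('http://jwillmer.github.io/jekyllDecent/features'));
```

Stripping the shared host and base path removes the grams every entry has in common, so the remaining text is what distinguishes one page from another.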

There's also now a rough debugging interface for fuzzyset.js that might help you understand what's going on a little better. I used it to try out @jwillmer's original example.