Search does not find all matches
jwillmer opened this issue · comments
Maybe I missunderstand the search but I think it does not work as it should. I use fuzisearch.js
to compare urls with each other.
If someone misstypes a url like:
http://jwillmer.github.io/jekyllDecent/features
he is redirected to an error page and I add all urls from my website:
{
"title": "Theme Installation and Usage",
"url": "http://jwillmer.github.io/jekyllDecent/blog/readme/Readme",
},
{
"title": "Theme Features",
"url": "http://jwillmer.github.io/jekyllDecent/blog/features/Features",
},
{
"title": "YAML Custom Features",
"url": "http://jwillmer.github.io/jekyllDecent/blog/features/YAML-Features",
},
{
"title": "This post demonstrates post content styles",
"url": "http://jwillmer.github.io/jekyllDecent/blog/features/Content",
}
to fuzisearch.js
and as search result I only get one url back:
http://jwillmer.github.io/jekyllDecent/blog/features/Content
Shouldn’t there be at least three urls that have some kind of match?
hi @jwillmer! Is is possible for you to set up a simple test page for me to see how you're using fuzzyset and how it's failing? That would make it a lot easier for me to understand if there's a bug and how to fix it. Thanks!
What is the problem with my example? The files are not minified, the init for fuzziset.js
is in offline.js
and you can find the repo at GitHub.
Okay, I see now. I was worried that there was lots of other code that would interfere with the bug-finding process, but it's not bad.
I briefly looked into it, but I don't know why it's happening. It's only matching on gram size 3, returning one result, and not checking for gram size 2 because it found a result for 3. I probably won't get around to checking this for a long time, but you might try messing with the gram sizes in the mean time, @jwillmer.
It could be that a fork of the project has fixed the bug without knowing that it is a bug: willlma@b6d9c59
The bug is in those lines:
for (var i = 0; i < results.length; ++i) {
if (results[i][0] == results[0][0]) {
newResults.push([results[i][0], this.exactSet[results[i][1]]]);
}
}
That alone causes that only best scored items is returned.
By some logic, this can fit implicit semantics of get
method which usually returns concrete value, not list of search results.
My proposition would be to add search
method that returns all results, sorted by score and delegate slicing (which this code does) to application, so it can choose how many results shall be shown.
I'm also seeing this on the project example page.
Go here: http://glench.github.io/fuzzyset.js/#example type in "Mas" or "Mass".
Expected result: Massachusetts
Given results:
Mas -> Maine, Texas
Mass -> Maine
Interestingly, if you skip the first letter you get the correct result.
Ok cool, does someone want to submit a pull request with the correct changes?
Another problem in the example: "Northern Marianas Islands" is in the list, but when you search for "Islands Northern" you get no results. Not such a good fuzzy implementation if it doesn't tokenize.
@Glench would you mind to fix this issue for us? Since you know the project best and the investigation has already been done 😉
why reinvent the wheel? use something that works now https://github.com/atom/fuzzaldrin
working on it! It will probably be fixed today.
@monolithpl I think the problem with that particular example is that you should set useLevenshtein
to false
. I believe Levenshtein distance sees your example and the thing you think it should match as very far away from each other since it would require lots of transposition to make the strings match. Whereas just using cosine distance the entries have a higher match percentage.
@ryanweal Testing this now, it seems like setting useLevenshtein = false
helps match partially-typed words better. I would suggest using that option.
Ok, so I fixed the library to look a bit more like @WillMa's fork. Now by default results with a score >= .33 are shown and that score can be passed to the get
method as the third parameter e.g.
fuzzyset.get('hi', null, .5)
(any result with a score >= .5 will be matched)
In general, I would also recommend trying to set useLevenshtein
to false
in the case of partial string matching. For such long entries such as URLs as in @jwillmer's original post, I would also try experimenting with the gram size, or trim the URLs so only relevant paths are stored in the search. Or maybe try @jeancroy's strategy (which I haven't tried). Thanks @jeancroy!
There's also now a rough debugging interface for fuzzyset.js that might help you understand what's going on a little better. I used it to try out @jwillmer's original example.