Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined"

Question

Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined"

c0001 opened this issue a year ago · comments

Hi maintainer:

For recent 1.8.5 release checking out on my usage side, all of fuzzy search method (i.e. any search mode unless precise) thru CJK (i.e. chinese, japan and korean) prefix gives me an error message of:

Error Message:  undefined

without any candidates listed anymore.

I've spent some time to investigate the problem by checking out recently git commits for locating the final no-problem commit, it is 98ee21e. And I've compared the latest commit upates content again with that no-problem commit to found that the only break changes between is the deps updates via commit 86ee9c6.

In my test, when I revert the deps updates via commit 86ee9c6 based on latest commit (i.e. the main branch), the problem is gone.

I've not test which dep's updating caused this issue, hope for founding and helping me how to fix this problem.

Thanks.

Simon Heimler · Answer 1 · Mon Mar 27 2023 01:52:56 GMT+0800 (China Standard Time)

Hi @c0001,

thanks for the report and the investigation!
From what you write, I'm rather sure that the problem is actually coming from the https://github.com/leeoniya/uFuzzy library. There we already have a newer version. I could try a dependency update to that and if that works, I can create a new release. If not, could you consider creating your report at the uFuzzy GitHub issues?

There seems to be one issue already: leeoniya/uFuzzy#14

Could you also send me some text examples which cause the issue? Then I can use this to replicate.

Thanks!
Simon

Entroperiance.J · Answer 2 · Mon Mar 27 2023 18:05:25 GMT+0800 (China Standard Time)

I've check the leeoniya/uFuzzy#14 issue which says ufuzzy may
need a more complex regexp specified for its option entry while used
for CJK terms based search, and below is my experimentation snippet:

const ufuzzy = require('@leeoniya/ufuzzy')

let opts = {
    unicode: false,
    interSplit: "(\p{Unified_Ideograph=yes})+"
};

let uf = new ufuzzy(opts)

const list = ['Javascript太棒了', 'Javascript素晴らしい', 'Javascript너무훌륭해훌륭해']

let strs = {cn: '太棒', jp: 'らし', kr: '훌륭'}

for (let i in strs) {
    console.log(i + ':' + uf.filter(list, strs[i]));
}

If I don't specify interSplit with CJK full stack range as above
(which I've taken from here which see Unified_Ideograph),
the search is failed as returning null even if I using the latest
version of uFuzzy. It seems uFuzzy can not use CJK chars for
calculating matches.

I just make a little bit try, hope that's useful.

Could you also send me some text examples which cause the issue?
Then I can use this to replicate.

The contents in above code example may be enough?

There we already have a newer version. I could try a dependency
update to that and if that works, I can create a new release

I've tested this lastest commit yet, still throwing same error.

If not, could you consider creating your report at the uFuzzy GitHub issues?

Yeah, by the investigation, I think that should be but may be
duplicated with leeoniya/uFuzzy#14, thus that may be later or
... any suggestion issued to?

Thanks!
Briant

Simon Heimler · Answer 3 · Tue Mar 28 2023 02:30:47 GMT+0800 (China Standard Time)

Hi Briant,

ok, I got it. So if I can fix this issue by just configuring the interSplit regexp, then I'll do that. I'm just not sure on the consequences on this.

But since you obviously can check out and build this project, I'll create a PR and can merge it - before making a release out of it. Then you and I can test a bit.

Simon Heimler · Answer 4 · Tue Mar 28 2023 03:09:39 GMT+0800 (China Standard Time)

Ok, merged it. For me it seemed to work fine - could you also try out and verify?

Entroperiance.J · Answer 5 · Tue Mar 28 2023 15:59:57 GMT+0800 (China Standard Time)

(Sorry for above duplicated commit references caused by my four times rebase, but I don't know how to remove them)

@Fannon

Thanks, that's seems worked for me as well, but I think that's not a usual way since interSplit is not a exposed fixed user option based on uFuzzy's README, and the commentary of interSplit (here) just say that's a term segmentation & punct/whitespace merging and defaults to [^A-Za-z0-9']+ which let me thought of that the p{Unified_Ideograph}=yes regexp is considerred as a divider filter rather than a mactch? If so, the fix of this may not be proper.

Thus, I make a new pull-request to provide a ufuzzyOptions user side customization which let us can do more test and tricks, is that OK?

Leon Sorokin · Answer 6 · Tue Mar 28 2023 20:12:21 GMT+0800 (China Standard Time)

yes, interSplit should only be used for white-space and punctuation. the default setting excludes apostrophes so that don't does not get matched as don t.

https://stackoverflow.com/questions/4328500/how-can-i-strip-all-punctuation-from-a-string-in-javascript-using-regex

Entroperiance.J · Answer 7 · Wed Mar 29 2023 00:45:45 GMT+0800 (China Standard Time)

Hi uFuzzy author @leeoniya :

Thanks for issuing the internal mechanism of option interSplit, and I've checked it with the regexp range covering whitespace and unicode punctuations, and worked for most of cases for example run bellow test.js:

const process = require('process')
const ufuzzy  = require('@leeoniya/ufuzzy')

let opts = {
    interSplit: "(^\\s|\\p{P})+",
};

let uf = new ufuzzy(opts)

const list = ['Javascript 太棒了', 'Javascript素晴らしい', 'Javascript너무훌륭해훌륭해']

let strs = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '太 棒', '素\t晴 ら', '훌”륭\t\t해 훌']

let c = 0
for (let i in strs) {
    let j = uf.filter(list, strs[i])
    if (!j) {
        console.log('\'' + strs[i] + '\'' + ': ' + j);
        c++;
    }
}
if (c!=0) {
    console.log("--> Failure of some letter search, Abort!")
    process.exit(1)
} else {
    console.log("--> All of searching available, Done!")
}

This given the result of

>$ node test.js 
--> All of searching available, Done!

But if I changed interSplit regexp to ([^a-zA-Z']|\\s|\\p{P})+ that plusses your defaults, the result is failure of all CJK patterns:

>$ node test_ufuzzy.js 
'太 棒': null
'素     晴 ら': null
'훌”륭          해 훌': null
--> Failure of some letter search, Abort!

That's why?

With checking uppon my previous post in this issue, if I change the interSplit regexp (\p{Unified_Ideograph=yes})+ to either [^a-zA-Z'\p{Unified_Ideograph=yes}]+ or ([^a-zA-Z']|\p{Unified_Ideograph=yes})+ for a compatibility test, the test is still failure with all of cn,jp and kr search result are null, even if I replace [^a-zA-Z] to purely simple "not" logical regexp char classes bracket (such as [^a]). Unless I remove the ^. It seems like the ^ keyword will also affect the result?

In the other hand, can you give more description of uFuzzy's designation of interSplit and either for its sibling intraSplit since I'm confused about what that is based on the understanding of interSplit which you've pointed out. Or return to the original issue, my question is that does uFuzzy can fully support fuzzy searching with extending Latin/Roman alphabet with CJK terms and any existent workaround can be released out here?

Thanks!
Briant

Simon Heimler · Answer 8 · Wed Mar 29 2023 02:06:16 GMT+0800 (China Standard Time)

Thanks @c0001 for working on further clarifying this!

So I'll revert my (then premature) fix in the master branch and wait until we found a good solution that works well for both Latin/Roman and CJK search items.

I do like the idea of just exposing uFuzzyOptions directly for power users. But I hope that we also find a more general, user friendly fix :)

Entroperiance.J · Answer 9 · Wed Mar 29 2023 17:16:25 GMT+0800 (China Standard Time)

According to #91, We have flexible uFuzzyOptions user
customization place now, so the issue can be halt temporarily.

The problem is for those users are not familiar with uFuzzy
specification that how we make a guide for them to be involved in. Or
based on this discussion of #90 , a simple supportCJK may be
approachable for common user. To my wish, let user-end be with
non-feeling of this change (i.e. find a robust way to make both of
them workable internally) may be Best.

Further more for any one first time dating with this problem, this
previous post may be useful.

Thanks!
Briant

Simon Heimler · Answer 10 · Thu Mar 30 2023 01:35:16 GMT+0800 (China Standard Time)

@c0001 : If you have any further insights how to best configure uFuzzy for your case, feel free to propose a better README fix or mention. In case we find a solution which works general-purpose, we can either have it in by default or if it comes with drawbacks, then have it as a dedicated option (like enableCJK: true)

Shall we close this or keep it open? Your call.

Entroperiance.J · Answer 11 · Thu Mar 30 2023 05:21:56 GMT+0800 (China Standard Time)

Of course I will, I've made some changes to that but hadn't got the best way even if following the leeoniya's suggestion to use fully punctuations and whitespace binding to interSplit since the new unterminated chars classes problem is found out. Uh...relax :)

Shall we close this or keep it open? Your call.

I've preferred to halt this issue as Close as not planned since there's no exact fix found here. I halt this now, if any problem is presenting on, whenever reopening is welcom.

Simon Heimler · Answer 12 · Thu Mar 30 2023 13:39:35 GMT+0800 (China Standard Time)

Ok, great. Thanks again!
I've now created a new release and submitted it to the stores. Usually this takes a few days.

Deps updates since commit 86ee9c6 cause CJK characters *fuzzy* search throw error message of "undefined"

Deps updates since commit 86ee9c6 cause CJK characters fuzzy search throw error message of "undefined"