mishandling of utf8 replacement character

Question

josharian opened this issue a year ago

Josh Bleecher Snyder commented a year ago

Add this test case to var fuzzyTests, and run the tests:

	{"\xffinvalid UTF-8\xff", "", false, -1},

Result:

--- FAIL: TestFuzzyMatchFold (0.00s)
panic: runtime error: slice bounds out of range [19:15] [recovered]
	panic: runtime error: slice bounds out of range [19:15]

goroutine 35 [running]:
testing.tRunner.func1.2({0x1031423e0, 0x14000114018})
	/Users/josh/go/1.20/src/testing/testing.go:1526 +0x1c8
testing.tRunner.func1()
	/Users/josh/go/1.20/src/testing/testing.go:1529 +0x384
panic({0x1031423e0, 0x14000114018})
	/Users/josh/go/1.20/src/runtime/panic.go:884 +0x204
golang.org/x/text/transform.String({0x103153b28, 0x103297c60}, {0x1030cb139, 0xf})
	/Users/josh/pkg/mod/golang.org/x/text@v0.9.0/transform/transform.go:650 +0x9e4
github.com/lithammer/fuzzysearch/fuzzy.stringTransform({0x1030cb139, 0xf}, {0x103153b28?, 0x103297c60?})
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy.go:242 +0x64
github.com/lithammer/fuzzysearch/fuzzy.match({0x1030cb139?, 0x7?}, {0x0, 0x0}, {0x103153b28, 0x103297c60})
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy.go:55 +0x38
github.com/lithammer/fuzzysearch/fuzzy.MatchFold(...)
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy.go:41
github.com/lithammer/fuzzysearch/fuzzy.TestFuzzyMatchFold(0x1400011cb60)
	/Users/josh/x/fuzzysearch/fuzzy/fuzzy_test.go:65 +0xbc
testing.tRunner(0x1400011cb60, 0x103152088)
	/Users/josh/go/1.20/src/testing/testing.go:1576 +0x10c
created by testing.(*T).Run
	/Users/josh/go/1.20/src/testing/testing.go:1629 +0x368
exit status 2
FAIL	github.com/lithammer/fuzzysearch/fuzzy	0.131s

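This bug existed prior to #53 (phew!). The root cause is that unicodeFoldTransformer.Transform returns n, n, err, but when utf8.RuneError is present, nSrc may differ from nDst: an invalid byte consumes 1 byte of source, while the replacement character U+FFFD encodes to 3 bytes. That is consistent with the [19:15] in the panic: the input is 15 bytes long, and replacing its two invalid bytes yields 19 bytes. I'll try to put together a fix sometime soonish.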
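To make the length mismatch concrete, here is a minimal standalone sketch using only the standard library. foldCounts is a hypothetical helper written for illustration, not code from fuzzysearch, and it assumes the fold is a simple unicode.ToLower per rune. It tracks the bytes consumed and the bytes that would be written as two separate counters:

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// foldCounts is a hypothetical helper (not part of fuzzysearch).
// It walks src rune by rune and reports how many bytes a lowercased,
// re-encoded copy would need (nDst) and how many source bytes were
// consumed (nSrc).
func foldCounts(src []byte) (nDst, nSrc int) {
	for nSrc < len(src) {
		// An invalid byte decodes to (utf8.RuneError, 1):
		// one source byte consumed.
		r, size := utf8.DecodeRune(src[nSrc:])
		nSrc += size
		// utf8.RuneError is U+FFFD, which takes 3 bytes to encode.
		nDst += utf8.RuneLen(unicode.ToLower(r))
	}
	return nDst, nSrc
}

func main() {
	nDst, nSrc := foldCounts([]byte("\xffinvalid UTF-8\xff"))
	fmt.Println(nSrc, nDst) // 15 19
}
```

The two counters only agree when every rune re-encodes to the same number of bytes it was decoded from, which is why returning a single n for both can go wrong.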

(Found by fuzzing. Once the fuzz tests make it out of the gate without stumbling, I'll PR them.)

Josh Bleecher Snyder · Answer 1 · Fri May 05 2023 01:34:04 GMT+0800 (China Standard Time)

Package transform is elegant and composes well, but moving away from it may end up making the code simpler, faster, and more robust. (Need to think about that a bit.)

Josh Bleecher Snyder · Answer 2 · Fri May 05 2023 05:10:13 GMT+0800 (China Standard Time)

	{"Ⱦ", "", false, -1},

is another interesting test case because its lowercase form has a different UTF-8 encoded length than its uppercase form.