`isValidUtf8ByteArray` returns false positives
sol opened this issue
Repro:
ghci> :set -XOverloadedLists
ghci> import Data.Text.Internal.Validate
ghci> isValidUtf8ByteArray [128] 0 1
False
ghci> isValidUtf8ByteArray [0, 128] 1 1
True
(GHC 9.8.1 / Linux x64)
1. Apparently my GHC-provided `text-2.1` is not built with `simdutf`.
2. The Haskell fallback looks dodgy. I guess the `len` in line 137 should be `end` where `end = off + len` (same in line 143).
text/src/Data/Text/Internal/Validate.hs, lines 132 to 137 in 73620de
text/src/Data/Text/Internal/Validate.hs, line 143 in 73620de
@andrewthad where did you take this code from? Are there any tests?
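The reproduction above can be modelled with a small sketch. This is not the code from `Validate.hs`: `isAsciiSliceBuggy` is a hypothetical stand-in that works on a plain list and accepts only ASCII bytes, and it assumes the fallback loop starts at `off` but compares the absolute index against `len`.

```haskell
import Data.Word (Word8)

-- Hypothetical stand-in for the real validator: it accepts only ASCII
-- bytes, which is enough to show the bounds problem.
isAsciiSliceBuggy :: [Word8] -> Int -> Int -> Bool
isAsciiSliceBuggy bytes off len = go off
  where
    go ix
      -- Suspected bug: ix is an absolute index, len is a relative length.
      | ix >= len = True
      | otherwise = bytes !! ix < 0x80 && go (ix + 1)
```

In this model `isAsciiSliceBuggy [128] 0 1` is `False` and `isAsciiSliceBuggy [0, 128] 1 1` is `True`, matching the two results in the repro: whenever `off >= len`, the loop stops before reading a single byte.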
> 2. The Haskell fallback looks dodgy. I guess the `len` in line 137 should be `end` where `end = off + len` (same in line 143).
Good catch. Could you possibly prepare a PR please?
> Good catch. Could you possibly prepare a PR please?
I am more inclined to fix this by using `bytestring_is_valid_utf8`. However, after spending some time on it, I am close to the conclusion that `bytestring_is_valid_utf8` is broken as well.
@sol The implementation of `isValidUtf8ByteArrayHaskell#` is copied from one of the shims for `Data.Text.Internal.Validate.isValidUtf8ByteString`:
isValidUtf8ByteString :: ByteString -> Bool
#ifdef SIMDUTF
isValidUtf8ByteString bs = withBS bs $ \fp len -> unsafeDupablePerformIO $
  unsafeWithForeignPtr fp $ \ptr -> (/= 0) <$> c_is_valid_utf8_ptr_unsafe ptr (fromIntegral len)
#else
#if MIN_VERSION_bytestring(0,11,2)
isValidUtf8ByteString = B.isValidUtf8
#else
isValidUtf8ByteString bs = start 0
  where
    start ix
      | ix >= B.length bs = True
      | otherwise = case utf8DecodeStart (B.unsafeIndex bs ix) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st _ -> step (ix + 1) st
    step ix st
      | ix >= B.length bs = False
      -- We do not use decoded code point, so passing a dummy value to save an argument.
      | otherwise = case utf8DecodeContinue (B.unsafeIndex bs ix) st (CodePoint 0) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st' _ -> step (ix + 1) st'
#endif
#endif
I've investigated a little to figure out where `isValidUtf8ByteString` came from. Before I added anything, `isValidUtf8ByteString` was already there, but it was named `isValidBS` instead. However, between the time that I put up my PR and the time it was merged, commit 7ef771d made it in, which removed `isValidBS`. So, from the commit history on `master`, it looks like `isValidUtf8ByteString` just came out of nowhere, but if you go back to 6f1917d, you can see its predecessor `isValidBS`.
I botched the adaptation though, and the comparison should be `ix >= off + len` instead of `ix >= len`. (Or compute `end = off + len` in a `where` clause and then use `ix >= end` instead.)
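To make the corrected bound concrete, here is a minimal sketch of a slice validator that computes `end = off + len` once and compares every index against it. It is a hypothetical list-based model written against the byte ranges of RFC 3629, not the library's `utf8DecodeStart` / `utf8DecodeContinue` state machine, and it assumes `off + len` does not exceed the length of the list.

```haskell
import Data.Word (Word8)

-- Hypothetical list-based model of a slice validator; not the library's code.
-- Assumes 0 <= off and off + len <= length bytes.
isValidUtf8Slice :: [Word8] -> Int -> Int -> Bool
isValidUtf8Slice bytes off len = start off
  where
    end = off + len  -- exclusive upper bound, as an absolute index

    between lo hi b = lo <= b && b <= hi
    cont = between 0x80 0xBF  -- ordinary continuation byte

    -- True when the bytes starting at ix satisfy all predicates
    -- without running past the end of the slice.
    follows ix preds =
      ix + length preds <= end && and (zipWith ($) preds (drop ix bytes))

    start ix
      | ix >= end           = True
      | b <= 0x7F           = start (ix + 1)
      | between 0xC2 0xDF b = next [cont]
      | b == 0xE0           = next [between 0xA0 0xBF, cont]       -- no overlong forms
      | b == 0xED           = next [between 0x80 0x9F, cont]       -- no surrogates
      | between 0xE1 0xEF b = next [cont, cont]
      | b == 0xF0           = next [between 0x90 0xBF, cont, cont] -- no overlong forms
      | between 0xF1 0xF3 b = next [cont, cont, cont]
      | b == 0xF4           = next [between 0x80 0x8F, cont, cont] -- at most U+10FFFF
      | otherwise           = False
      where
        b = bytes !! ix
        next preds = follows (ix + 1) preds && start (ix + 1 + length preds)
```

In this model `isValidUtf8Slice [0, 128] 1 1` evaluates to `False`, because the loop now actually inspects the byte at index 1. A sequence cut off by the end of the slice, such as `isValidUtf8Slice [0xC3, 0xA9] 0 1`, is also rejected.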
There are not any tests for this function that I'm aware of.
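One way to close that gap might be a property that needs no reference implementation: validating a slice in place should agree with validating a copy of the same bytes placed at offset 0. The sketch below checks this exhaustively on small inputs. Everything in it is hypothetical: `Validator`, `toy`, `buggy`, `fixed`, and `sliceCounterexamples` are illustrative names, and the toy validators accept only ASCII so that the example depends on nothing beyond `base`.

```haskell
import Control.Monad (replicateM)
import Data.Word (Word8)

type Validator = [Word8] -> Int -> Int -> Bool

-- Hypothetical toy validator (ASCII only), parameterised by how the
-- exclusive loop bound is derived from offset and length.
toy :: (Int -> Int -> Int) -> Validator
toy bound bytes off len = go off
  where
    go ix
      | ix >= bound off len = True
      | otherwise           = bytes !! ix < 0x80 && go (ix + 1)

buggy, fixed :: Validator
buggy = toy (\_ len -> len)  -- models `ix >= len`
fixed = toy (+)              -- models `ix >= off + len`

-- All small slices on which in-place validation disagrees with
-- validating a copy of the slice placed at offset 0.
sliceCounterexamples :: Validator -> [([Word8], Int, Int)]
sliceCounterexamples valid =
  [ (bs, off, len)
  | n   <- [0 .. 3]
  , bs  <- replicateM n [0x00, 0x80]
  , off <- [0 .. n]
  , len <- [0 .. n - off]
  , valid bs off len /= valid (take len (drop off bs)) 0 len
  ]
```

Here `sliceCounterexamples fixed` is empty, while `sliceCounterexamples buggy` contains `([0, 128], 1, 1)`, the case from the repro. To exercise the real function, one could presumably pass a wrapper that converts the list to a `ByteArray` (for example via `GHC.Exts.fromList`) and calls `isValidUtf8ByteArray`, with a wider byte alphabet that includes multi-byte lead bytes.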