haskell / text

Haskell library for space- and time-efficient operations over Unicode text.

Home Page:http://hackage.haskell.org/package/text

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

`isValidUtf8ByteArray` returns false positives

sol opened this issue · comments

Repro:

ghci> :set -XOverloadedLists
ghci> import Data.Text.Internal.Validate
ghci> isValidUtf8ByteArray [128] 0 1
False
ghci> isValidUtf8ByteArray [0, 128] 1 1
True

(GHC 9.8.1 / Linux x64)

  1. Apparently my GHC provided text-2.1 is not built with simdutf.
  2. The Haskell fallback looks dodgy. I guess the len in line 137 should be end where end = off + len (same in line 143).

isValidUtf8ByteArrayHaskell# b !off !len = start off
where
indexWord8 :: ByteArray# -> Int -> Word8
indexWord8 !x (I# i) = W8# (indexWord8Array# x i)
start ix
| ix >= len = True

| ix >= len = False

@andrewthad where did you take this code from? Are there any tests?

2. The Haskell fallback looks dodgy. I guess the len in line 137 should be end where end = off + len (same in line 143).

Good catch. Could you possibly prepare a PR please?

Good catch. Could you possibly prepare a PR please?

I am more inclined to fix this by using bytestring_is_valid_utf8. However, after spending some time on it, I am close to the conclusion that bytestring_is_valid_utf8 is broken as well.

@sol The implementation of isValidUtf8ByteArrayHaskell# is copied from one of the shims for Data.Text.Internal.Validate.isValidUtf8ByteString:

isValidUtf8ByteString :: ByteString -> Bool
#ifdef SIMDUTF
isValidUtf8ByteString bs = withBS bs $ \fp len -> unsafeDupablePerformIO $
  unsafeWithForeignPtr fp $ \ptr -> (/= 0) <$> c_is_valid_utf8_ptr_unsafe ptr (fromIntegral len)
#else
#if MIN_VERSION_bytestring(0,11,2)
isValidUtf8ByteString = B.isValidUtf8
#else
isValidUtf8ByteString bs = start 0
  where
    start ix
      | ix >= B.length bs = True
      | otherwise = case utf8DecodeStart (B.unsafeIndex bs ix) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st _ -> step (ix + 1) st
    step ix st
      | ix >= B.length bs = False
      -- We do not use decoded code point, so passing a dummy value to save an argument.
      | otherwise = case utf8DecodeContinue (B.unsafeIndex bs ix) st (CodePoint 0) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st' _ -> step (ix + 1) st'
#endif
#endif

I've investigated a little to figure out where isValidUtf8ByteString came from. Before I added anything, isValidUtf8ByteString was already there, but it was named isValidBS instead. But, between the time that I put up my PR and the time it was merged, commit 7ef771d made it in, which removed isValidBS. So, from commit history on master, it looks like isValidUtf8ByteString just came out of nowhere, but if you go back to 6f1917d, you can its predecessor isValidBS.

I botched the adaptation though, and the comparison should be ix >= off + len instead of ix >= len. (Or computing end = off + len in a where clause and then using ix >= end instead.

There are not any tests for this function that I'm aware of.