haskell / text

Repro:

ghci> :set -XOverloadedLists
ghci> import Data.Text.Internal.Validate
ghci> isValidUtf8ByteArray [128] 0 1
False
ghci> isValidUtf8ByteArray [0, 128] 1 1
True

(GHC 9.8.1 / Linux x64)

1. Apparently my GHC-provided text-2.1 is not built with simdutf.
2. The Haskell fallback looks dodgy. I guess the len in line 137 should be end where end = off + len (same in line 143).

text/src/Data/Text/Internal/Validate.hs

Lines 132 to 137 in 73620de

    
    isValidUtf8ByteArrayHaskell# b !off !len = start off
      where
        indexWord8 :: ByteArray# -> Int -> Word8
        indexWord8 !x (I# i) = W8# (indexWord8Array# x i)
        start ix
          | ix >= len = True

text/src/Data/Text/Internal/Validate.hs

Line 143 in 73620de

| ix >= len = False

@andrewthad where did you take this code from? Are there any tests?

> 2. The Haskell fallback looks dodgy. I guess the len in line 137 should be end where end = off + len (same in line 143).

Good catch. Could you possibly prepare a PR please?

> Good catch. Could you possibly prepare a PR please?

I am more inclined to fix this by using bytestring_is_valid_utf8. However, after spending some time on it, I am close to the conclusion that bytestring_is_valid_utf8 is broken as well.

haskell/bytestring#620

@sol The implementation of isValidUtf8ByteArrayHaskell# is copied from one of the shims for Data.Text.Internal.Validate.isValidUtf8ByteString:

isValidUtf8ByteString :: ByteString -> Bool
#ifdef SIMDUTF
isValidUtf8ByteString bs = withBS bs $ \fp len -> unsafeDupablePerformIO $
  unsafeWithForeignPtr fp $ \ptr -> (/= 0) <$> c_is_valid_utf8_ptr_unsafe ptr (fromIntegral len)
#else
#if MIN_VERSION_bytestring(0,11,2)
isValidUtf8ByteString = B.isValidUtf8
#else
isValidUtf8ByteString bs = start 0
  where
    start ix
      | ix >= B.length bs = True
      | otherwise = case utf8DecodeStart (B.unsafeIndex bs ix) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st _ -> step (ix + 1) st
    step ix st
      | ix >= B.length bs = False
      -- We do not use decoded code point, so passing a dummy value to save an argument.
      | otherwise = case utf8DecodeContinue (B.unsafeIndex bs ix) st (CodePoint 0) of
        Accept{} -> start (ix + 1)
        Reject{} -> False
        Incomplete st' _ -> step (ix + 1) st'
#endif
#endif

I've investigated a little to figure out where isValidUtf8ByteString came from. It already existed before I added anything, but under the name isValidBS. Between the time I put up my PR and the time it was merged, commit 7ef771d landed, which removed isValidBS. So from the commit history on master it looks like isValidUtf8ByteString came out of nowhere, but if you go back to 6f1917d, you can see its predecessor, isValidBS.

I botched the adaptation though: the comparison should be ix >= off + len instead of ix >= len (or compute end = off + len in a where clause and then use ix >= end).
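For anyone following along, here is a minimal model of the off-by-offset bound. It is only a sketch: it works on a plain list instead of ByteArray#, and isAsciiByte is a hypothetical stand-in for the real decoder (it accepts ASCII only, which is enough here because 128 is not a valid leading byte anyway). The names validBuggy and validFixed are made up for illustration and are not part of text.

```haskell
import Data.Word (Word8)

-- Hypothetical stand-in for the real UTF-8 decoder: accepts ASCII bytes only.
isAsciiByte :: Word8 -> Bool
isAsciiByte w = w < 0x80

-- Loop shape from the report: starts at `off` but compares against `len`.
validBuggy :: [Word8] -> Int -> Int -> Bool
validBuggy bytes off len = start off
  where
    start ix
      | ix >= len = True
      | otherwise = isAsciiByte (bytes !! ix) && start (ix + 1)

-- Suggested fix: compare against the exclusive end index, off + len.
validFixed :: [Word8] -> Int -> Int -> Bool
validFixed bytes off len = start off
  where
    end = off + len
    start ix
      | ix >= end = True
      | otherwise = isAsciiByte (bytes !! ix) && start (ix + 1)
```

With off = 1 and len = 1, validBuggy [0, 128] 1 1 hits ix >= len on the very first step and returns True without looking at any byte, which matches the repro. validFixed [0, 128] 1 1 inspects the byte at index 1 and returns False.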

There are not any tests for this function that I'm aware of.
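A regression test for this could check that validating a window in place agrees with validating a copy of that window at offset 0. Below is a rough sketch of that property, under some assumptions: the validator is passed in as a list-based function (the Validator type, asciiWindow, smallCases and counterexample are all hypothetical names, not text APIs), and asciiWindow is an ASCII-only stand-in for the real decoder, parameterised on how the loop bound is computed.

```haskell
import Data.Word (Word8)

-- Hypothetical list-based validator signature: bytes, offset, length.
type Validator = [Word8] -> Int -> Int -> Bool

-- ASCII-only stand-in validator, parameterised on how the loop bound is computed.
asciiWindow :: (Int -> Int -> Int) -> Validator
asciiWindow bound bytes off len = go off
  where
    end = bound off len
    go ix
      | ix >= end = True
      | otherwise = bytes !! ix < 0x80 && go (ix + 1)

-- Property: validating a window in place agrees with validating a copy at offset 0.
sliceAgrees :: Validator -> [Word8] -> Int -> Int -> Bool
sliceAgrees valid bytes off len =
  valid bytes off len == valid (take len (drop off bytes)) 0 len

-- All in-bounds (input, off, len) cases over a two-byte alphabet, up to length 3.
smallCases :: [([Word8], Int, Int)]
smallCases =
  [ (bytes, off, len)
  | n <- [0 .. 3]
  , bytes <- sequence (replicate n [0x00, 0x80])
  , off <- [0 .. n]
  , len <- [0 .. n - off]
  ]

-- First case on which the property fails, if any.
counterexample :: Validator -> Maybe ([Word8], Int, Int)
counterexample valid =
  case [c | c@(bs, off, len) <- smallCases, not (sliceAgrees valid bs off len)] of
    []      -> Nothing
    (c : _) -> Just c
```

In this model, counterexample (asciiWindow (\_ len -> len)) finds ([0, 128], 1, 1), the same input as the repro, while counterexample (asciiWindow (+)) returns Nothing. A real test would swap in the actual ByteArray-based function and a proper generator of UTF-8 and non-UTF-8 inputs.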


`isValidUtf8ByteArray` returns false positives