UTF8 decode on unpinned bytes

andrewthad opened this issue a year ago

I have, in the byteslice library, a type that looks like this:

data Bytes = Bytes
  { array :: {-# UNPACK #-} !ByteArray
  , offset :: {-# UNPACK #-} !Int
  , length :: {-# UNPACK #-} !Int
  }

This is the same thing as ByteString except that it doesn't require pinned memory and it cannot use memory that was allocated in C code. I'm trying to write this function (not in text, in my library):

decodeUtf8Bytes :: Bytes -> Maybe Text

The text library comes with a fast utf8 validation routine implemented in C++. However, it does not expose this in a way that lets me use it. To expose this, it would be sufficient to add this to text:

/* Add this to cbits/validate_utf8.cpp */
extern "C"
int _hs_text_is_valid_utf8_offset(const char* str, size_t off, size_t len){
  return simdutf::validate_utf8(str + off, len);
}

And a wrapper:

foreign import ccall unsafe "_hs_text_is_valid_utf8_offset" c_is_valid_utf8_offset
    :: ByteArray# -> CSize -> CSize -> IO CInt

With this wrapper, it becomes possible to perform UTF-8 validation of unpinned ByteArray# at arbitrary starting points.

If something like this were added to text, it could be exposed in an internal, unstable module. Let me know if this sounds like a welcome addition and, if so, where it should be exposed, and I can prepare a patch.

Callan McGill · Answer 1 · Fri Dec 09 2022 08:35:31 GMT+0800 (China Standard Time)

I think this would be a welcome addition (especially as it will only be promised internally).

ˌbodʲɪˈɡrʲim · Answer 2 · Fri Dec 09 2022 09:35:01 GMT+0800 (China Standard Time)

Looks reasonable to me.

Oleg Grenrus · Answer 3 · Fri Dec 09 2022 19:04:16 GMT+0800 (China Standard Time)

You probably want both ccall safe and ccall unsafe variants of _hs_text_is_valid_utf8_offset.

A big enough ByteArray# will be pinned, so it might be a good idea to check whether the length is big enough, then check that the array is actually pinned, and go the safe route.
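A minimal sketch of that routing decision, assuming the primitive package's ByteArray and isByteArrayPinned. The names CallKind, chooseCall and pinnedArray, and the threshold parameter, are hypothetical and only for illustration; the actual cutoff would have to be measured. Pinnedness matters because a safe foreign call lets the garbage collector run, and the collector may move an unpinned array while C code is still reading it.

```haskell
import Control.Monad.ST (runST)
import Data.Primitive.ByteArray
  ( ByteArray, isByteArrayPinned, newPinnedByteArray, unsafeFreezeByteArray )

-- | Which flavour of foreign import to use for one validation request.
data CallKind = UnsafeCall | SafeCall
  deriving (Eq, Show)

-- | Pick the call kind for validating @len@ bytes of @arr@.
-- @threshold@ is a hypothetical cutoff in bytes (an assumption, not a
-- measured value). The safe route is taken only when the input is long
-- enough to be worth it AND the array is pinned, since an unpinned
-- array must never be handed to a safe call.
chooseCall :: Int -> ByteArray -> Int -> CallKind
chooseCall threshold arr len
  | len >= threshold && isByteArrayPinned arr = SafeCall
  | otherwise = UnsafeCall

-- | Helper for trying chooseCall out: a pinned array of @n@ bytes.
-- Its contents are uninitialised and are never read here.
pinnedArray :: Int -> ByteArray
pinnedArray n = runST (newPinnedByteArray n >>= unsafeFreezeByteArray)
```

With this shape, a caller would dispatch on the result and invoke either the safe or the unsafe import of _hs_text_is_valid_utf8_offset; short inputs and unpinned arrays always stay on the unsafe path.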

Andrew Martin · Answer 4 · Tue Dec 27 2022 23:18:24 GMT+0800 (China Standard Time)

I've opened a PR with this at #483.

One thing I realized as I was doing this is that I need to provide a fallback when the SIMDUTF flag is off. I need to add a variant of the isValidBS fallback that works on ByteArray# instead of ByteString. I don't think this is terribly difficult, but I've not done it yet.
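For illustration, a non-SIMD fallback over an unpinned array slice could look roughly like the sketch below. This is not the code from the PR: it assumes the primitive package's ByteArray, the names isValidUtf8Slice and exampleBytes are hypothetical, and it simply follows the well-formed byte ranges from the UTF-8 definition (no overlong forms, no surrogates, nothing above U+10FFFF).

```haskell
import Data.Primitive.ByteArray (ByteArray, byteArrayFromList, indexByteArray)
import Data.Word (Word8)

-- | Check that @len@ bytes of @arr@ starting at byte @off@ are
-- well-formed UTF-8. Assumes 0 <= off, 0 <= len and that
-- off + len does not exceed the size of the array.
isValidUtf8Slice :: ByteArray -> Int -> Int -> Bool
isValidUtf8Slice arr off len = go off
  where
    end = off + len

    byteAt :: Int -> Word8
    byteAt = indexByteArray arr

    -- True when index i lies inside the slice and holds a byte in [lo, hi].
    inRange :: Int -> Word8 -> Word8 -> Bool
    inRange i lo hi = i < end && (let b = byteAt i in b >= lo && b <= hi)

    -- Ordinary continuation byte.
    cont :: Int -> Bool
    cont i = inRange i 0x80 0xBF

    go :: Int -> Bool
    go i
      | i >= end = True
      | b < 0x80 = go (i + 1)                                   -- ASCII
      | b >= 0xC2 && b <= 0xDF = cont (i + 1) && go (i + 2)     -- 2 bytes
      | b == 0xE0 = inRange (i + 1) 0xA0 0xBF && cont (i + 2) && go (i + 3)
      | b == 0xED = inRange (i + 1) 0x80 0x9F && cont (i + 2) && go (i + 3)
      | b >= 0xE1 && b <= 0xEF = cont (i + 1) && cont (i + 2) && go (i + 3)
      | b == 0xF0 =
          inRange (i + 1) 0x90 0xBF && cont (i + 2) && cont (i + 3) && go (i + 4)
      | b >= 0xF1 && b <= 0xF3 =
          cont (i + 1) && cont (i + 2) && cont (i + 3) && go (i + 4)
      | b == 0xF4 =
          inRange (i + 1) 0x80 0x8F && cont (i + 2) && cont (i + 3) && go (i + 4)
      | otherwise = False   -- 0x80..0xC1 and 0xF5..0xFF never start a sequence
      where
        b = byteAt i

-- | Example input: "h" followed by the two-byte encoding of U+00E9.
exampleBytes :: ByteArray
exampleBytes = byteArrayFromList ([0x68, 0xC3, 0xA9] :: [Word8])
```

The offset and length arguments mirror the proposed C entry point, so a slice that starts in the middle of a multi-byte sequence (for example offset 2 of exampleBytes) is rejected rather than silently accepted.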

Andrew Martin · Answer 5 · Wed Dec 28 2022 00:54:31 GMT+0800 (China Standard Time)

I've added the important missing stuff.