UTF8 decode on unpinned bytes
andrewthad opened this issue · comments
I have, in the byteslice
library, a type that looks like this:
data Bytes = Bytes
{ array :: {-# UNPACK #-} !ByteArray
, offset :: {-# UNPACK #-} !Int
, length :: {-# UNPACK #-} !Int
}
This is the same thing as ByteString
except that it doesn't require pinned memory and it cannot use memory that was allocated in C code. I'm trying to write this function (not in text
, in my library):
decodeUtf8Bytes :: Text -> Maybe Bytes
The text
library comes with a fast utf8 validation routine implemented in C++. However, it does not expose this in a way that lets me use it. To expose this, it would be sufficient to add this to text
:
/* Add this to cbits/validate_utf8.cpp */
extern "C"
int _hs_text_is_valid_utf8_offset(const char* str, size_t off, size_t len){
return simdutf::validate_utf8(str + off, len);
}
And a wrapper:
foreign import ccall unsafe "_hs_text_is_valid_utf8_offset" c_is_valid_utf8_offset
:: ByteArray# -> CSize -> CSize -> IO CInt
With this wrapper, it becomes possible to perform UTF-8 validation of unpinned ByteArray#
at arbitrary starting points.
If something like this were added to text
, it could be exposed in an internal, unstable module. Let me know if this sounds like a welcome addition (and if it is, with some direction on where this should be exposed), and I can prepare a patch.
I think this would be a welcome addition (especially as it will only be promised internally).
Looks reasonable to me.
You probably want both ccall safe
and ccall unsafe
variants of _hs_text_is_valid_utf8_offset
.
For big enough ByteArray#
it will be pinned, so it might be good idea to check whether length is big enough, then check that array is actually pinned, and go the safe
route.
I've opened a PR with this at #483.
One thing I realized as I was doing this is that I need to provide a fallback when the SIMDUTF flag is off. I need to add a variant of the isValidBS
fallback that works on ByteArray#
instead of ByteString
. I don't think this is terribly difficult, but I've not done it yet.
I've added the important missing stuff.