No safe `decodeASCII :: ByteString -> Maybe Text`

Question

raehik opened this issue a year ago

I would like to efficiently parse a bytestring as an ASCII character string, rejecting any input that contains non-ASCII bytes (including otherwise valid UTF-8). text-2.0 improved Data.Text.Encoding.decodeASCII, giving it its own definition rather than piggybacking off decodeUtf8; but it's partial, and has worse error handling. It's not easy to implement this efficiently outside the library, because the check that a buffer is valid ASCII is done by a bundled C function that isn't exposed. It feels like a useful function for low-level conversions (I certainly have a use for it in binrep).
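For illustration, here is a minimal sketch of the total function being asked for, written with only exported functions from bytestring and text. The name decodeAsciiMaybe is hypothetical, and the byte-by-byte check is an assumption standing in for the bundled C routine, so this is likely slower than what the library could do internally.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper, not part of text: total ASCII decoding.
-- Every byte must be below 0x80; otherwise the input is rejected.
decodeAsciiMaybe :: ByteString -> Maybe T.Text
decodeAsciiMaybe bs
  | B.all (< 0x80) bs = Just (TE.decodeLatin1 bs) -- ASCII is a subset of Latin-1
  | otherwise         = Nothing
```

decodeLatin1 is total, so once the check passes the decode cannot fail. The cost is a second pass over the buffer, which is exactly the overhead a library-level implementation could avoid.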

The bytestring decoding in text feels clunky overall. I have to copy an unexposed snippet that catches thrown exceptions to convert them to Either UnicodeException Text. Could the interface here be improved? I would gladly take part in implementing any changes.
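The kind of snippet meant here might look roughly like the following. This is a sketch under assumptions, not a copy of the library's internals: tryDecode is a hypothetical name, and it relies on strict Text being fully decoded once evaluated to weak head normal form.

```haskell
import Control.Exception (evaluate, try)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)
import System.IO.Unsafe (unsafeDupablePerformIO)

-- Hypothetical wrapper: run a throwing decoder and capture a
-- UnicodeException as a Left instead of letting it escape.
tryDecode :: (ByteString -> T.Text) -> ByteString -> Either UnicodeException T.Text
tryDecode decode = unsafeDupablePerformIO . try . evaluate . decode

-- Example: 0xFF is never valid in UTF-8, so this is a Left.
invalidExample :: Either UnicodeException T.Text
invalidExample = tryDecode TE.decodeUtf8 (B.pack [0xFF])
```

For UTF-8 specifically the library already exports decodeUtf8', which has this shape. The wrapper only becomes necessary for decoders that have no total variant, and it only catches UnicodeException, not calls to error.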

Ben Orchard · Answer 1 · Thu Jan 26 2023 21:52:20 GMT+0800 (China Standard Time)

Somewhat related, I feel like an efficient isAscii :: Text -> Bool could be exported with the new UTF-8 internal representation. text-short has it at Data.Text.Short.Internal.isAscii :: ShortText -> Bool. Text.all Char.isAscii is fine, but I'm conscious that it does a lot more work than it needs to.
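For reference, the Text.all Char.isAscii fallback mentioned above can be written as below. The name isAsciiText is hypothetical. It decodes every character before testing it, whereas with a UTF-8 representation it would be enough to check that no byte has its high bit set.

```haskell
import qualified Data.Char as Char
import qualified Data.Text as T

-- Hypothetical helper: character-by-character ASCII check.
-- Correct, but it decodes each code point just to compare it.
isAsciiText :: T.Text -> Bool
isAsciiText = T.all Char.isAscii
```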

Xia Li-yao · Answer 2 · Thu Jan 26 2023 22:32:08 GMT+0800 (China Standard Time)

Yes, there really should be total versions of all the decoding functions.

Ben Orchard · Answer 3 · Thu Jan 26 2023 23:10:41 GMT+0800 (China Standard Time)

Nice. This is on my radar, particularly isAscii :: Text -> Bool.

Ben Orchard · Answer 4 · Sat Jan 28 2023 02:14:47 GMT+0800 (China Standard Time)

Fast isAscii is tracked at #497.

Ben Orchard · Answer 5 · Sun Jan 29 2023 02:29:07 GMT+0800 (China Standard Time)

Safe decodeASCII' :: ByteString -> Either Int Text is tracked at #499.
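To make the proposed type concrete, here is a sketch of one way it could behave. It assumes the Int in the Left case is the offset of the first non-ASCII byte, which this thread does not confirm, and decodeAsciiEither is a hypothetical name rather than the function tracked in that issue.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper. Assumption: Left carries the index of the
-- first byte that is not ASCII (that is, the first byte >= 0x80).
decodeAsciiEither :: ByteString -> Either Int T.Text
decodeAsciiEither bs =
  case B.findIndex (>= 0x80) bs of
    Just i  -> Left i
    Nothing -> Right (TE.decodeLatin1 bs) -- safe: all bytes are ASCII
```

Returning the offset lets a caller report where the input went wrong, or split it and handle the ASCII prefix separately.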

Ben Orchard · Answer 6 · Thu Feb 16 2023 01:11:10 GMT+0800 (China Standard Time)

Both functions discussed here have been given efficient implementations and merged. On a larger scale, decoding received an overhaul in #448. Thanks to the maintainers and everyone else who helped me, for the feedback and speedy turnaround!