haskell / text

Haskell library for space- and time-efficient operations over Unicode text.

Home Page: http://hackage.haskell.org/package/text

Decoupling byte-level encoding

BurningWitness opened this issue

When writing a JSON parser (GaloisInc/json#17) I needed some way to decode UTF-8, and to my dismay I found that none of the existing solutions fit my expectations:

  • GHC.Encoding.UTF8 and GHC.IO.Encoding are IO-based and I don't want that in a parser;
  • Data.Text.Internal.Encoding.Utf8, while pure, both reports errors as a bare Reject and has a rather complex interface;
  • Data.Text.Encoding.* and Data.Text.Lazy.Encoding.* are already parsers themselves, too high-level for this task;
  • utf8-string's Codec.Binary.UTF8.String consumes and returns lists, so it isn't parser-compatible.

I decided to handroll the UTF-8 decoding, which allowed me to categorize the errors (see Encoding.Mixed.Error) and resulted in a lot of code on the parser side that has little to do with consuming bytes per se (see Codec.Web.JSON.Parse.String).

However the code I wrote can instead be generalized to:

-- Assume Error is Encoding.Mixed.Error.Error.
-- Naming convention: UTF8_n yields a completed n-byte sequence;
-- Part_* constructors await the next byte of a longer sequence.
import Data.Word (Word8)

data UTF8 a = UTF8_1 a
            | Part_2 (Word8 -> UTF8_2 a)
            | Part_3_1 (Word8 -> Part_3_1 a)
            | Part_4_1 (Word8 -> Part_4_1 a)
            | Error_1 Error


data UTF8_2 a = UTF8_2 a
              | Error_2 Error


data Part_3_1 a = Part_3_2 (Word8 -> UTF8_3 a)
                | Error_3_1 Error

data UTF8_3 a = UTF8_3 a
              | Error_3_2 Error


data Part_4_1 a = Part_4_2 (Word8 -> Part_4_2 a)
                | Error_4_1 Error

data Part_4_2 a = Part_4_3 (Word8 -> UTF8_4 a)
                | Error_4_2 Error

data UTF8_4 a = UTF8_4 a
              | Error_4_3 Error


newtype Conv1 a = Conv1 (Word8 -> a)
newtype Conv2 a = Conv2 (Word8 -> Word8 -> a)
newtype Conv3 a = Conv3 (Word8 -> Word8 -> Word8 -> a)
newtype Conv4 a = Conv4 (Word8 -> Word8 -> Word8 -> Word8 -> a)

utf8 :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a -> Word8 -> UTF8 a
utf8 = -- I'm omitting the implementation, but it's only 50 lines long

Parsing then is simply unwrapping UTF8. This decouples character validation from conversion; the only part of decoding left is ensuring that only the maximal subpart of an ill-formed sequence is consumed, which is the responsibility of the parser.
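Below is a minimal sketch of what that unwrapping looks like, assuming the datatypes above and a concrete utf8 implementation; decodeOne, withByte and the list-based input are names and simplifications introduced purely for illustration (a real parser would track offsets into a ByteString instead).

-- Hedged sketch: one decoding step built on the proposed API. Consumes one
-- leading byte plus however many continuation bytes the sequence needs, and
-- reports the caller-supplied `truncated` error if input runs out mid-sequence.
decodeOne
  :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a
  -> Error                     -- error to report on truncated input
  -> Word8                     -- leading byte, already consumed
  -> [Word8]                   -- remaining input
  -> Either Error (a, [Word8])
decodeOne c1 c2 c3 c4 truncated w0 bs0 =
  case utf8 c1 c2 c3 c4 w0 of
    UTF8_1 a    -> Right (a, bs0)
    Error_1 e   -> Left e

    Part_2 k    -> withByte bs0 $ \w1 bs1 ->
      case k w1 of
        UTF8_2 a    -> Right (a, bs1)
        Error_2 e   -> Left e

    Part_3_1 k  -> withByte bs0 $ \w1 bs1 ->
      case k w1 of
        Error_3_1 e -> Left e
        Part_3_2 k2 -> withByte bs1 $ \w2 bs2 ->
          case k2 w2 of
            UTF8_3 a    -> Right (a, bs2)
            Error_3_2 e -> Left e

    Part_4_1 k  -> withByte bs0 $ \w1 bs1 ->
      case k w1 of
        Error_4_1 e -> Left e
        Part_4_2 k2 -> withByte bs1 $ \w2 bs2 ->
          case k2 w2 of
            Error_4_2 e -> Left e
            Part_4_3 k3 -> withByte bs2 $ \w3 bs3 ->
              case k3 w3 of
                UTF8_4 a    -> Right (a, bs3)
                Error_4_3 e -> Left e
  where
    withByte []       _ = Left truncated
    withByte (w : ws) f = f w ws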


My proposal is to create a separate package focused specifically on decoding/encoding UTF-8/UTF-16/UTF-32 at the byte level. text could then drop some internal modules in favor of a simpler common interface.

This proposal is however naive: I do not know whether GHC can inline these datatypes reliably or, indeed, at all. Based on my cursory reading of the Secrets of the Glasgow Haskell Compiler inliner paper it should, as each of these expressions is trivial.

This doesn't clash with the issue of GHC's many UTF-8 implementations (outlined in GHC.Encoding.UTF8) as all other algorithms are in IO.

Other concerns:

  • text is a core library, so I assume an extra dependency can't just be added on a whim;
  • A package named utf already exists and is deprecated. I don't know how hard reclaiming deprecated packages is.

Adding a dependency to text is too much of a hassle IMO. But we can probably incorporate the desired changes into text itself. Could you please elaborate on why a naive parser from Data.Text.Internal.Encoding.Utf8 is not sufficient for your needs?

While Data.Text.Internal.Encoding.Utf8 is indeed functional enough to serve its purpose, my concerns are the following:

  • The interface is recursive, so the Incomplete state on the fourth byte is unreachable;
  • The Accept and Incomplete constructors force their fields, so returned codepoints need to be evaluated even if they're never used;
  • Ideally I'd want to share the error type with the text library, but alas DecodeError represents that as a String and there's no way to derive that from the Reject result.

I do have to admit that all of these issues are minor, and I do not know why anyone would ever need such fine-grained errors (other than cool error reporting), but the approach I'm proposing is the properly decoupled Haskell view of things.

One thing to note is that I haven't looked deeply into the structure of Hoehrmann's C-based decoder, but from what I can see, by-the-book decoding is just a chain of up to thirteen comparisons, so I don't yet understand the need for a complex state machine here (other than code brevity of course, but Haskell isn't C).
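For reference, here is the shape of the comparison chain I have in mind for the leading byte; this is a hedged illustration with a name of my own (leaderLength), not code from the linked repository:

import Data.Word (Word8)

-- Classify a leading byte with plain guards and no lookup table.
-- Returns the expected total sequence length, or 0 for a byte that can
-- never start a well-formed sequence.
leaderLength :: Word8 -> Int
leaderLength w
  | w < 0x80  = 1   -- 00..7F: ASCII, a single comparison
  | w < 0xC2  = 0   -- 80..C1: continuation byte or overlong leader
  | w < 0xE0  = 2   -- C2..DF: leads a two-byte sequence
  | w < 0xF0  = 3   -- E0..EF: leads a three-byte sequence
  | w < 0xF5  = 4   -- F0..F4: leads a four-byte sequence
  | otherwise = 0   -- F5..FF: never valid in UTF-8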

For performance reasons two array lookups are much better than up to 13 comparisons.

Once Reject is returned, one is supposed to apply whatever error reporting is desired. If you keep the previous state at hand, it should be fairly straightforward to do so.

For performance reasons two array lookups are much better than up to 13 comparisons.

Isn't this only true if the entire lookup table resides in L1 cache? Sure this will work fine for C parsers, but I don't know if any random Haskell parser interleaved with the algorithm can guarantee this.

Also it's 1 comparison for 00..7F and 5 for 80..7FF, so for really simple strings even two array lookups in L1 cache seem like overkill.

Rolling a benchmark to compare the two approaches should be easy, so perhaps I should do that.

The main blocker for this proposal is going to be performance. I'd be surprised if you can use your API to write a streaming JSON parser whose performance is comparable to using the Data.Text.Internal.Encoding.Utf8 module or the recently added validateUtf8Chunk (etc.) primitives in Data.Text.Internal.Encoding.

There is an intentional trade-off of a tiny bit of imprecision for a lot of performance. The parser state fits in a single byte (DecoderState), which can be easily unpacked by GHC optimizations into a tight loop that does no allocations. In contrast, an API like the one you propose, with lots of first-class functions, aims to more accurately represent the state machine for parsing UTF-8, reducing unreachable branches, but (1) GHC won't be able to optimize the allocations away, and (2) it's unclear how that granularity results in practical benefits.

The interface is recursive, so the Incomplete state on the fourth byte is unreachable;

Making that state unreachable is really the main point of your API, and as you mention it's unclear what the use case would be.

The Accept and Incomplete constructors force their fields, so returned codepoints need to be evaluated even if they're never used;

The fields are one word each. The expectation is that they are going to be unpacked in a tight loop that does not allocate. This is much cheaper than allocating a thunk for the partial codepoint to be evaluated only if it is used. If you don't need the partial code point (only doing validation), then you can use updateDecoderState instead.
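For example, a validation-only loop can be as small as the following hedged sketch (over a plain byte list for brevity; it assumes the text >= 2.0 exports named in the import and an Eq instance on DecoderState):

{-# LANGUAGE BangPatterns #-}

import Data.Word (Word8)
import Data.Text.Internal.Encoding.Utf8
  (DecoderState, updateDecoderState, utf8AcceptState, utf8RejectState)

-- Thread the one-byte DecoderState through a strict loop; no per-byte allocation.
validUtf8 :: [Word8] -> Bool
validUtf8 = go utf8AcceptState
  where
    go !s []       = s == utf8AcceptState        -- input must not end mid-sequence
    go !s (w : ws)
      | s' == utf8RejectState = False            -- ill-formed byte encountered
      | otherwise             = go s' ws
      where s' = updateDecoderState w s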

Isn't this only true if the entire lookup table resides in L1 cache? Sure this will work fine for C parsers, but I don't know if any random Haskell parser interleaved with the algorithm can guarantee this.

This is likely to be true irrespective of caches: bear in mind that in your case each comparison is a condition and involves 1 or 2 jump instructions depending on the branch chosen.

So after a week of tinkering I made a benchmark repository (link). The algorithms I wrote involve no low-level magic, just inlines and strict copying. The benchmark timings can be found in the README.md there; here are a few points that follow from the results:

  • GHC does indeed inline the data structure, even at -O1. I NOINLINEd both of the parsers I wrote, and the only places that retain references to Codec.Encoding.UTF8 in the final STG are the Text variants at chunk end, solely because I force it in the Resume type;

  • Pretty much all UTF-8 decoding is done using simdutf, so on every chunk border the arrays have to be pulled back from the ether just to do 1-4 lookups;

  • decodeUtf8 does not follow the maximal subpart rule.

Problems I could not resolve:

  • For whatever reason I can't turn off the simdutf flag. If someone can try out decodeUtf8 without the SIMD algorithm, that'd be quite nice as it's probably the only place that clearly outperforms my solution;

  • Based on the fact that the SIMD version of my Text algorithm runs faster on late errors than the basic one, the latter must be screwing up inlining and should be at least 10% faster when done right. This isn't that important, so I haven't dug into it;

Also I wonder why simdutf returns a boolean when it could return the last known valid UTF-8 boundary.


For the record all of my benchmarks have been executed on a laptop CPU, so, as with all things cache-related, YMMV, and extra benchmark subjects are welcome.

I concede that you can get your data structure to be inlined. But that relies on unrolling the loop yourself so you always start an iteration at a code point boundary. Performance-wise, the extra branching may have a detrimental effect on branch prediction. Your initial comment about IO made me assume you didn't want simdutf but I misunderstood. If the main loop uses simdutf then performance for the error branch is much less of a concern.

I'm still not convinced a more fine-grained API for UTF-8 is really better. I disagree that, in comparison, "Data.Text.Internal.Encoding.Utf8 (...) has a rather complex interface." That API is an automaton, which is as simple and standard as it gets: byte goes in, new state and/or output comes out. You don't have to unroll four nested pattern-matches to use that API efficiently. I think the main bit of apparent complexity is that it exposes the internal state for error reporting, and that part of the interface could be cleaned up to make it easier to diagnose errors.
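To make the "byte goes in, new state and/or output comes out" point concrete, here is a hedged sketch of driving that automaton over a plain byte list (decodeList is a name of my own; it assumes the text >= 2.0 exports in the import and stops at the first error without attempting the maximal-subpart rule):

import Data.Word (Word8)
import Data.Text.Internal.Encoding.Utf8
  (DecoderResult (..), utf8DecodeStart, utf8DecodeContinue)

-- Decode as many code points as possible, returning them together with the
-- input remaining after the first rejected byte (empty on success or when
-- the input ends in the middle of a sequence).
decodeList :: [Word8] -> ([Char], [Word8])
decodeList = start
  where
    start []       = ([], [])
    start (w : ws) = step (utf8DecodeStart w) ws

    step (Accept c)       ws       = let (cs, rest) = start ws in (c : cs, rest)
    step Reject           ws       = ([], ws)
    step (Incomplete _ _) []       = ([], [])
    step (Incomplete s c) (w : ws) = step (utf8DecodeContinue w s c) ws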

decodeUtf8 does not follow the maximal subpart rule.

That sounds like a bug, right?

Also I wonder why simdutf returns a boolean when it could return the last known valid UTF-8 boundary.

That's a feature request for simdutf. I don't know what the current status is, but it would indeed let us simplify UTF-8 parsing further.

Also, another API you haven't mentioned is Data.Text.Encoding.decodeUtf8Chunk. What's your opinion of that for your problem?

decodeUtf8 does not follow the maximal subpart rule ... sounds like a bug, right?

Yes, it should probably have its own issue. Can be replicated through the tests here.

You don't have to unroll four nested pattern-matches to use that API efficiently.

If you wish to respect the maximal subpart rule, an error encountered on the first byte results in consuming that byte, while an error on any successive byte does not (for example, for the bytes E2 28 only the E2 is consumed and reported, and the 28 is re-examined as the start of the next sequence). As such you need to track byte boundaries; that's four repeats with the array lookup algorithm. The full unroll is just as deep on the 4-byte branch, and every other branch is shallower than that.

I disagree that in comparison "Data.Text.Internal.Encoding.Utf8 (...) has a rather complex interface."

I admit my phrasing on this point was off; if anything, the fact that it's in an Internal module is a much better reason not to use it.

I'm fine with it existing in a separate module with proper documentation, as it may indeed be useful for some highly specific parsers, but so far the benchmarks I linked above show it's not even performance-critical in this library.

another API you haven't mentioned is Data.Text.Encoding.decodeUtf8Chunk

I have; it's the third bullet point of this issue. A JSON parser needs to treat " as end-of-parse and \ as its own small subparser, so anything beyond a byte-level decoder doesn't fit the purpose.

Having conceded that performance is not an issue, the only remaining difference with Data.Text.Internal.Encoding.Utf8 I see is that your API lets you not have any parser state (except the offset) in between code points. Am I missing anything else?

The trade-off is that you have to write a big tree of nested cases to effectively use that API, since every byte within a codepoint results in a different type of state. Those 40 lines of code correspond to these 7 lines of code in the text library. So even purely in terms of aesthetics ("the properly decoupled Haskell view of things.") it's a hard sell.

The proposed API actually makes things more coupled than Data.Text.Internal.Encoding.Utf8 because it exposes too many details of UTF-8 in the types.

In that case, would making Data.Text.Internal.Encoding.Utf8 not internal resolve this?

lets you not have any parser state (except the offset) in between code points

I don't think "lets" is the correct term here; you can weave any state you want into it, it's just a datatype.

in terms of aesthetics ("the properly decoupled Haskell view of things.") it's a hard sell / exposes too much UTF-8 in the types

The entire point is that it exposes all the inner workings while abstracting away all the hard parts. "Decoupled" doesn't mean "short" or "convenient", it just means you get the power to write whatever you want with no frills. It's obviously a low-level module, so people using it will be ready to spend five extra minutes and duplicate 30 lines.

making Data.Text.Internal.Encoding.Utf8 not internal resolve this

It would definitely help with other people using it, but at this point I would rather carry around a 170-line module that does it in a much more streamlined fashion with predictable performance.

This applies to the StrictBuilder as well (I call it Copy on my side). The exposed API can be used to do what's advertised, but it's not exposed properly or documented succinctly enough to be useful.


For the record you don't need a strong reason to deny this proposal, a simple "we don't have people for this, sorry" is enough. The reason I'm pushing for it is because I already have two different places I want to use it in and I don't want to toss it onto my personal pile of "should be on Hackage, but uploading is a nuisance, I haven't tested it enough and the previous maintainer is nowhere to be seen" projects.

I'm just making sure that I'm not completely missing the point of your proposal. Beyond that, we'll indeed have to agree to disagree, unless another @haskell/text maintainer wants to chime in.

  • Pretty much all UTF-8 decoding is done using simdutf, so on every chunk border the arrays have to be pulled back from the ether just to do 1-4 lookups;

There are three engines for UTF-8 validation in text:

  • If you can afford linking against C++, simdutf is used for bulk processing, and the naive engine kicks in only at the boundaries of chunks. Somewhat frustratingly, if you get a precompiled text directly from a GHC bindist, the simdutf flag is most likely disabled (because of linking issues).
  • Otherwise, if bytestring >= 0.11.5 (which is fairly new) is available, we use the UTF-8 engine from there (written in C) and again invoke the naive engine only at the boundaries.
  • Otherwise we use the naive engine full time.

I'm not sure what the story is supposed to be for the JavaScript backend: it might turn out that it's better to use the naive engine instead of compiling the C decoder from bytestring into JS.

If you want to benchmark Haskell native implementations, pass cabal build --constraint 'text -simdutf' --constraint 'bytestring < 0.11.5'.

There are ways to embellish Data.Text.Internal.Encoding.Utf8, e.g., expose byteToClass, provide descriptive pattern synonyms for ByteClass, and add something like explain :: DecoderState -> ByteClass -> String, which produces an explanation of what exactly went wrong. I am however reluctant to replace the mechanism entirely or to add one more UTF-8 decoding engine.

I agree that more fine-grained error reporting has its use cases, but I feel that it's better to iterate on it outside of text, in a separate package. Bear in mind, it is very difficult to change something in a boot library, and not easy to allow users to upgrade, so it's better to evolve the API elsewhere.

pass cabal build --constraint 'text -simdutf' --constraint 'bytestring < 0.11.5'

While I was missing the fact that bytestring needs to have a specific version, neither constraints, nor cabal.project modifications, nor even specifying a bytestring bound directly in the cabal file change anything. Even source-repository-package over a git clone doesn't apply, so I'm out of relatively sane options here.

There are three engines for UTF-8 validation

The performance concern applies specifically to the side case of using the simdutf/bytestring C validator, since crossing chunk borders with continuations still uses the array lookup algorithm. This is something I have tested, and it's slower even than naive comparisons (mind you, my algorithm is actually very slow in this side case too, since I force it to allocate the data structure).

a more fine-grained error reporting

My original point was that I wanted to share error handling with text for consistency, but now I know that an OnDecodeError handler is effectively passed a constant String and an entirely ambiguous Word8. As such this point is moot.
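For context, a hedged sketch of what a handler can actually see (myLenientDecode is a name of my own; the types come from Data.Text.Encoding.Error, where OnDecodeError unfolds to String -> Maybe Word8 -> Maybe Char):

import Data.Word (Word8)

-- The String is a fixed description and the Maybe Word8 is the offending byte
-- with no surrounding context, so a handler can't do much beyond substituting:
myLenientDecode :: String -> Maybe Word8 -> Maybe Char
myLenientDecode _message _offendingByte = Just '\xFFFD'  -- always emit U+FFFD

This is essentially what the library's own lenientDecode does.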


Right now the proposal grinds down to the following points:

  • Both the current array lookup algorithm and zero-cost datatype versions could be moved into a separate package or a set of non-internal modules within text, which would also allow the removal of Data.Text.Internal.Encoding.Utf* modules;

  • OnDecodeError and UnicodeException do not provide any reliable error information and as such may be reduced to Maybe Char and a unit respectively;

  • There is a minor performance improvement to be gained from using regular branches instead of array lookup when using simdutf/C validation code;

  • It may be a good idea to move StrictBuilder out of internals as well.

As the main point of this issue is adding algorithms that are not immediately needed within the library and cannot be abstracted into a separate boot library for management reasons, this issue is indeed dead in the water. If no one else has any strong opinions on this topic, I will close the issue at the end of this week.

Even source-repository-package over a git clone doesn't apply, so I'm out of relatively sane options here.

That's extremely strange; could you share a reproducer? It might be worth raising a bug against Cabal.

The performance concern applies specifically to the sidecase of using simdutf/bytestring C validator,

My point above was that there are situations when the naive decoder is the only one available, and its performance matters. If one wants to make statements about performance, this case should be measured with simdutf / the bytestring engine disabled.

the current array lookup algorithm ... could be moved into ... a set of non-internal modules within text

Makes sense to me.

  • OnDecodeError and UnicodeException do not provide any reliable error information and as such may be reduced to Maybe Char and a unit respectively;

That's largely true. Unfortunately, it's very difficult to iterate on a better interface without repeatedly breaking clients. There is not much demand though: usually clients treat pretty much any UTF-8 decoding error as just "hey, this data is not UTF-8", and the precise reason for the offence matters less. I appreciate that JSON decoding is somewhat less forgiving.

Anyways thanks for your efforts and interest!

Okay, I was able to run the benchmarks without SIMD by git cloning the package, renaming it and adding it to the packages section of the cabal.project, then adding PackageImports clarifications everywhere.

The results are surprisingly bad for the array lookup algorithm.

Variant            Correct               Early errors          Late errors           Garbage
                   32KiB      2MiB       32KiB      2MiB       32KiB      2MiB       32KiB      2MiB
Hoehrmann (SIMD)   13.69 μs   1.183 ms   22.45 μs   1.790 ms   163.8 μs   12.04 ms   7.104 ms   459.3 ms
Lazy (SIMD)        10.94 μs   888.5 μs   17.38 μs   1.255 ms   103.5 μs   7.962 ms   3.435 ms   221.0 ms
Hoehrmann          162.3 μs   12.24 ms   163.1 μs   11.68 ms   167.8 μs   12.52 ms   917.7 μs   58.82 ms
Lazy               93.24 μs   7.756 ms   119.0 μs   8.576 ms   121.6 μs   8.614 ms   611.7 μs   41.46 ms

I'm going to need someone to replicate this on their end and to check my findings for correctness, of course.

For the sake of benchmark reproducibility I incorporated the changes in a fork.

I have inlined everything as best I could; the only thing I did not touch is Data.Text.Internal.StrictBuilder (appendR zero-length checks may be the cause of the SIMD performance losses seen previously).


The following list includes every single library benchmark that matches the pattern $0 ~ /ecode/.

73620de -- HEAD
ebb70b1 -- Naive algorithm (with 73620de as baseline)

17bb010 -- HEAD without SIMD validation
67dea22 -- Naive algorithm without SIMD validation (with 17bb010 as baseline)

Test case 73620de ebb70b1 17bb010 67dea22
(a trailing percentage is the change of the preceding measurement relative to its baseline commit)
DecodeUtf8.html.Strict 69.9 μs 69.2 μs 1.41 ms 252 μs −82%
DecodeUtf8.html.Stream 70.1 μs 68.4 μs 823 μs 264 μs −67%
DecodeUtf8.html.StrictLength 111 μs 111 μs 1.45 ms 291 μs −80%
DecodeUtf8.html.StrictInitLength 112 μs 109 μs 1.45 ms 291 μs −79%
DecodeUtf8.html.Lazy 67.5 μs 68.6 μs 820 μs 262 μs −67%
DecodeUtf8.html.LazyLength 112 μs 111 μs 857 μs 334 μs −60%
DecodeUtf8.html.LazyInitLength 111 μs 110 μs 857 μs 301 μs −64%
DecodeUtf8.xml.Strict 11.3 ms 11.3 ms 245 ms 71.7 ms −70%
DecodeUtf8.xml.Stream 15.1 ms 14.9 ms 174 ms 78.2 ms −55%
DecodeUtf8.xml.StrictLength 19.2 ms 18.7 ms 252 ms 79.4 ms −68%
DecodeUtf8.xml.StrictInitLength 19.3 ms 19.1 ms 251 ms 79.4 ms −68%
DecodeUtf8.xml.Lazy 13.4 ms 13.3 ms 170 ms 76.7 ms −54%
DecodeUtf8.xml.LazyLength 19.8 ms 19.6 ms 176 ms 83.4 ms −52%
DecodeUtf8.xml.LazyInitLength 19.7 ms 19.6 ms 175 ms 83.1 ms −52%
DecodeUtf8.ascii.Strict 7.52 ms 7.46 ms 254 ms 35.5 ms −86%
DecodeUtf8.ascii.Stream 11.3 ms 11.1 ms 162 ms 39.2 ms −75%
DecodeUtf8.ascii.StrictLength 17.2 ms 16.1 ms 264 ms 44.5 ms −83%
DecodeUtf8.ascii.StrictInitLength 15.8 ms 15.6 ms 263 ms 44.4 ms −83%
DecodeUtf8.ascii.Lazy 12.4 ms 12.4 ms 161 ms 36.7 ms −77%
DecodeUtf8.ascii.LazyLength 19.1 ms 18.7 ms 170 ms 44.8 ms −73%
DecodeUtf8.ascii.LazyInitLength 18.9 ms 18.6 ms 168 ms 44.2 ms −73%
DecodeUtf8.russian.Strict 1.17 ms 1.17 ms 25.5 ms 8.36 ms −67%
DecodeUtf8.russian.Stream 1.37 ms 1.37 ms 16.4 ms 8.58 ms −47%
DecodeUtf8.russian.StrictLength 1.88 ms 1.89 ms 26.0 ms 9.75 ms −62%
DecodeUtf8.russian.StrictInitLength 1.88 ms 1.89 ms 26.0 ms 9.28 ms −64%
DecodeUtf8.russian.Lazy 1.37 ms 1.37 ms 16.5 ms 8.57 ms −48%
DecodeUtf8.russian.LazyLength 2.05 ms 2.03 ms 17.1 ms 9.24 ms −46%
DecodeUtf8.russian.LazyInitLength 2.03 ms 2.04 ms 16.8 ms 9.24 ms −45%
DecodeUtf8.japanese.Strict 3.61 μs 3.67 μs 59.0 μs 14.5 μs −75%
DecodeUtf8.japanese.Stream 3.63 μs 3.72 μs 31.5 μs 14.5 μs −53%
DecodeUtf8.japanese.StrictLength 5.34 μs 5.40 μs 60.9 μs 16.4 μs −73%
DecodeUtf8.japanese.StrictInitLength 5.32 μs 5.40 μs 60.3 μs 16.1 μs −73%
DecodeUtf8.japanese.Lazy 3.63 μs 3.62 μs 31.5 μs 14.5 μs −53%
DecodeUtf8.japanese.LazyLength 5.44 μs 5.42 μs 33.2 μs 16.3 μs −50%
DecodeUtf8.japanese.LazyInitLength 5.46 μs 5.43 μs 33.4 μs 16.1 μs −51%
DecodeUtf8.ascii.strict decodeUtf8 7.66 ms 7.41 ms 256 ms 35.4 ms −86%
DecodeUtf8.ascii.strict decodeLatin1 8.12 ms 8.02 ms 8.03 ms 8.06 ms
DecodeUtf8.ascii.strict decodeASCII 8.06 ms 8.05 ms 9.17 ms 8.06 ms −12%
DecodeUtf8.ascii.lazy decodeUtf8 11.4 ms 11.0 ms −3% 168 ms 37.3 ms −77%
DecodeUtf8.ascii.lazy decodeLatin1 13.1 ms 13.1 ms 14.0 ms 13.0 ms −7%
DecodeUtf8.ascii.lazy decodeASCII 11.6 ms 11.6 ms 13.0 ms 11.6 ms −11%
Pure.tiny.decode.Text 35.6 ns 59.7 ns +67% 27.7 ns 50.0 ns +80%
Pure.tiny.decode.LazyText 116 ns 87.8 ns −24% 133 ns 75.5 ns −43%
Pure.tiny.decode'.Text 47.9 ns 74.4 ns +55% 45.4 ns 62.7 ns +38%
Pure.tiny.decode'.LazyText 150 ns 115 ns −23% 159 ns 109 ns −31%
Pure.tiny.length.decode.Text 45.5 ns 73.7 ns +61% 38.5 ns 64.7 ns +68%
Pure.tiny.length.decode.LazyText 130 ns 90.5 ns −30% 146 ns 89.7 ns −38%
Pure.ascii-small.decode.Text 9.48 μs 9.57 μs 311 μs 46.4 μs −85%
Pure.ascii-small.decode.LazyText 11.6 μs 11.4 μs 237 μs 46.7 μs −80%
Pure.ascii-small.decode'.Text 9.58 μs 9.54 μs 310 μs 45.4 μs −85%
Pure.ascii-small.decode'.LazyText 11.6 μs 11.2 μs 237 μs 47.0 μs −80%
Pure.ascii-small.length.decode.Text 18.9 μs 18.9 μs 318 μs 55.3 μs −82%
Pure.ascii-small.length.decode.LazyText 20.4 μs 20.2 μs 244 μs 56.9 μs −76%
Pure.ascii.decode.Text 7.45 ms 7.42 ms 258 ms 35.5 ms −86%
Pure.ascii.decode.LazyText 20.8 ms 20.5 ms 205 ms 46.2 ms −77%
Pure.ascii.decode'.Text 7.40 ms 7.41 ms 254 ms 35.6 ms −86%
Pure.ascii.decode'.LazyText 20.6 ms 20.5 ms 205 ms 37.2 ms −81%
Pure.ascii.length.decode.Text 15.6 ms 15.5 ms 264 ms 44.6 ms −83%
Pure.ascii.length.decode.LazyText 19.9 ms 19.5 ms 213 ms 44.7 ms −78%
Pure.english.decode.Text 245 μs 201 μs −17% 17.8 ms 2.30 ms −87%
Pure.english.decode.LazyText 807 μs 800 μs 14.4 ms 2.55 ms −82%
Pure.english.decode'.Text 242 μs 201 μs −16% 17.3 ms 2.30 ms −86%
Pure.english.decode'.LazyText 817 μs 801 μs 14.2 ms 2.53 ms −82%
Pure.english.length.decode.Text 916 μs 911 μs 18.0 ms 2.91 ms −83%
Pure.english.length.decode.LazyText 1.35 ms 1.35 ms 14.1 ms 3.01 ms −78%
Pure.russian.decode.Text 3.30 μs 3.35 μs 59.5 μs 20.4 μs −65%
Pure.russian.decode.LazyText 3.39 μs 3.37 μs 41.7 μs 20.4 μs −51%
Pure.russian.decode'.Text 3.31 μs 3.37 μs 60.1 μs 20.4 μs −66%
Pure.russian.decode'.LazyText 3.45 μs 3.42 μs 41.8 μs 20.4 μs −51%
Pure.russian.length.decode.Text 4.90 μs 4.97 μs 61.6 μs 21.9 μs −64%
Pure.russian.length.decode.LazyText 5.03 μs 5.00 μs 43.2 μs 22.0 μs −49%
Pure.japanese.decode.Text 3.53 μs 3.58 μs 59.0 μs 14.5 μs −75%
Pure.japanese.decode.LazyText 3.73 μs 3.71 μs 34.4 μs 14.2 μs −58%
Pure.japanese.decode'.Text 3.64 μs 3.69 μs 59.0 μs 14.5 μs −75%
Pure.japanese.decode'.LazyText 3.77 μs 3.74 μs 34.4 μs 14.6 μs −57%
Pure.japanese.length.decode.Text 5.32 μs 5.41 μs 60.8 μs 16.2 μs −73%
Pure.japanese.length.decode.LazyText 5.49 μs 5.44 μs 36.1 μs 16.2 μs −55%

Thanks for benchmarking @BurningWitness. Sorry, I'm extra busy this week, will take a look later.

@BurningWitness sorry again, I didn't forget about your work here, but still no time to dive in properly.