haskell / text

Haskell library for space- and time-efficient operations over Unicode text.

Home Page: http://hackage.haskell.org/package/text

Decoupling byte-level encoding

BurningWitness opened this issue

When writing a JSON parser (GaloisInc/json#17) I needed some way to decode UTF-8, and to my dismay I found that none of the existing solutions fit my expectations:

  • GHC.Encoding.UTF8 and GHC.IO.Encoding are IO-based and I don't want that in a parser;
  • Data.Text.Internal.Encoding.Utf8, while pure, both reports errors as a bare Reject and has a rather complex interface;
  • Data.Text.Encoding.* and Data.Text.Lazy.Encoding.* are already parsers themselves, too high-level for this task;
  • utf8-string's Codec.Binary.UTF8.String consumes and returns lists, so it isn't parser-compatible.

I decided to handroll the UTF-8 decoding, which allowed me to categorize the errors (see Encoding.Mixed.Error) and resulted in a lot of code on the parser side that has little to do with consuming bytes per se (see Codec.Web.JSON.Parse.String).

However the code I wrote can instead be generalized to:

-- Assume Error is Encoding.Mixed.Error.Error.
-- Naming convention: UTF8_n yields a completed n-byte sequence;
-- Part_* constructors await the next byte of a longer sequence.
import Data.Word (Word8)

data UTF8 a = UTF8_1 a
            | Part_2 (Word8 -> UTF8_2 a)
            | Part_3_1 (Word8 -> Part_3_1 a)
            | Part_4_1 (Word8 -> Part_4_1 a)
            | Error_1 Error


data UTF8_2 a = UTF8_2 a
              | Error_2 Error


data Part_3_1 a = Part_3_2 (Word8 -> UTF8_3 a)
                | Error_3_1 Error

data UTF8_3 a = UTF8_3 a
              | Error_3_2 Error


data Part_4_1 a = Part_4_2 (Word8 -> Part_4_2 a)
                | Error_4_1 Error

data Part_4_2 a = Part_4_3 (Word8 -> UTF8_4 a)
                | Error_4_2 Error

data UTF8_4 a = UTF8_4 a
              | Error_4_3 Error


newtype Conv1 a = Conv1 (Word8 -> a)
newtype Conv2 a = Conv2 (Word8 -> Word8 -> a)
newtype Conv3 a = Conv3 (Word8 -> Word8 -> Word8 -> a)
newtype Conv4 a = Conv4 (Word8 -> Word8 -> Word8 -> Word8 -> a)

utf8 :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a -> Word8 -> UTF8 a
utf8 = -- I'm omitting the implementation, but it's only 50 lines long

Parsing then is simply unwrapping UTF8. This decouples character validation from conversion; the only part of decoding left is ensuring that only the maximal subpart of an ill-formed sequence is consumed, which is the responsibility of the parser.
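Below is a minimal sketch of what that unwrapping looks like, assuming the datatypes above and a concrete utf8 implementation; decodeOne, withByte and the list-based input are names and simplifications introduced purely for illustration (a real parser would track offsets into a ByteString instead).

-- Hedged sketch: one decoding step built on the proposed API. Consumes one
-- leading byte plus however many continuation bytes the sequence needs, and
-- reports the caller-supplied `truncated` error if input runs out mid-sequence.
decodeOne
  :: Conv1 a -> Conv2 a -> Conv3 a -> Conv4 a
  -> Error                     -- error to report on truncated input
  -> Word8                     -- leading byte, already consumed
  -> [Word8]                   -- remaining input
  -> Either Error (a, [Word8])
decodeOne c1 c2 c3 c4 truncated w0 bs0 =
  case utf8 c1 c2 c3 c4 w0 of
    UTF8_1 a    -> Right (a, bs0)
    Error_1 e   -> Left e

    Part_2 k    -> withByte bs0 $ \w1 bs1 ->
      case k w1 of
        UTF8_2 a    -> Right (a, bs1)
        Error_2 e   -> Left e

    Part_3_1 k  -> withByte bs0 $ \w1 bs1 ->
      case k w1 of
        Error_3_1 e -> Left e
        Part_3_2 k2 -> withByte bs1 $ \w2 bs2 ->
          case k2 w2 of
            UTF8_3 a    -> Right (a, bs2)
            Error_3_2 e -> Left e

    Part_4_1 k  -> withByte bs0 $ \w1 bs1 ->
      case k w1 of
        Error_4_1 e -> Left e
        Part_4_2 k2 -> withByte bs1 $ \w2 bs2 ->
          case k2 w2 of
            Error_4_2 e -> Left e
            Part_4_3 k3 -> withByte bs2 $ \w3 bs3 ->
              case k3 w3 of
                UTF8_4 a    -> Right (a, bs3)
                Error_4_3 e -> Left e
  where
    withByte []       _ = Left truncated
    withByte (w : ws) f = f w ws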


My proposal is to create a separate package focused specifically on decoding/encoding UTF-8/UTF-16/UTF-32 at the byte level. text could then drop some internal modules in favor of a simpler common interface.

This proposal is however naive: I do not know whether GHC can inline these datatypes reliably or, indeed, at all. Based on my cursory reading of the Secrets of the Glasgow Haskell Compiler inliner paper it should, as each of these expressions is trivial.

This doesn't clash with the issue of GHC's many UTF-8 implementations (outlined in GHC.Encoding.UTF8) as all other algorithms are in IO.

Other concerns:

  • text is a core library, so I assume an extra dependency can't just be added on a whim;
  • A package named utf already exists and is deprecated. I don't know how hard reclaiming deprecated packages is.

Adding a dependency to text is too much of a hassle IMO. But we can probably incorporate the desired changes into text itself. Could you please elaborate on why a naive parser from Data.Text.Internal.Encoding.Utf8 is not sufficient for your needs?

While Data.Text.Internal.Encoding.Utf8 is indeed functional enough to serve its purpose, my concerns are the following:

  • The interface is recursive, so the Incomplete state on the fourth byte is unreachable;
  • The Accept and Incomplete constructors force their fields, so returned codepoints need to be evaluated even if they're never used;
  • Ideally I'd want to share the error type with the text library, but alas DecodeError represents that as a String and there's no way to derive that from the Reject result.

I do have to admit that all of these issues are minor, and I do not know why anyone would ever need such fine-grained errors (other than cool error reporting), but the approach I'm proposing is the properly decoupled Haskell view of things.

One thing to note is that I haven't looked deeply into the structure of Hoehrmann's C-based decoder, but from what I can see, by-the-book decoding is just a chain of up to thirteen comparisons, so I don't yet understand the need for a complex state machine here (other than code brevity of course, but Haskell isn't C).
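For reference, here is the shape of the comparison chain I have in mind for the leading byte; this is a hedged illustration with a name of my own (leaderLength), not code from the linked repository:

import Data.Word (Word8)

-- Classify a leading byte with plain guards and no lookup table.
-- Returns the expected total sequence length, or 0 for a byte that can
-- never start a well-formed sequence.
leaderLength :: Word8 -> Int
leaderLength w
  | w < 0x80  = 1   -- 00..7F: ASCII, a single comparison
  | w < 0xC2  = 0   -- 80..C1: continuation byte or overlong leader
  | w < 0xE0  = 2   -- C2..DF: leads a two-byte sequence
  | w < 0xF0  = 3   -- E0..EF: leads a three-byte sequence
  | w < 0xF5  = 4   -- F0..F4: leads a four-byte sequence
  | otherwise = 0   -- F5..FF: never valid in UTF-8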

For performance reasons two array lookups are much better than up to 13 comparisons.

Once Reject is returned, one is supposed to apply whatever error reporting is desired. If you keep the previous state at hand, it should be fairly straightforward to do so.

For performance reasons two array lookups are much better than up to 13 comparisons.

Isn't this only true if the entire lookup table resides in L1 cache? Sure this will work fine for C parsers, but I don't know if any random Haskell parser interleaved with the algorithm can guarantee this.

Also it's 1 comparison for 00..7F and 5 for 80..7FF, so for really simple strings even two array lookups in L1 cache seem like overkill.

Rolling a benchmark to compare the two approaches should be easy, so perhaps I should do that.

The main blocker for this proposal is going to be performance. I'd be surprised if you can use your API to write a streaming JSON parser whose performance is comparable to using the Data.Text.Internal.Encoding.Utf8 module or the recently added validateUtf8Chunk (etc.) primitives in Data.Text.Internal.Encoding.

There is an intentional trade-off of a tiny bit of imprecision for a lot of performance. The parser state fits in a single byte (DecoderState), which can be easily unpacked by GHC optimizations into a tight loop that does no allocations. In contrast, an API like the one you propose, with lots of first-class functions, aims to more accurately represent the state machine for parsing UTF-8, reducing unreachable branches, but (1) GHC won't be able to optimize the allocations away, and (2) it's unclear how that granularity results in practical benefits.

The interface is recursive, so the Incomplete state on the fourth byte is unreachable;

Making that state unreachable is really the main point of your API, and as you mention it's unclear what the use case would be.

The Accept and Incomplete constructors force their fields, so returned codepoints need to be evaluated even if they're never used;

The fields are one word each. The expectation is that they are going to be unpacked in a tight loop that does not allocate. This is much cheaper than allocating a thunk for the partial codepoint to be evaluated only if it is used. If you don't need the partial code point (only doing validation), then you can use updateDecoderState instead.
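For example, a validation-only loop can be as small as the following hedged sketch (over a plain byte list for brevity; it assumes the text >= 2.0 exports named in the import and an Eq instance on DecoderState):

{-# LANGUAGE BangPatterns #-}

import Data.Word (Word8)
import Data.Text.Internal.Encoding.Utf8
  (DecoderState, updateDecoderState, utf8AcceptState, utf8RejectState)

-- Thread the one-byte DecoderState through a strict loop; no per-byte allocation.
validUtf8 :: [Word8] -> Bool
validUtf8 = go utf8AcceptState
  where
    go !s []       = s == utf8AcceptState        -- input must not end mid-sequence
    go !s (w : ws)
      | s' == utf8RejectState = False            -- ill-formed byte encountered
      | otherwise             = go s' ws
      where s' = updateDecoderState w s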

Isn't this only true if the entire lookup table resides in L1 cache? Sure this will work fine for C parsers, but I don't know if any random Haskell parser interleaved with the algorithm can guarantee this.

This is likely to be true irrespective of caches: bear in mind that in your case each comparison is a condition and involves 1 or 2 jump instructions depending on the branch chosen.

So after a week of tinkering I made a benchmark repository (link). The algorithms I wrote involve no low-level magic, just inlines and strict copying. The benchmark timings can be found in the README.md there; here are a few points that follow from the results:

  • GHC does indeed inline the data structure, even at -O1. I NOINLINEd both of the parsers I wrote, and the only places that retain references to Codec.Encoding.UTF8 in the final STG are the Text variants at chunk end, solely because I force it in the Resume type;

  • Pretty much all UTF-8 decoding is done using simdutf, so on every chunk border the arrays have to be pulled back from the ether just to do 1-4 lookups;

  • decodeUtf8 does not follow the maximal subpart rule.

Problems I could not resolve:

  • For whatever reason I can't turn off the simdutf flag. If someone can try out decodeUtf8 without the SIMD algorithm, that'd be quite nice as it's probably the only place that clearly outperforms my solution;

  • Based on the fact that the SIMD version of my Text algorithm runs faster on late errors than the basic one, the latter must be screwing up inlining and should be at least 10% faster when done right. This isn't that important, so I haven't dug into it;

Also I wonder why simdutf returns a boolean when it could return the last known valid UTF-8 boundary.


For the record all of my benchmarks have been executed on a laptop CPU, so, as with all things cache-related, YMMV, and extra benchmark subjects are welcome.

I concede that you can get your data structure to be inlined. But that relies on unrolling the loop yourself so you always start an iteration at a code point boundary. Performance-wise, the extra branching may have a detrimental effect on branch prediction. Your initial comment about IO made me assume you didn't want simdutf but I misunderstood. If the main loop uses simdutf then performance for the error branch is much less of a concern.

I'm still not convinced a more fine-grained API for UTF-8 is really better. I disagree that, in comparison, "Data.Text.Internal.Encoding.Utf8 (...) has a rather complex interface." That API is an automaton, which is as simple and standard as it gets: byte goes in, new state and/or output comes out. You don't have to unroll four nested pattern-matches to use that API efficiently. I think the main bit of apparent complexity is that it exposes the internal state for error reporting, and that part of the interface could be cleaned up to make it easier to diagnose errors.
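To make the "byte goes in, new state and/or output comes out" point concrete, here is a hedged sketch of driving that automaton over a plain byte list (decodeList is a name of my own; it assumes the text >= 2.0 exports in the import and stops at the first error without attempting the maximal-subpart rule):

import Data.Word (Word8)
import Data.Text.Internal.Encoding.Utf8
  (DecoderResult (..), utf8DecodeStart, utf8DecodeContinue)

-- Decode as many code points as possible, returning them together with the
-- input remaining after the first rejected byte (empty on success or when
-- the input ends in the middle of a sequence).
decodeList :: [Word8] -> ([Char], [Word8])
decodeList = start
  where
    start []       = ([], [])
    start (w : ws) = step (utf8DecodeStart w) ws

    step (Accept c)       ws       = let (cs, rest) = start ws in (c : cs, rest)
    step Reject           ws       = ([], ws)
    step (Incomplete _ _) []       = ([], [])
    step (Incomplete s c) (w : ws) = step (utf8DecodeContinue w s c) ws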

decodeUtf8 does not follow the maximal subpart rule.

That sounds like a bug, right?

Also I wonder why simdutf returns a boolean when it could return the last known valid UTF-8 boundary.

That's a feature request for simdutf. I don't know what the current status is, but it would indeed let us simplify UTF-8 parsing further.

Also, another API you haven't mentioned is Data.Text.Encoding.decodeUtf8Chunk. What's your opinion of that for your problem?

decodeUtf8 does not follow the maximal subpart rule ... sounds like a bug, right?

Yes, it should probably have its own issue. Can be replicated through the tests here.

You don't have to unroll four nested pattern-matches to use that API efficiently.

If you wish to respect the maximal subpart rule, an error encountered on the first byte results in consuming that byte, while an error on any successive byte does not (for example, for the bytes E2 28 only the E2 is consumed and reported, and the 28 is re-examined as the start of the next sequence). As such you need to track byte boundaries; that's four repeats with the array lookup algorithm. The full unroll is just as deep on the 4-byte branch, and every other branch is shallower than that.

I disagree that in comparison "Data.Text.Internal.Encoding.Utf8 (...) has a rather complex interface."

I admit my phrasing on this point was off; if anything, the fact that it's in an Internal module is a much better reason not to use it.

I'm fine with it existing in a separate module with proper documentation, as it may indeed be useful for some highly specific parsers, but so far the benchmarks I linked above show it's not even performance-critical in this library.

another API you haven't mentioned is Data.Text.Encoding.decodeUtf8Chunk

I have; it's the third bullet point of this issue. A JSON parser needs to treat " as end-of-parse and \ as its own small subparser, so anything beyond a byte-level decoder doesn't fit the purpose.

Having conceded that performance is not an issue, the only remaining difference with Data.Text.Internal.Encoding.Utf8 I see is that your API lets you not have any parser state (except the offset) in between code points. Am I missing anything else?

The trade-off is that you have to write a big tree of nested cases to effectively use that API, since every byte within a codepoint results in a different type of state. Those 40 lines of code correspond to these 7 lines of code in the text library. So even purely in terms of aesthetics ("the properly decoupled Haskell view of things.") it's a hard sell.

The proposed API actually makes things more coupled than Data.Text.Internal.Encoding.Utf8 because it exposes too many details of UTF-8 in the types.

In that case, would making Data.Text.Internal.Encoding.Utf8 not internal resolve this?

lets you not have any parser state (except the offset) in between code points

I don't think "lets" is the correct term here; you can weave any state you want into it, it's just a datatype.

in terms of aesthetics ("the properly decoupled Haskell view of things.") it's a hard sell / exposes too much UTF-8 in the types

The entire point is that it exposes all the inner workings while abstracting away all the hard parts. "Decoupled" doesn't mean "short" or "convenient", it just means you get the power to write whatever you want with no frills. It's obviously a low-level module, so people using it will be ready to spend five extra minutes and duplicate 30 lines.

making Data.Text.Internal.Encoding.Utf8 not internal resolve this

It would definitely help with other people using it, but at this point I would rather carry around a 170-line module that does it in a much more streamlined fashion with predictable performance.

This applies to the StrictBuilder as well (I call it Copy on my side). The exposed API can be used to do what's advertised, but it's not exposed properly or documented succinctly enough to be useful.


For the record you don't need a strong reason to deny this proposal, a simple "we don't have people for this, sorry" is enough. The reason I'm pushing for it is because I already have two different places I want to use it in and I don't want to toss it onto my personal pile of "should be on Hackage, but uploading is a nuisance, I haven't tested it enough and the previous maintainer is nowhere to be seen" projects.

I'm just making sure that I'm not completely missing the point of your proposal. Beyond that, we'll indeed have to agree to disagree, unless another @haskell/text maintainer wants to chime in.

  • Pretty much all UTF-8 decoding is done using simdutf, so on every chunk border the arrays have to be pulled back from the ether just to do 1-4 lookups;

There are three engines for UTF-8 validation in text:

  • If you can afford linking against C++, simdutf is used for bulk processing, and the naive engine kicks in only at the boundaries of chunks. Somewhat frustratingly, if you get a precompiled text directly from a GHC bindist, the simdutf flag is most likely disabled (because of linking issues).
  • Otherwise, if bytestring >= 0.11.5 (which is fairly new) is available, we use the UTF-8 engine from there (written in C) and again invoke the naive engine only at the boundaries.
  • Otherwise we use the naive engine full time.

I'm not sure what the story is supposed to be for the JavaScript backend: it might turn out that it's better to use the naive engine instead of compiling the C decoder from bytestring into JS.

If you want to benchmark Haskell native implementations, pass cabal build --constraint 'text -simdutf' --constraint 'bytestring < 0.11.5'.

There are ways to embellish Data.Text.Internal.Encoding.Utf8, e.g., expose byteToClass, provide descriptive pattern synonyms for ByteClass, and add something like explain :: DecoderState -> ByteClass -> String, which produces an explanation of what exactly went wrong. I am however reluctant to replace the mechanism entirely or to add one more UTF-8 decoding engine.

I agree that more fine-grained error reporting has its use cases, but I feel that it's better to iterate on it outside of text, in a separate package. Bear in mind, it is very difficult to change something in a boot library, and not easy to allow users to upgrade, so it's better to evolve the API elsewhere.

pass cabal build --constraint 'text -simdutf' --constraint 'bytestring < 0.11.5'

While I was missing the fact that bytestring needs to have a specific version, neither constraints, nor cabal.project modifications, nor even specifying a bytestring bound directly in the cabal file change anything. Even source-repository-package over a git clone doesn't apply, so I'm out of relatively sane options here.

There are three engines for UTF-8 validation

The performance concern applies specifically to the side case of using the simdutf/bytestring C validator, since crossing chunk borders with continuations still uses the array lookup algorithm. This is something I have tested, and it's slower even than naive comparisons (mind you, my algorithm is actually very slow in this side case too, since I force it to allocate the data structure).

a more fine-grained error reporting

My original point was that I wanted to share error handling with text for consistency, but now I know that an OnDecodeError handler is effectively passed a constant String and an entirely ambiguous Word8. As such this point is moot.
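For context, a hedged sketch of what a handler can actually see (myLenientDecode is a name of my own; the types come from Data.Text.Encoding.Error, where OnDecodeError unfolds to String -> Maybe Word8 -> Maybe Char):

import Data.Word (Word8)

-- The String is a fixed description and the Maybe Word8 is the offending byte
-- with no surrounding context, so a handler can't do much beyond substituting:
myLenientDecode :: String -> Maybe Word8 -> Maybe Char
myLenientDecode _message _offendingByte = Just '\xFFFD'  -- always emit U+FFFD

This is essentially what the library's own lenientDecode does.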


Right now the proposal grinds down to the following points:

  • Both the current array lookup algorithm and zero-cost datatype versions could be moved into a separate package or a set of non-internal modules within text, which would also allow the removal of Data.Text.Internal.Encoding.Utf* modules;

  • OnDecodeError and UnicodeException do not provide any reliable error information and as such may be reduced to Maybe Char and a unit respectively;

  • There is a minor performance improvement to be gained from using regular branches instead of array lookup when using simdutf/C validation code;

  • It may be a good idea to move StrictBuilder out of internals as well.

As the main point of this issue is adding algorithms that are not immediately needed within the library and cannot be abstracted into a separate boot library for management reasons, this issue is indeed dead in the water. If no one else has any strong opinions on this topic, I will close the issue at the end of this week.

Even source-repository-package over a git clone doesn't apply, so I'm out of relatively sane options here.

That's extremely strange; could you share a reproducer? It might be worth raising a bug against Cabal.

The performance concern applies specifically to the sidecase of using simdutf/bytestring C validator,

My point above was that there are situations when the naive decoder is the only one available, and its performance matters. If one wants to make statements about performance, this case should be measured with simdutf / the bytestring engine disabled.

the current array lookup algorithm ... could be moved into ... a set of non-internal modules within text

Makes sense to me.

  • OnDecodeError and UnicodeException do not provide any reliable error information and as such may be reduced to Maybe Char and a unit respectively;

That's largely true. Unfortunately, it's very difficult to iterate on a better interface without repeatedly breaking clients. There is not much demand though: usually clients treat pretty much any UTF-8 decoding error as just "hey, this data is not UTF-8", and the precise reason for the offence matters less. I appreciate that JSON decoding is somewhat less forgiving.

Anyways thanks for your efforts and interest!

Okay, I was able to run the benchmarks without SIMD by git cloning the package, renaming it and adding it to the packages section of the cabal.project, then adding PackageImports clarifications everywhere.

The results are surprisingly bad for the array lookup algorithm.

Variant            Correct               Early errors          Late errors           Garbage
                   32KiB      2MiB       32KiB      2MiB       32KiB      2MiB       32KiB      2MiB
Hoehrmann (SIMD)   13.69 μs   1.183 ms   22.45 μs   1.790 ms   163.8 μs   12.04 ms   7.104 ms   459.3 ms
Lazy (SIMD)        10.94 μs   888.5 μs   17.38 μs   1.255 ms   103.5 μs   7.962 ms   3.435 ms   221.0 ms
Hoehrmann          162.3 μs   12.24 ms   163.1 μs   11.68 ms   167.8 μs   12.52 ms   917.7 μs   58.82 ms
Lazy               93.24 μs   7.756 ms   119.0 μs   8.576 ms   121.6 μs   8.614 ms   611.7 μs   41.46 ms

I'm going to need someone to replicate this on their end and to check my findings for correctness, of course.

For the sake of benchmark reproducibility I incorporated the changes in a fork.

I have inlined everything as best I could; the only thing I did not touch is Data.Text.Internal.StrictBuilder (appendR zero-length checks may be the cause of the SIMD performance losses seen previously).


The following list includes every single library benchmark that matches the pattern $0 ~ /ecode/.

73620de -- HEAD
ebb70b1 -- Naive algorithm (with 73620de as baseline)

17bb010 -- HEAD without SIMD validation
67dea22 -- Naive algorithm without SIMD validation (with 17bb010 as baseline)

Test case 73620de ebb70b1 17bb010 67dea22
(a trailing percentage is the change of the preceding measurement relative to its baseline commit)
DecodeUtf8.html.Strict 69.9 μs 69.2 μs 1.41 ms 252 μs −82%
DecodeUtf8.html.Stream 70.1 μs 68.4 μs 823 μs 264 μs −67%
DecodeUtf8.html.StrictLength 111 μs 111 μs 1.45 ms 291 μs −80%
DecodeUtf8.html.StrictInitLength 112 μs 109 μs 1.45 ms 291 μs −79%
DecodeUtf8.html.Lazy 67.5 μs 68.6 μs 820 μs 262 μs −67%
DecodeUtf8.html.LazyLength 112 μs 111 μs 857 μs 334 μs −60%
DecodeUtf8.html.LazyInitLength 111 μs 110 μs 857 μs 301 μs −64%
DecodeUtf8.xml.Strict 11.3 ms 11.3 ms 245 ms 71.7 ms −70%
DecodeUtf8.xml.Stream 15.1 ms 14.9 ms 174 ms 78.2 ms −55%
DecodeUtf8.xml.StrictLength 19.2 ms 18.7 ms 252 ms 79.4 ms −68%
DecodeUtf8.xml.StrictInitLength 19.3 ms 19.1 ms 251 ms 79.4 ms −68%
DecodeUtf8.xml.Lazy 13.4 ms 13.3 ms 170 ms 76.7 ms −54%
DecodeUtf8.xml.LazyLength 19.8 ms 19.6 ms 176 ms 83.4 ms −52%
DecodeUtf8.xml.LazyInitLength 19.7 ms 19.6 ms 175 ms 83.1 ms −52%
DecodeUtf8.ascii.Strict 7.52 ms 7.46 ms 254 ms 35.5 ms −86%
DecodeUtf8.ascii.Stream 11.3 ms 11.1 ms 162 ms 39.2 ms −75%
DecodeUtf8.ascii.StrictLength 17.2 ms 16.1 ms 264 ms 44.5 ms −83%
DecodeUtf8.ascii.StrictInitLength 15.8 ms 15.6 ms 263 ms 44.4 ms −83%
DecodeUtf8.ascii.Lazy 12.4 ms 12.4 ms 161 ms 36.7 ms −77%
DecodeUtf8.ascii.LazyLength 19.1 ms 18.7 ms 170 ms 44.8 ms −73%
DecodeUtf8.ascii.LazyInitLength 18.9 ms 18.6 ms 168 ms 44.2 ms −73%
DecodeUtf8.russian.Strict 1.17 ms 1.17 ms 25.5 ms 8.36 ms −67%
DecodeUtf8.russian.Stream 1.37 ms 1.37 ms 16.4 ms 8.58 ms −47%
DecodeUtf8.russian.StrictLength 1.88 ms 1.89 ms 26.0 ms 9.75 ms −62%
DecodeUtf8.russian.StrictInitLength 1.88 ms 1.89 ms 26.0 ms 9.28 ms −64%
DecodeUtf8.russian.Lazy 1.37 ms 1.37 ms 16.5 ms 8.57 ms −48%
DecodeUtf8.russian.LazyLength 2.05 ms 2.03 ms 17.1 ms 9.24 ms −46%
DecodeUtf8.russian.LazyInitLength 2.03 ms 2.04 ms 16.8 ms 9.24 ms −45%
DecodeUtf8.japanese.Strict 3.61 μs 3.67 μs 59.0 μs 14.5 μs −75%
DecodeUtf8.japanese.Stream 3.63 μs 3.72 μs 31.5 μs 14.5 μs −53%
DecodeUtf8.japanese.StrictLength 5.34 μs 5.40 μs 60.9 μs 16.4 μs −73%
DecodeUtf8.japanese.StrictInitLength 5.32 μs 5.40 μs 60.3 μs 16.1 μs −73%
DecodeUtf8.japanese.Lazy 3.63 μs 3.62 μs 31.5 μs 14.5 μs −53%
DecodeUtf8.japanese.LazyLength 5.44 μs 5.42 μs 33.2 μs 16.3 μs −50%
DecodeUtf8.japanese.LazyInitLength 5.46 μs 5.43 μs 33.4 μs 16.1 μs −51%
DecodeUtf8.ascii.strict decodeUtf8 7.66 ms 7.41 ms 256 ms 35.4 ms −86%
DecodeUtf8.ascii.strict decodeLatin1 8.12 ms 8.02 ms 8.03 ms 8.06 ms
DecodeUtf8.ascii.strict decodeASCII 8.06 ms 8.05 ms 9.17 ms 8.06 ms −12%
DecodeUtf8.ascii.lazy decodeUtf8 11.4 ms 11.0 ms −3% 168 ms 37.3 ms −77%
DecodeUtf8.ascii.lazy decodeLatin1 13.1 ms 13.1 ms 14.0 ms 13.0 ms −7%
DecodeUtf8.ascii.lazy decodeASCII 11.6 ms 11.6 ms 13.0 ms 11.6 ms −11%
Pure.tiny.decode.Text 35.6 ns 59.7 ns +67% 27.7 ns 50.0 ns +80%
Pure.tiny.decode.LazyText 116 ns 87.8 ns −24% 133 ns 75.5 ns −43%
Pure.tiny.decode'.Text 47.9 ns 74.4 ns +55% 45.4 ns 62.7 ns +38%
Pure.tiny.decode'.LazyText 150 ns 115 ns −23% 159 ns 109 ns −31%
Pure.tiny.length.decode.Text 45.5 ns 73.7 ns +61% 38.5 ns 64.7 ns +68%
Pure.tiny.length.decode.LazyText 130 ns 90.5 ns −30% 146 ns 89.7 ns −38%
Pure.ascii-small.decode.Text 9.48 μs 9.57 μs 311 μs 46.4 μs −85%
Pure.ascii-small.decode.LazyText 11.6 μs 11.4 μs 237 μs 46.7 μs −80%
Pure.ascii-small.decode'.Text 9.58 μs 9.54 μs 310 μs 45.4 μs −85%
Pure.ascii-small.decode'.LazyText 11.6 μs 11.2 μs 237 μs 47.0 μs −80%
Pure.ascii-small.length.decode.Text 18.9 μs 18.9 μs 318 μs 55.3 μs −82%
Pure.ascii-small.length.decode.LazyText 20.4 μs 20.2 μs 244 μs 56.9 μs −76%
Pure.ascii.decode.Text 7.45 ms 7.42 ms 258 ms 35.5 ms −86%
Pure.ascii.decode.LazyText 20.8 ms 20.5 ms 205 ms 46.2 ms −77%
Pure.ascii.decode'.Text 7.40 ms 7.41 ms 254 ms 35.6 ms −86%
Pure.ascii.decode'.LazyText 20.6 ms 20.5 ms 205 ms 37.2 ms −81%
Pure.ascii.length.decode.Text 15.6 ms 15.5 ms 264 ms 44.6 ms −83%
Pure.ascii.length.decode.LazyText 19.9 ms 19.5 ms 213 ms 44.7 ms −78%
Pure.english.decode.Text 245 μs 201 μs −17% 17.8 ms 2.30 ms −87%
Pure.english.decode.LazyText 807 μs 800 μs 14.4 ms 2.55 ms −82%
Pure.english.decode'.Text 242 μs 201 μs −16% 17.3 ms 2.30 ms −86%
Pure.english.decode'.LazyText 817 μs 801 μs 14.2 ms 2.53 ms −82%
Pure.english.length.decode.Text 916 μs 911 μs 18.0 ms 2.91 ms −83%
Pure.english.length.decode.LazyText 1.35 ms 1.35 ms 14.1 ms 3.01 ms −78%
Pure.russian.decode.Text 3.30 μs 3.35 μs 59.5 μs 20.4 μs −65%
Pure.russian.decode.LazyText 3.39 μs 3.37 μs 41.7 μs 20.4 μs −51%
Pure.russian.decode'.Text 3.31 μs 3.37 μs 60.1 μs 20.4 μs −66%
Pure.russian.decode'.LazyText 3.45 μs 3.42 μs 41.8 μs 20.4 μs −51%
Pure.russian.length.decode.Text 4.90 μs 4.97 μs 61.6 μs 21.9 μs −64%
Pure.russian.length.decode.LazyText 5.03 μs 5.00 μs 43.2 μs 22.0 μs −49%
Pure.japanese.decode.Text 3.53 μs 3.58 μs 59.0 μs 14.5 μs −75%
Pure.japanese.decode.LazyText 3.73 μs 3.71 μs 34.4 μs 14.2 μs −58%
Pure.japanese.decode'.Text 3.64 μs 3.69 μs 59.0 μs 14.5 μs −75%
Pure.japanese.decode'.LazyText 3.77 μs 3.74 μs 34.4 μs 14.6 μs −57%
Pure.japanese.length.decode.Text 5.32 μs 5.41 μs 60.8 μs 16.2 μs −73%
Pure.japanese.length.decode.LazyText 5.49 μs 5.44 μs 36.1 μs 16.2 μs −55%

Thanks for benchmarking @BurningWitness. Sorry, I'm extra busy this week, will take a look later.

@BurningWitness sorry again, I didn't forget about your work here, but still no time to dive in properly.