LifeIsStrange / utf8.java

Vectorized UTF-8 Validation for Java

utf8.java

Vectorized UTF-8 validation & benchmarks, written in Java.

Based on the paper by John Keiser and Daniel Lemire, with minor modifications.

Verify Correctness

Make sure to have Java 17 or greater installed. Then execute:

mvn compile assembly:single && \
java --enable-preview --add-modules jdk.incubator.vector \
-jar target/utf8.java-1.0-SNAPSHOT-jar-with-dependencies.jar [optional list of space-delimited file paths]

With no arguments, this will run the UTF-8 validator on 4 source files:

  • twitter.json: 631_515 bytes, frequent multi-byte characters. Taken from here.
  • utf8-demo.txt: 13_459 bytes, many special / tricky utf8 characters. From w3.org.
  • utf8-demo-invalid.txt: same as utf8-demo.txt, but with one error.
  • 20k.txt: 3.8 MB, all ascii.
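When cross-checking results on files like these, it can help to have an independent yes/no answer from the JDK itself. The following is a minimal sketch, not part of this project: the class name `StrictUtf8Check` and the method `isValidUtf8` are hypothetical, and it assumes that a strict `CharsetDecoder` (configured to report malformed input instead of replacing it) is an acceptable reference.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of utf8.java: a strict reference check built on the JDK decoder.
public class StrictUtf8Check {

    /** Returns true if buf decodes as UTF-8 without any malformed sequence. */
    public static boolean isValidUtf8(byte[] buf) {
        // REPORT makes the decoder throw instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(buf));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ok = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] overlong = {(byte) 0xC0, (byte) 0xAF};   // overlong encoding of '/'
        byte[] truncated = {(byte) 0xE2, (byte) 0x82};  // 3-byte sequence cut short
        System.out.println(isValidUtf8(ok));
        System.out.println(isValidUtf8(overlong));
        System.out.println(isValidUtf8(truncated));
    }
}
```

Like `new String(buf, UTF_8)`, this decodes into a buffer as a side effect, so it is suited to correctness checks rather than speed comparisons.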

Running Benchmarks

mvn verify && java -jar target/benchmarks.jar

The JMH benchmarks use the same 4 test files mentioned above, at 3 vector lengths: 128, 256, and 512 bits. Most likely your hardware does not support 512-bit vectors, in which case those benchmarks fall back to the slow array-based implementation.

The jdk_decode benchmark uses the JDK's new String(buf, UTF_8). This constructor produces a new String in addition to validating, but it is good enough for a baseline.
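For a rough feel of what the jdk_decode baseline measures, the sketch below times `new String(buf, UTF_8)` by hand and converts the result to GB/sec. It is only an illustration under stated assumptions: the class `DecodeThroughput`, the helper `gbPerSec`, the 4 MB all-ASCII buffer, and the round count are all hypothetical, and a hand-rolled timing loop is no substitute for the JMH harness used by the project.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch, not the project's benchmark: hand-timed decode throughput.
public class DecodeThroughput {

    /** Converts bytes processed and elapsed nanoseconds to GB/sec (1 GB = 10^9 bytes). */
    public static double gbPerSec(long bytes, long nanos) {
        // bytes per nanosecond is numerically equal to 10^9 bytes per second
        return (double) bytes / nanos;
    }

    public static void main(String[] args) {
        byte[] buf = new byte[4_000_000];   // assumed input: 4 MB of ASCII
        Arrays.fill(buf, (byte) 'a');
        int rounds = 50;
        long sink = 0;                      // keeps the JIT from discarding the work

        // Warm-up pass so the timed loop runs compiled code.
        for (int i = 0; i < rounds; i++) {
            sink += new String(buf, StandardCharsets.UTF_8).length();
        }

        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += new String(buf, StandardCharsets.UTF_8).length();
        }
        long elapsed = System.nanoTime() - start;

        System.out.printf("%.2f GB/sec (sink=%d)%n",
                gbPerSec((long) buf.length * rounds, elapsed), sink);
    }
}
```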

Performance

Throughput for twitter.json:

| new String(buf, UTF_8) | Utf8.validate(buf, new LookupTables256()) | simdjson::validate_utf8(str, len) |
| --- | --- | --- |
| 0.8 GB/sec | 3.52 GB/sec | 24 GB/sec |

  • The JDK algorithm is heavily optimized: it uses intrinsics to check for negative bytes (the ASCII shortcut) and to elide array bounds checks.

  • In the vectorized algorithm, 256-bit vectors currently perform best. We cannot go smaller than 128 bits, since nibbles (4 bits) are used to select from the lookup tables: a nibble has 16 possible values, so each table needs 16 one-byte entries, which is exactly 128 bits.
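The nibble-indexed lookup mentioned above can be illustrated in scalar form. This is a minimal sketch, not the project's lookup tables: the class `NibbleLookup` and the table `LENGTH_BY_HIGH_NIBBLE` are hypothetical, and the table only classifies a byte by its high nibble. Full validation needs further checks, since bytes such as 0xC0 or 0xF5 look like lead bytes here but are never valid.

```java
// Hypothetical illustration, not the project's tables: a 16-entry lookup indexed by a nibble.
public class NibbleLookup {

    // One byte per possible nibble value: 16 entries x 8 bits = 128 bits.
    // 1 = ASCII, 0 = continuation byte, 2/3/4 = lead byte of a 2/3/4-byte sequence.
    public static final byte[] LENGTH_BY_HIGH_NIBBLE = {
        1, 1, 1, 1, 1, 1, 1, 1,   // 0x0..0x7: ASCII
        0, 0, 0, 0,               // 0x8..0xB: continuation
        2, 2,                     // 0xC..0xD: 2-byte lead
        3,                        // 0xE:      3-byte lead
        4                         // 0xF:      4-byte lead
    };

    /** Upper 4 bits of the byte, as an index in 0..15. */
    public static int highNibble(byte b) {
        return (b & 0xFF) >>> 4;
    }

    /** Lower 4 bits of the byte, as an index in 0..15. */
    public static int lowNibble(byte b) {
        return b & 0x0F;
    }

    /** Looks up the byte's class using its high nibble as the table index. */
    public static int classify(byte b) {
        return LENGTH_BY_HIGH_NIBBLE[highNibble(b)];
    }

    public static void main(String[] args) {
        byte[] euro = {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}; // UTF-8 bytes of U+20AC
        for (byte b : euro) {
            System.out.printf("0x%02X -> high=%X low=%X class=%d%n",
                    b & 0xFF, highNibble(b), lowNibble(b), classify(b));
        }
    }
}
```

In a vectorized version the same idea is applied to every byte lane at once, with the 16-entry table held in a vector, which is why the table size sets the minimum vector width.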

JDK 16 (16 March 2021) vs Latest Panama Build (13 September 2021)

[Figure: 256-bit JDK comparison]

In 181 days, performance has increased up to 127-fold!

[Figure: 128-bit JDK comparison]

Conclusion

  • The Vector API is expressive and a pleasure to use. Performance is getting better.
  • Abstracting over ISA and even vector Shape is incredible for portability, given how fragmented vector instruction sets are.
  • The dissonance between Vector and ByteVector is a little annoying.
  • If I had benchmarked iteratively while developing, I could've discovered the causes of slowdown sooner.
  • I wish there was a (documented) debug/logging flag.
  • Vector::selectFrom is awesome for lookup tables.
  • The project's JavaDoc is one of the best introductions to vectorization on the internet.
  • Debugging works great! I do wish we could make Vector::toString print hex instead of base 10 by default.
  • Would have been nice if performance was a success story, but failure is educational.
  • I think someone could implement simdjson in Java if they wanted to. Would it be fast? Not for now, at least.
