Blazingly fast API-compatible UTF-8 validation for Rust using SIMD extensions, based on the implementation from simdjson. Originally ported to Rust by the developers of simd-json.rs.
This software should be considered alpha quality and should not (yet) be used in production, though it has been tested with sample data as well as a fuzzer and there are no known bugs. It will be tested more rigorously before the first production release.
- basic API for the fastest validation, optimized for valid UTF-8
- compat API as a fully compatible replacement for std::str::from_utf8()
- Up to twenty times faster than the std library on non-ASCII, up to twice as fast on ASCII
- Up to 28% faster on non-ASCII input compared to the original simdjson implementation
- Supports AVX 2 and SSE 4.2 implementations on x86 and x86-64. ARMv7 and ARMv8 neon support is planned
- Selects the fastest implementation at runtime based on CPU support
- Written in pure Rust
- No dependencies
- No-std support
- Falls back to the excellent std implementation if SIMD extensions are not supported
Add the dependency to your Cargo.toml file:
[dependencies]
simdutf8 = { version = "0.0.4" }
Use simdutf8::basic::from_utf8 as a drop-in replacement for std::str::from_utf8().
use simdutf8::basic::from_utf8;
println!("{}", from_utf8(b"I \xE2\x9D\xA4\xEF\xB8\x8F UTF-8!").unwrap());
If you need detailed information on validation failures, use simdutf8::compat::from_utf8 instead.
use simdutf8::compat::from_utf8;
let err = from_utf8(b"I \xE2\x9D\xA4\xEF\xB8 UTF-8!").unwrap_err();
assert_eq!(err.valid_up_to(), 5);
assert_eq!(err.error_len(), Some(2));
Use the basic API flavor for maximum speed. It is fastest on valid UTF-8, but only checks for errors after processing the whole byte sequence and does not provide detailed information if the data is not valid UTF-8. simdutf8::basic::Utf8Error is a zero-sized error struct.
The compat flavor is fully API-compatible with std::str::from_utf8. In particular, simdutf8::compat::from_utf8() returns a simdutf8::compat::Utf8Error, which has valid_up_to() and error_len() methods. The first is useful for verification of streamed data. The second is useful e.g. for replacing invalid byte sequences with a replacement character.
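As an illustration of how error_len() supports replacement, here is a minimal sketch of lossy decoding. It assumes the compat flavor behaves like std::str::from_utf8, as described above, and uses the std function so that it runs without dependencies. to_string_lossy is a hypothetical helper written for this example, not part of simdutf8.

```rust
// Swapping this import for `simdutf8::compat::from_utf8` should work the same
// way, assuming the API compatibility described above.
use std::str::from_utf8;

/// Hypothetical helper: decodes `input`, replacing each invalid byte
/// sequence with U+FFFD.
fn to_string_lossy(mut input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    loop {
        match from_utf8(input) {
            Ok(valid) => {
                out.push_str(valid);
                return out;
            }
            Err(err) => {
                // Everything before valid_up_to() is known to be valid UTF-8.
                let (valid, rest) = input.split_at(err.valid_up_to());
                out.push_str(from_utf8(valid).expect("prefix was already validated"));
                out.push('\u{FFFD}');
                match err.error_len() {
                    // Skip the invalid sequence and continue after it.
                    Some(len) => input = &rest[len..],
                    // None means the input ended in the middle of a sequence.
                    None => return out,
                }
            }
        }
    }
}
```

The prefix is validated a second time here to keep the sketch free of unsafe code.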
It also fails early: errors are checked on-the-fly as the string is processed and once an invalid UTF-8 sequence is encountered, it returns without processing the rest of the data. This comes at a performance penalty compared to the basic API even if the input is valid UTF-8.
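For streamed data, valid_up_to() and error_len() can be combined to tell a truly invalid chunk from one that merely ends in the middle of a multi-byte sequence. The following is a rough sketch under the assumption that the compat flavor mirrors std::str::from_utf8, which is used here so the code runs without dependencies. StreamValidator is a hypothetical type made up for this example.

```rust
// Swap for `simdutf8::compat::from_utf8` under the compatibility assumption.
use std::str::from_utf8;

/// Hypothetical helper: validates a byte stream chunk by chunk.
struct StreamValidator {
    /// Bytes of an unfinished multi-byte sequence from earlier chunks.
    pending: Vec<u8>,
}

impl StreamValidator {
    fn new() -> Self {
        Self { pending: Vec::new() }
    }

    /// Returns false as soon as a definitely invalid sequence is seen.
    fn feed(&mut self, chunk: &[u8]) -> bool {
        // Copying the chunk keeps the sketch simple; a real implementation
        // would only copy the few bytes around the chunk boundary.
        self.pending.extend_from_slice(chunk);
        let err = match from_utf8(&self.pending) {
            Ok(_) => None,
            Err(e) => Some(e),
        };
        match err {
            None => {
                self.pending.clear();
                true
            }
            Some(e) => match e.error_len() {
                // A concrete invalid sequence: the stream is not UTF-8.
                Some(_) => false,
                // The data ended mid-sequence: keep only the unfinished tail.
                None => {
                    self.pending.drain(..e.valid_up_to());
                    true
                }
            },
        }
    }

    /// The stream is valid only if no sequence is left unfinished at the end.
    fn finish(&self) -> bool {
        self.pending.is_empty()
    }
}
```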
The fastest implementation is selected at runtime using the std::is_x86_feature_detected! macro unless the CPU targeted by the compiler supports the fastest available implementation. So if you compile with RUSTFLAGS="-C target-cpu=native" on a recent x86-64 machine, the AVX 2 implementation is selected at compile time and runtime selection is disabled.
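The selection logic described above can be pictured roughly as follows. This is only a sketch of the idea, not the crate's actual code: select_implementation and the returned labels are invented for illustration, and the real crate dispatches to validation functions rather than returning a name.

```rust
/// Hypothetical helper: reports which implementation a dispatcher in the
/// style described above would pick on the current machine.
fn select_implementation() -> &'static str {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // Decided at compile time, e.g. with -C target-cpu=native on an
        // AVX 2 machine; no runtime check is needed in that case.
        if cfg!(target_feature = "avx2") {
            return "avx2 (compile time)";
        }
        // Otherwise ask the CPU at runtime, fastest option first.
        if std::is_x86_feature_detected!("avx2") {
            return "avx2 (runtime)";
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return "sse4.2 (runtime)";
        }
    }
    // No supported SIMD extension: use the standard library.
    "std fallback"
}
```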
For no-std support (compiled with --no-default-features) the implementation is always selected at compile time based on the targeted CPU. Use RUSTFLAGS="-C target-feature=+avx2" for the AVX 2 implementation or RUSTFLAGS="-C target-feature=+sse4.2" for the SSE 4.2 implementation.
If you want to be able to call a SIMD implementation directly, use the public_imp feature flag. The validation implementations are then accessible via simdutf8::(basic|compat)::imp::x86::(avx2|sse42)::validate_utf8().
If you are only processing short byte sequences (less than 64 bytes), the excellent scalar algorithm in the standard library is likely faster. If there is no native implementation for your platform (yet), use the standard library instead. Also, this library uses unsafe code which has not been battle-tested and should not (yet) be used in production.
The benchmarks were run with criterion; the tables were created with critcmp. Source code and data are in the bench directory.
The name schema is id-charset/size. 0-empty is the empty byte slice, x-error/66536 is a 64KiB slice where the very first character is invalid UTF-8. All benchmarks were run on a laptop with an Intel Core i7-10750H CPU (Comet Lake) on Windows with Rust 1.51.0.
simdutf8 performs better except for inputs ≤ 64 bytes.
simdutf8 is faster than simdjson except for some crazy optimization by clang for the pure ASCII loop (to be investigated). simdjson is compiled using clang and gcc from MSYS.
There is a small performance penalty to continuously checking the error status while processing data, but detecting errors early provides a huge benefit for the x-error/66536 benchmark.
The implementation is similar to the one in simdjson except that it aligns reads to the block size of the SIMD extension, which leads to better peak performance compared to the implementation in simdjson. This alignment means that an incomplete block needs to be processed before the aligned data is read, which would lead to worse performance on short byte sequences. Thus, aligned reads are only used with 2048 bytes of data or more. Incomplete reads for the first unaligned and the last incomplete block are done in two aligned 64-byte buffers.
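To make the alignment strategy more concrete, the sketch below splits an input into an unaligned head, a block-aligned body, and an incomplete tail. It is an illustration only: split_for_aligned_reads is a hypothetical helper, the block size is passed in as a parameter, and the real implementation copies the head and tail into aligned 64-byte buffers rather than returning slices.

```rust
/// Hypothetical helper: splits `data` into (head, body, tail), where `body`
/// consists of whole blocks and, for large inputs, starts at an address
/// aligned to `block` bytes.
fn split_for_aligned_reads(data: &[u8], block: usize) -> (&[u8], &[u8], &[u8]) {
    // Threshold taken from the description above.
    const ALIGNED_READ_THRESHOLD: usize = 2048;
    assert!(block.is_power_of_two());

    // Only large inputs pay for the extra partial block at the start.
    let head_len = if data.len() >= ALIGNED_READ_THRESHOLD {
        // align_offset may return usize::MAX if alignment is impossible,
        // so clamp it to the input length.
        data.as_ptr().align_offset(block).min(data.len())
    } else {
        0
    };
    let (head, rest) = data.split_at(head_len);

    // Whole blocks go into the body, the remainder into the tail.
    let body_len = rest.len() - rest.len() % block;
    let (body, tail) = rest.split_at(body_len);
    (head, body, tail)
}
```

For inputs below the threshold the head is empty, so the body is read unaligned from the start of the slice.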
For the compat API we need to check the error buffer on each 64-byte block instead of just aggregating it. If an error is found, the last bytes of the previous block are checked for a cross-block continuation and then std::str::from_utf8() is run to find the exact location of the error.
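The error-location step might look roughly like the sketch below. It is a simplified reading of the description above, not the crate's code: locate_error is a hypothetical function, and it assumes that everything before the failing block is valid apart from a possibly unfinished sequence at the very end.

```rust
/// Hypothetical helper: given the start offset of the block in which the
/// SIMD pass flagged an error, returns (valid_up_to, error_len) relative to
/// the whole input, or None if no error is found.
fn locate_error(data: &[u8], block_start: usize) -> Option<(usize, Option<usize>)> {
    // A sequence may have started in the previous block. Step back over up
    // to three bytes to find a byte that is not a continuation byte
    // (0b10xxxxxx), i.e. an ASCII or lead byte.
    let mut start = block_start;
    for back in 1..=3 {
        if back > block_start {
            break;
        }
        let idx = block_start - back;
        if data[idx] & 0xC0 != 0x80 {
            start = idx;
            break;
        }
    }
    // Let the scalar implementation pinpoint the error from there and
    // translate its offset back to the full input.
    match std::str::from_utf8(&data[start..]) {
        Ok(_) => None,
        Err(err) => Some((start + err.valid_up_to(), err.error_len())),
    }
}
```

If all three preceding bytes are continuation bytes, they can only belong to a complete four-byte character under the stated assumption, so the scan starts at the block boundary itself.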
Care is taken that all functions are properly inlined up to the public interface.
- Thanks to the authors of simdjson for coming up with the high-performance SIMD implementation.
- Thanks to the authors of the simdjson Rust port who did most of the heavy lifting of porting the C++ code to Rust.
This code is made available under the Apache License 2.0.
It is based on code distributed with simd-json.rs, the Rust port of simdjson, which is dual-licensed under the MIT license and Apache 2.0 license.
simdjson itself is distributed under the Apache License 2.0.
- John Keiser, Daniel Lemire, Validating UTF-8 In Less Than One Instruction Per Byte, Software: Practice and Experience 51 (5), 2021