lemire / fastvalidate-utf-8

Header-only library to validate UTF-8 strings at high speeds (using SIMD instructions)

fastvalidate-utf-8

Most strings online are Unicode text encoded as UTF-8. Validating such strings quickly before accepting them is important.

NOTE: The fastvalidate-utf-8 library is obsolete as of 2022: please adopt the simdutf library. It is much more powerful, faster and better tested.

Want a production-ready function?

The fastvalidate-utf-8 repository is for demonstration purposes.

If you want access to a fast validation function for production use, you can rely on the simdutf library. It is as simple as the following:

#include <cstdlib>   // EXIT_FAILURE
#include <iostream>  // std::cout, std::cerr

#include "simdutf.cpp"
#include "simdutf.h"

int main(int argc, char *argv[]) {
  const char *source = "1234";
  // 4 == strlen(source)
  bool validutf8 = simdutf::validate_utf8(source, 4);
  if (validutf8) {
    std::cout << "valid UTF-8" << std::endl;
  } else {
    std::cerr << "invalid UTF-8" << std::endl;
    return EXIT_FAILURE;
  }
}

See https://github.com/simdutf/

The simdutf library supports a wide range of platforms and offers runtime dispatching as well as the most up-to-date algorithms.

Reference

How to use fastvalidate-utf-8?

This is a header-only C library to validate UTF-8 strings at high speeds using SIMD instructions. Specifically, it expects an x64 processor capable of SSE instructions. It does not currently work on ARM processors. It is not meant to be used in production as-is. Please see the simdjson library and its corresponding simdjson::validate_utf8 function.

Quick usage:

make
./unit
./benchmark

Code usage:

  #include "simdutf8check.h"

  char * mystring = ...
  bool is_it_valid = validate_utf8_fast(mystring, thestringlength);

It should be able to validate strings using less than 1 cycle per input byte.
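
To make concrete what "validating" checks, here is a minimal scalar sketch in C. It is not the library's SIMD code and makes no claim to its speed. The function name validate_utf8_scalar_sketch is hypothetical and does not appear in this repository. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above U+10FFFF.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar reference validator (illustration only, not part of
   fastvalidate-utf-8). Returns true if src[0..len) is well-formed UTF-8. */
bool validate_utf8_scalar_sketch(const char *src, size_t len) {
  const uint8_t *p = (const uint8_t *)src;
  size_t i = 0;
  while (i < len) {
    uint8_t b = p[i];
    size_t n;     /* number of continuation bytes expected */
    uint32_t cp;  /* code point being decoded */
    uint32_t min; /* smallest code point allowed for this sequence length */
    if (b < 0x80) { /* ASCII byte */
      i++;
      continue;
    } else if ((b & 0xE0) == 0xC0) {
      n = 1; cp = b & 0x1F; min = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
      n = 2; cp = b & 0x0F; min = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
      n = 3; cp = b & 0x07; min = 0x10000;
    } else {
      return false; /* stray continuation byte or invalid leading byte */
    }
    if (len - i <= n) return false; /* sequence is truncated */
    for (size_t k = 1; k <= n; k++) {
      uint8_t c = p[i + k];
      if ((c & 0xC0) != 0x80) return false; /* not a continuation byte */
      cp = (cp << 6) | (uint32_t)(c & 0x3F);
    }
    if (cp < min) return false;                     /* overlong encoding */
    if (cp > 0x10FFFF) return false;                /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false; /* surrogate */
    i += n + 1;
  }
  return true;
}
```

A byte-at-a-time routine like this can serve as a reference when testing a faster validator: both should agree on every input.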

If you expect your strings to be plain ASCII, you can spend less than 0.1 cycles per input byte to check whether that is the case using the validate_ascii_fast function found in the simdasciicheck.h header. There are even faster functions like validate_utf8_fast_avx.
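
The ASCII check is cheap because a string is plain ASCII exactly when no byte has its most significant bit set. The sketch below shows that idea without SIMD intrinsics by OR-ing eight bytes at a time into an accumulator. The name is_ascii_sketch is hypothetical, and this is not claimed to match the code in simdasciicheck.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration (not the library's implementation): returns true
   if every byte in src[0..len) is below 0x80. */
bool is_ascii_sketch(const char *src, size_t len) {
  uint64_t acc = 0;
  size_t i = 0;
  /* OR together 8 bytes at a time; memcpy avoids unaligned-access issues. */
  for (; i + 8 <= len; i += 8) {
    uint64_t w;
    memcpy(&w, src + i, sizeof w);
    acc |= w;
  }
  /* Handle the remaining 0 to 7 bytes one at a time. */
  uint8_t tail = 0;
  for (; i < len; i++) {
    tail |= (uint8_t)src[i];
  }
  /* Any set high bit means a non-ASCII byte was seen somewhere. */
  return (acc & UINT64_C(0x8080808080808080)) == 0 && (tail & 0x80) == 0;
}
```

Because the loop only accumulates and tests once at the end, it has no data-dependent branches, which is one reason checks of this kind can run at a small fraction of a cycle per byte.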

A modified version of this code improved the performance of Scylla.

Command-line tool

Adam Retter maintains a useful command-line tool related to this library.

Experimental results

On a Skylake processor, using GCC, we get:

$ ./benchmark
string size = 65536
We are feeding ascii so it is always going to be ok.
It favors schemes that skip ASCII characters.
validate_utf8(data, N)                                          :  1.256 cycles per operation (best)     1.316 cycles per operation (avg)
validate_utf8_fast(data, N)                                     :  0.704 cycles per operation (best)     0.706 cycles per operation (avg)
validate_utf8_fast_avx(data, N)                                 :  0.450 cycles per operation (best)     0.452 cycles per operation (avg)
validate_utf8_fast_avx_asciipath(data, N)                       :  0.088 cycles per operation (best)     0.091 cycles per operation (avg)
validate_ascii_fast(data, N)                                    :  0.082 cycles per operation (best)     0.084 cycles per operation (avg)
validate_ascii_fast_avx(data, N)                                :  0.050 cycles per operation (best)     0.074 cycles per operation (avg)
validate_ascii_nosimd(data, N)                                  :  0.104 cycles per operation (best)     0.106 cycles per operation (avg)
validate_ascii_nointrin(data, N)                                :  0.068 cycles per operation (best)     0.088 cycles per operation (avg)
validate_utf8_fast(data, N)                                      :  0.701 cycles per operation (best)     0.703 cycles per operation (avg)  (linux counter)
validate_ascii_fast(data, N)                                     :  0.083 cycles per operation (best)     0.085 cycles per operation (avg)  (linux counter)


string size (approx) = 65536
Producing random-looking UTF-8
validate_utf8(data, actualN)                                    :  10.967 cycles per operation (best)     11.005 cycles per operation (avg)
validate_utf8_fast(data, actualN)                               :  0.702 cycles per operation (best)     0.705 cycles per operation (avg)
validate_utf8_fast_avx(data, actualN)                           :  0.448 cycles per operation (best)     0.485 cycles per operation (avg)
validate_utf8_fast_avx_asciipath(data, actualN)                 :  0.480 cycles per operation (best)     0.594 cycles per operation (avg)

Thus, after rounding, it takes 0.7 cycles per input byte to validate UTF-8 strings.
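
To translate cycles per byte into throughput, divide the clock frequency by the cost per byte. The helper below is a hypothetical illustration, and the 3 GHz clock used in the note after it is an assumption, not a figure from the benchmark above.

```c
/* Hypothetical helper: converts a cost in cycles per byte into gigabytes
   per second (1 GB = 1e9 bytes) for a given clock frequency in Hz. */
double gigabytes_per_second(double cycles_per_byte, double clock_hz) {
  return clock_hz / cycles_per_byte / 1e9;
}
```

Under the assumed 3 GHz clock, gigabytes_per_second(0.7, 3.0e9) gives roughly 4.3 GB/s, whereas the 10.967 cycles per byte measured for validate_utf8 on random-looking UTF-8 corresponds to roughly 0.27 GB/s.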

In Go

There is an assembly wrapper in Go by Stuart Carnie.

ARM Neon and SSE4

For ARM NEON and SSE4, see the separate project "Fast UTF-8 validation with range algorithm (NEON+SSE4)".

License

This library is distributed under the terms of any of the following licenses, at your option:

License: Apache License 2.0
