ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖

Home Page: https://ashvardanian.com/posts/stringzilla/

Case-insensitive Unicode manipulation

ashvardanian opened this issue

Python strings offer a lot of powerful methods, such as:

  • isalnum, isalpha, isascii, isdecimal, isdigit, isspace, islower, isupper, istitle, isnumeric for checks.
  • lower and upper, which return a converted copy of the string.
  • casefold, which implements the case folding described in section 3.13 of the Unicode Standard.
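As a quick illustration of why these methods are not interchangeable, here is a minimal sketch using only Python's built-in `str` methods. The helper name `caseless_equal` is hypothetical and introduced only for this example; it is not part of StringZilla.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using Unicode case folding."""
    return a.casefold() == b.casefold()


# lower() keeps the German sharp s, while casefold() expands it to "ss".
print("Straße".lower())      # straße
print("Straße".casefold())   # strasse

# upper() can change the number of code points.
print("Straße".upper())      # STRASSE

# Comparing lowercased strings misses a match that case folding finds.
print("Straße".lower() == "STRASSE".lower())   # False
print(caseless_equal("Straße", "STRASSE"))     # True

# The numeric checks differ in how much of Unicode they accept.
print("5".isdecimal(), "5".isdigit(), "5".isnumeric())   # True True True
print("²".isdecimal(), "²".isdigit(), "²".isnumeric())   # False True True
print("½".isdecimal(), "½".isdigit(), "½".isnumeric())   # False False True
```

The sketch only demonstrates the behavior a C-level implementation would have to reproduce; it says nothing about how such an implementation should be structured.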

Very few C-level libraries provide this functionality, and most of those that do are not known for speed. Covering a subset of it in StringZilla makes sense.

Starting with v3, part of this functionality is already available for ASCII strings. Implementing the same for UTF-8 would involve preparing huge lookup tables and potentially designing a SIMD-friendly trie or automaton, so we are not rushing those features for now.
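The gap between the ASCII and UTF-8 cases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not StringZilla's implementation: `ascii_lower` and `folded_utf8_sizes` are hypothetical helpers written for this example.

```python
def ascii_lower(data: bytes) -> bytes:
    """Lowercase only ASCII letters; every other byte passes through unchanged."""
    out = bytearray(data)
    for i, byte in enumerate(out):
        if 0x41 <= byte <= 0x5A:  # 'A'..'Z'
            out[i] = byte | 0x20  # setting bit 5 maps 'A'..'Z' to 'a'..'z'
    return bytes(out)


def folded_utf8_sizes(ch: str) -> tuple[int, int]:
    """Return the UTF-8 byte length of `ch` before and after case folding."""
    return len(ch.encode("utf-8")), len(ch.casefold().encode("utf-8"))


# ASCII: a range check and a single bit flip, with no change in length.
print(ascii_lower(b"Hello, WORLD!"))  # b'hello, world!'

# Unicode: folding depends on per-code-point mappings and may shrink
# or grow the UTF-8 encoding.
print(folded_utf8_sizes("\u212A"))  # KELVIN SIGN -> "k": (3, 1)
print(folded_utf8_sizes("\u0130"))  # LATIN CAPITAL LETTER I WITH DOT ABOVE: (2, 3)
print(folded_utf8_sizes("\uFB01"))  # LATIN SMALL LIGATURE FI -> "fi": (3, 2)
```

The ASCII path is a fixed-width, branch-light transformation that maps naturally onto SIMD or SWAR, while the Unicode path needs table lookups and an output buffer whose size is not known in advance.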