Case-insensitive Unicode manipulation
ashvardanian opened this issue · comments
Python strings offer a lot of powerful methods, such as:
- `isalnum`, `isalpha`, `isascii`, `isdecimal`, `isdigit`, `isspace`, `islower`, `isupper`, `istitle`, `isnumeric` for checks.
- `lower` and `upper` that copy the string.
- `casefold`, described in section 3.13 of the Unicode Standard.
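For reference, here is a minimal sketch of how these methods behave in CPython itself. It uses only built-in `str` methods; the helper names `equal_ignoring_case` and `classify` are made up for illustration and are not part of StringZilla.

```python
# Built-in Python behavior only; nothing here is StringZilla API.

def equal_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison via full case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()

def classify(ch: str) -> dict:
    """Report the three numeric checks for a single character (hypothetical helper)."""
    return {
        "isdecimal": ch.isdecimal(),
        "isdigit": ch.isdigit(),
        "isnumeric": ch.isnumeric(),
    }

samples = ["Straße", "STRASSE", "strasse"]

# lower() keeps the sharp s, so the first sample still differs from the others.
lowered = [s.lower() for s in samples]

# casefold() expands the sharp s to "ss", so all three samples become equal.
folded = [s.casefold() for s in samples]

print(lowered)
print(folded)
print(equal_ignoring_case("Straße", "STRASSE"))
print(classify("7"), classify("²"), classify("½"))
```

The numeric checks are nested: every decimal character is a digit, and every digit is numeric, but superscripts such as `²` are digits without being decimals, and fractions such as `½` are numeric only.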
Very few C-level libraries provide such functionality, and most of them are not known for their speed. Covering a subset of that functionality in StringZilla makes sense.
Starting with v3, part of this functionality is already available for ASCII strings. Implementing the same for UTF-8 would involve preparing huge lookup tables and potentially designing a SIMD-friendly trie or automaton. So we are not rushing those features for now.
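To illustrate the gap between the two cases, here is a rough Python sketch, not StringZilla code. The ASCII path fits in a 256-entry byte table and never changes the length of the input, while Unicode case mapping can change the encoded length, so it cannot be done as an in-place byte substitution. The names `ASCII_LOWER`, `ascii_lower`, and `utf8_growth` are hypothetical.

```python
from typing import Tuple

# 256-entry table: map 'A'..'Z' (65..90) to 'a'..'z', leave every other byte alone.
ASCII_LOWER = bytes(b + 32 if 65 <= b <= 90 else b for b in range(256))

def ascii_lower(data: bytes) -> bytes:
    """ASCII-only lowering: one table lookup per byte, same output length."""
    return data.translate(ASCII_LOWER)

def utf8_growth(text: str) -> Tuple[int, int]:
    """UTF-8 byte length before and after Python's full Unicode lower()."""
    return len(text.encode("utf-8")), len(text.lower().encode("utf-8"))

print(ascii_lower(b"Hello, WORLD"))
# Non-ASCII bytes pass through untouched, so multi-byte sequences stay valid.
print(ascii_lower("Straße".encode("utf-8")).decode("utf-8"))
# U+0130 and U+023A both take more UTF-8 bytes once lowercased.
print(utf8_growth("\u0130"), utf8_growth("\u023A"))
```

Because the output may be longer than the input, a UTF-8 implementation needs either a separate output buffer or a sizing pass before writing, on top of the mapping tables themselves.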