ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging SWAR and SIMD on Arm Neon and x86 AVX2 & AVX-512-capable chips to accelerate search, sort, edit distances, alignment scores, etc 🦖

Home Page: https://ashvardanian.com/posts/stringzilla/

Add RegEx support

ashvardanian opened this issue · comments

If you plan to add support for regexes, I strongly suggest using Hyperscan/Vectorscan as a backend. Note, however, that pure Hyperscan has limitations -- e.g., no capturing group support -- but there is an alternative implementation shipped alongside it, called Chimera, that offers full PCRE-style regex support.

Like your library, HS does a lot of really cool stuff with SIMD to massively speed things up.

I'm very familiar with HS and have been using it for years, but the core idea of this library is to be as minimalistic as possible and to work on "all" platforms. The HS implementation is quite long and complex, so I was considering a simpler hack to solve common RegEx tasks without the need for a DFA or advanced SIMD.

Take a look at TRE as well; it's NFA-based and much more compact than HS.

https://github.com/laurikari/tre/

It's used in the magic library (what powers the file command).

Oh, and you might also consider supporting glob syntax. Full-bore regexes are insane, but I could see a very high-performance glob matching implementation being super useful in many cases.
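
To make the glob idea concrete, here is a minimal sketch of a matcher for just `*` (any run of characters) and `?` (any single character). It is only an illustration: the name `glob_match` is hypothetical and not part of StringZilla, character classes like `[a-z]` are left out, and nothing here is SIMD-accelerated. It uses the common single-backtrack-point trick, so no DFA or recursion is needed.

```python
def glob_match(pattern: str, text: str) -> bool:
    """Return True if `text` matches `pattern`, where `*` matches any
    (possibly empty) run of characters and `?` matches exactly one.
    Hypothetical helper for illustration; not a StringZilla API."""
    p = t = 0          # current positions in pattern and text
    star_p = -1        # position of the most recent `*` in pattern, if any
    star_t = 0         # text position where that `*` currently stops matching
    while t < len(text):
        if p < len(pattern) and pattern[p] == "*":
            # Remember the star and first try matching it against nothing.
            star_p, star_t = p, t
            p += 1
        elif p < len(pattern) and (pattern[p] == "?" or pattern[p] == text[t]):
            # Literal or single-character wildcard match: advance both.
            p += 1
            t += 1
        elif star_p != -1:
            # Mismatch: let the last `*` absorb one more character and retry.
            star_t += 1
            t = star_t
            p = star_p + 1
        else:
            return False
    # Text is exhausted; only trailing stars may remain in the pattern.
    while p < len(pattern) and pattern[p] == "*":
        p += 1
    return p == len(pattern)
```

For example, `glob_match("*.txt", "notes.txt")` succeeds while `glob_match("*.txt", "notes.md")` does not. The literal comparisons between wildcards are the part a fast substring search could take over in a real implementation.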

@dmbaggett, that’s a great suggestion! Will do!