ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖

Home Page: https://ashvardanian.com/posts/stringzilla/

Add search/split iterators for Python

ashvardanian opened this issue

In C++ we have special smart iterators for bulk search and split operations. They lazily report the matches, avoiding heap allocations for the array of match offsets.

For that, an arbitrary matcher (a string, a character, or a character set, applied in forward or reverse order) is combined with a search or split range. Similar functionality should be added in Python, where we currently materialize the matches into a "compressed" Strs object.
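
To make the idea concrete, here is a minimal pure-Python sketch of lazy splitting. It is only an illustration of the behavior being requested, not StringZilla's implementation or API: the name `lazy_split` is hypothetical, and it relies on the built-in `str.find` rather than any accelerated matcher.

```python
from typing import Iterator


def lazy_split(text: str, separator: str) -> Iterator[str]:
    """Yield the pieces of `text` between occurrences of `separator`, one at a time.

    Hypothetical helper for illustration only; no list of offsets is ever built.
    """
    if not separator:
        raise ValueError("empty separator")
    start = 0
    while True:
        end = text.find(separator, start)
        if end == -1:
            # No more separators: the remainder is the last piece.
            yield text[start:]
            return
        yield text[start:end]
        start = end + len(separator)


# Pieces are produced on demand, so a caller can stop early without
# paying for the rest of the input.
for piece in lazy_split("a,b,,c", ","):
    print(repr(piece))
```

Because the generator holds only the current offset, memory use stays constant regardless of how many matches the input contains.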

I'm very interested in contributing to this project as my first step into open-source. I believe I could start by addressing this issue. To clarify, are we aiming to replace the Strs type with something like StrIterator that yields strings lazily? As a first step, should I focus on modifying the split function to eliminate the use of realloc and ensure it returns an iterator instead? Any guidance on this would be greatly appreciated.

Hi @ghazariann! I don't think we should replace Strs; we should keep both. The split should return an iterator, which is converted to Strs if materialized. How does that sound?
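
One way to picture that "keep both" design is the rough sketch below. All names here (`SplitIterator`, `materialize`) are hypothetical, and a plain Python `list` stands in for Strs; the real binding would presumably live in the C extension and look different.

```python
from typing import List


class SplitIterator:
    """Hypothetical lazy splitter; a plain list stands in for Strs."""

    def __init__(self, text: str, separator: str) -> None:
        if not separator:
            raise ValueError("empty separator")
        self._text = text
        self._separator = separator
        self._start = 0
        self._done = False

    def __iter__(self) -> "SplitIterator":
        return self

    def __next__(self) -> str:
        if self._done:
            raise StopIteration
        end = self._text.find(self._separator, self._start)
        if end == -1:
            # Last piece: everything after the final separator.
            piece = self._text[self._start:]
            self._done = True
        else:
            piece = self._text[self._start:end]
            self._start = end + len(self._separator)
        return piece

    def materialize(self) -> List[str]:
        """Collect all remaining pieces at once (stand-in for building Strs)."""
        return list(self)


it = SplitIterator("a b c", " ")
first = next(it)          # consumed lazily
rest = it.materialize()   # the remainder, collected in one step
```

In this sketch, materializing after partial iteration collects only the pieces not yet consumed; whether the real API should behave that way is an open design question.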