rapidfuzz / RapidFuzz

Rapid fuzzy string matching in Python using various string metrics

Home Page: https://rapidfuzz.github.io/RapidFuzz/

Segmentation fault in 2.11.0 from cdist with empty input

peterjc opened this issue · comments

I'm seeing this on Windows, Linux and macOS; it appears to affect both the PyPI wheels and the conda-forge packages.

My tool's test suite passes on 2.10.0 but hits a segmentation fault after updating to 2.11.0, most likely due to the cdist SIMD implementation work.

Minimal test case:

$ python -c "from rapidfuzz.process import cdist; from rapidfuzz.distance import Levenshtein; cdist([], ['a', 'b'], scorer=Levenshtein.distance)"

Assuming it is just triggered by an empty first argument, an end-user workaround ought not to be too painful, and fingers crossed it is also easy to fix here?
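For anyone stuck on 2.11.0, the workaround mentioned above could look roughly like the sketch below. It assumes, as this report does, that only an empty first argument triggers the crash. `cdist_or_empty` is a hypothetical helper name, and returning a plain empty list (rather than the array type the real cdist returns) is a simplification for illustration.

```python
def cdist_or_empty(cdist_func, queries, choices, **kwargs):
    """Call cdist_func(queries, choices, **kwargs) unless queries is empty.

    Hypothetical guard: with no queries there are no pairs to score,
    so the call into the library is skipped entirely.
    """
    if len(queries) == 0:
        # Assumption: callers can cope with an empty list here instead of
        # the (empty) matrix the real cdist would normally return.
        return []
    return cdist_func(queries, choices, **kwargs)


# Intended use with RapidFuzz (not executed here):
#   from rapidfuzz.process import cdist
#   from rapidfuzz.distance import Levenshtein
#   cdist_or_empty(cdist, [], ['a', 'b'], scorer=Levenshtein.distance)
```

Passing the cdist function in as an argument keeps the guard independent of the installed RapidFuzz version, so it can be dropped once the tool requires a fixed release.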

Thanks for reporting yet another issue 😅
This is caused by:
https://github.com/maxbachmann/RapidFuzz/blob/3ce69d8deced9ab96d4dcbf8a071b48f9742d8f4/src/rapidfuzz/process_cpp.hpp#L507

I fixed this in 017bc6e and released v2.11.1 with the fix.

By the way, you seem to catch issues in rapidfuzz with your test suite quite often. Do you have any tests that you think would make sense to integrate into rapidfuzz? I am always looking to expand my test suite.

Thanks! No other tests to suggest for now.

Sadly, for now my tool doesn't have unit tests at the level of the Python functions, just high-level expected output files for given input files. Since the last round of refactoring I think I'm only using cdist with the Levenshtein metric on batches of sequences (and a range of small cutoff scores), but previously I was calling the distance function on pairs of sequences.

The sequences are mostly the letters A, C, G and T (in upper case) for DNA, with the occasional other letter used as an ambiguity marker. In other words, this is a small subset of the potential characters people will be passing to RapidFuzz.
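To make that usage concrete, here is a slow pure-Python sketch of the pairwise matrix that a cdist-style call with a Levenshtein scorer works out for a batch of such sequences. It is only a reference for cross-checking results, not RapidFuzz's implementation; `levenshtein` and `pairwise_distances` are hypothetical helper names, and score cutoffs are left out.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance with unit costs."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def pairwise_distances(queries, choices):
    """One row per query, one column per choice."""
    return [[levenshtein(q, c) for c in choices] for q in queries]


# An ambiguity marker such as N is treated like any other character.
matrix = pairwise_distances(["ACGT", "ACNT"], ["ACGT", "AGT"])
```

With an empty list of queries this reference simply returns an empty list, which is the case that crashed in 2.11.0.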

When I've run into issues with RapidFuzz, I've tried to include minimal test cases (like here) for your potential inclusion.

Your minimal test cases are hugely helpful 👍