Segmentation fault in 2.11.0 from cdist with empty input

peterjc opened this issue 2 years ago · comments

Seeing this on Windows, Linux, macOS, appears to affect both PyPI wheels and conda-forge packages.

My tool's test suite passes on 2.10.0 but hits a segmentation fault after updating to 2.11.0, most likely due to the cdist SIMD implementation work.

Minimal test case:

$ python -c "from rapidfuzz.process import cdist; from rapidfuzz.distance import Levenshtein; cdist([], ['a', 'b'], scorer=Levenshtein.distance)"

Assuming it is only triggered by an empty first argument, an end-user workaround ought not to be too painful - and fingers crossed it is also easy to fix here?
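
For anyone stuck on 2.11.0, the workaround could look something like the sketch below. It assumes the crash really is limited to an empty first argument, which is only a guess at this point. `guarded_cdist` is a hypothetical helper name, not part of RapidFuzz, and it takes the cdist function as a parameter so it can be tried out without the library installed. It returns a plain empty list for the empty case; the real cdist returns a NumPy array, so adjust that if the caller depends on the array type.

```python
def guarded_cdist(cdist_func, queries, choices, **kwargs):
    """Hypothetical wrapper: skip the native call when `queries` is empty.

    `cdist_func` is expected to be rapidfuzz.process.cdist (passed in by
    the caller); any keyword arguments are forwarded to it unchanged.
    """
    queries = list(queries)
    if not queries:
        # Assumption: an empty list is an acceptable stand-in for
        # "no scores". The real cdist returns a NumPy array instead.
        return []
    return cdist_func(queries, choices, **kwargs)


# Intended usage (needs rapidfuzz installed):
#   from rapidfuzz.process import cdist
#   from rapidfuzz.distance import Levenshtein
#   guarded_cdist(cdist, [], ["a", "b"], scorer=Levenshtein.distance)
```

The guard runs before any native code is reached, so the empty case never touches the affected code path.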

Max Bachmann · Answer 1 · Tue Oct 04 2022 07:39:32 GMT+0800 (China Standard Time)

Thanks for reporting yet another issue 😅
This is caused by:
https://github.com/maxbachmann/RapidFuzz/blob/3ce69d8deced9ab96d4dcbf8a071b48f9742d8f4/src/rapidfuzz/process_cpp.hpp#L507

I fixed this in 017bc6e and released v2.11.1 with the fix.

Max Bachmann · Answer 2 · Tue Oct 04 2022 07:56:43 GMT+0800 (China Standard Time)

btw you seem to catch issues in rapidfuzz with your test suite quite often. Do you have any tests that you think would make sense to integrate into rapidfuzz? I am always looking to expand my test suite.

Peter Cock · Answer 3 · Tue Oct 04 2022 18:46:47 GMT+0800 (China Standard Time)

Thanks! No other tests to suggest for now.

Sadly, for now my tool doesn't have unit tests at the level of the Python functions, just high-level expected output files for given input files. Since the last round of refactoring I think I'm only using cdist with the Levenshtein metric on batches of sequences (and a range of small cutoff scores), but previously I was calling the distance function on pairs of sequences.

The sequences are mostly the letters A, C, G and T (in upper case) for DNA, but with the occasional other letter used as an ambiguity marker, i.e. this is a small subset of the potential characters people will be passing to RapidFuzz.

When I've run into issues with RapidFuzz, I've tried to include minimal test cases (like here) for your potential inclusion.

Max Bachmann · Answer 4 · Tue Oct 04 2022 19:04:49 GMT+0800 (China Standard Time)

Your minimal test cases are hugely helpful 👍