rapidfuzz / strsim-rs

:abc: Rust implementations of string similarity metrics

Home Page: https://crates.io/crates/strsim

Add a `levenshtein_limit` function

tgross35 opened this issue

Often when checking whether two strings are similar, they turn out to be more different than you care to measure. In these cases it can be a significant performance boost to stop calculating the Levenshtein distance once it exceeds a specific limit (rather than calculating the complete distance and then clamping).

Would it be possible to add this feature to this crate?

I have a significantly less popular crate with this feature, and I would be okay with the implementation being stolen :) https://docs.rs/stringmetrics/latest/stringmetrics/#limited-levenshtein-algorithm / https://docs.rs/stringmetrics/latest/stringmetrics/fn.try_levenshtein.html. (A missed optimization there is that you don't actually need the full-length vector if a limit is provided. The limit could even be a const generic.)
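For illustration, the idea could look roughly like the sketch below. It is a plain single-row Wagner-Fischer loop, not the code from stringmetrics or strsim; the name `levenshtein_limit` and the `Option<usize>` return type are assumptions taken from the issue title. It gives up as soon as every cell in the current row exceeds the limit, since later rows can only stay equal or grow.

```rust
/// Hypothetical helper (not part of strsim or stringmetrics):
/// returns the Levenshtein distance, or `None` if it exceeds `limit`.
pub fn levenshtein_limit(a: &str, b: &str, limit: usize) -> Option<usize> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // The length difference is a lower bound on the distance.
    if a.len().abs_diff(b.len()) > limit {
        return None;
    }

    // Single row of the DP matrix: row[j] = distance(a[..i], b[..j]).
    let mut row: Vec<usize> = (0..=b.len()).collect();

    for (i, &ca) in a.iter().enumerate() {
        let mut prev_diag = row[0]; // value from the previous row, column j
        row[0] = i + 1;
        let mut row_min = row[0];

        for (j, &cb) in b.iter().enumerate() {
            let cost = usize::from(ca != cb);
            let val = (prev_diag + cost) // substitution or match
                .min(row[j] + 1) // insertion
                .min(row[j + 1] + 1); // deletion
            prev_diag = row[j + 1];
            row[j + 1] = val;
            row_min = row_min.min(val);
        }

        // Row minima never decrease, so the final distance is at least row_min.
        if row_min > limit {
            return None;
        }
    }

    let dist = row[b.len()];
    (dist <= limit).then_some(dist)
}
```

A caller that only cares about "close enough" matches could then write something like `levenshtein_limit(input, candidate, 2).is_some()` and skip the rest of the matrix for clearly different strings.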

strsim-rs is intended for people who care about binary size, but not that much about performance. For this reason I do not think this will be added here.

However, I have a performance-focused implementation in https://github.com/rapidfuzz/rapidfuzz-rs which includes a significant number of optimizations:

  • uses a bit-parallel implementation which allows calculating the distance for 64 characters in parallel (Myers / Hyyrö)
  • removes the common prefix and suffix before calculating the similarity
  • uses the limit to exit early (e.g. using length differences)
  • uses the limit to calculate only parts of the Levenshtein matrix (Ukkonen's optimization)
  • uses different implementations for limits < 4, texts < 64 characters, Ukkonen bands < 32 characters, and longer texts, for optimal performance
  • for the weighted version, uses the bit-parallel algorithms for the uniform Levenshtein distance and the Indel distance, and falls back to Wagner-Fischer for other weights
  • provides an optimized version for cases where you compare one text to multiple texts, which is able to cache some constant parts of the calculation
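As a rough illustration of the first two points, here is a minimal sketch of common prefix/suffix removal followed by a bit-parallel distance computation for patterns of at most 64 bytes. It is not the rapidfuzz-rs code; the function names are hypothetical, it works on raw bytes rather than Unicode characters, and it simply returns `None` for longer patterns instead of switching to a blockwise variant.

```rust
/// Hypothetical helper: drops the shared prefix and suffix of two byte strings.
fn strip_common_affix<'a>(a: &'a [u8], b: &'a [u8]) -> (&'a [u8], &'a [u8]) {
    let prefix = a.iter().zip(b).take_while(|(x, y)| x == y).count();
    let (a, b) = (&a[prefix..], &b[prefix..]);
    let suffix = a
        .iter()
        .rev()
        .zip(b.iter().rev())
        .take_while(|(x, y)| x == y)
        .count();
    (&a[..a.len() - suffix], &b[..b.len() - suffix])
}

/// Hypothetical sketch of a bit-parallel Levenshtein distance.
/// Returns `None` if the stripped pattern `a` is longer than 64 bytes,
/// which this sketch does not handle.
pub fn levenshtein_bitparallel(a: &[u8], b: &[u8]) -> Option<usize> {
    let (a, b) = strip_common_affix(a, b);
    if a.is_empty() {
        return Some(b.len());
    }
    if a.len() > 64 {
        return None;
    }

    // peq[c] has bit i set when a[i] == c.
    let mut peq = [0u64; 256];
    for (i, &c) in a.iter().enumerate() {
        peq[c as usize] |= 1u64 << i;
    }

    // Bit of the last pattern row; its horizontal delta updates the distance.
    let mask = 1u64 << (a.len() - 1);
    // vp / vn encode +1 / -1 vertical deltas of the current matrix column.
    let (mut vp, mut vn) = (!0u64, 0u64);
    let mut dist = a.len();

    for &c in b {
        let x = peq[c as usize];
        let d0 = ((x & vp).wrapping_add(vp) ^ vp) | x | vn;
        let hp = vn | !(d0 | vp);
        let hn = d0 & vp;
        if hp & mask != 0 {
            dist += 1;
        }
        if hn & mask != 0 {
            dist -= 1;
        }
        let hp = (hp << 1) | 1;
        let hn = hn << 1;
        vp = hn | !(d0 | hp);
        vn = hp & d0;
    }
    Some(dist)
}
```

Each input byte of `b` is processed with a constant number of word operations, which is where the "64 characters in parallel" comes from; the affix stripping just shrinks the problem before that loop runs.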

The C++ version includes further things not implemented in Rust so far:

  • backtracing edit operations using a combination of the bit-parallel implementation and Hirschberg's algorithm, to calculate them fast while keeping memory usage <= 1 MB
  • a SIMD implementation which allows comparing multiple short texts in parallel

https://github.com/rapidfuzz already mentions the different goals of the projects:

  • rapidfuzz-rs is intended for users who care about performance
  • strsim-rs is intended for users who care more about binary size (it still includes a lot of optimizations as long as they don't affect binary size). This is useful e.g. for CLI tools like clap which only use the text matching to provide CLI recommendations

The READMEs of both projects should be updated to tell users about the difference and guide them to the one better suited for their use case.

Easy enough, thanks for the reply!