rapidfuzz / strsim-rs

:abc: Rust implementations of string similarity metrics

Home Page: https://crates.io/crates/strsim


please reconsider the addition of ndarray and hashbrown as dependencies

BurntSushi opened this issue

I left a comment here on the commit in which these deps were introduced, but I'm not sure if it was seen: d6717db#r33186236

So I'd like to open an issue to increase its visibility. In particular, it would be great if strsim could remain fairly lightweight since it is depended on by a bunch of crates via clap. I'm hoping that we, as an ecosystem, can be a bit more judicious in adding dependencies to crates.

For hashbrown, which is awesome, it is going to be included in the standard library very soon. Folks upgrading their Rust version will automatically get this optimization soon enough. I don't think it's worth adding it as an explicit dependency for a small gain for users on older versions of Rust, because it impacts compilation times for everyone.

For ndarray, it looks like it was just used for some small convenience. I don't think it's worth bringing ndarray in (as excellent as it is) along with all of its dependencies just for that.

Hi, thanks for bringing this up. I did consider the usual trade-offs of taking on dependencies. The performance gain seemed worth it to me, but I didn't know that hashbrown was going to be merged into the standard library until I saw it on Reddit recently. Based on that, I agree it doesn't need to be a dependency here.

For ndarray, that was laziness on my part. I should have separated it out to see if it alone would improve performance significantly. Based on the first comment in #32, it probably won't. I will try it for myself this weekend to verify. Apologies for not doing so before merging. After that, I will pull hashbrown out and ndarray as well, barring any surprising results from my testing.

@lovasoa, please let me know if you have any objections or concerns.

@BurntSushi, I'll try to be more diligent in the future, as I agree that lower-level libraries should have a correspondingly higher bar for taking on dependencies.

Thank you so much! :-)

From my testing, it appears that using ndarray does result in a speedup of about 17%.

0.9.0:

test benches::bench_damerau_levenshtein            ... bench:      20,601 ns/iter (+/- 474)

with ndarray:

test benches::bench_damerau_levenshtein            ... bench:      17,160 ns/iter (+/- 576)

17% is a bigger value than I expected, but it still doesn't seem large enough to justify all the sub-dependencies, especially considering only one of the metrics is affected. I'm going to sleep on it.
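For reference, the "about 17%" figure can be reproduced from the two bench lines above. The following is a minimal sketch; the function name `percent_reduction` is made up for illustration and is not part of strsim or the bench harness. It ignores the reported +/- noise on each measurement.

```rust
/// Relative reduction in time per iteration, as a percentage of the baseline.
/// (Hypothetical helper, only for checking the arithmetic in this thread.)
fn percent_reduction(baseline_ns: f64, candidate_ns: f64) -> f64 {
    (baseline_ns - candidate_ns) / baseline_ns * 100.0
}

fn main() {
    let baseline = 20_601.0; // 0.9.0, ns/iter
    let with_ndarray = 17_160.0; // ndarray branch, ns/iter

    // (20601 - 17160) / 20601 is roughly 0.167
    println!(
        "{:.1}% less time per iteration",
        percent_reduction(baseline, with_ndarray)
    ); // prints "16.7% less time per iteration"

    // The same result expressed as a throughput ratio.
    println!("{:.2}x throughput", baseline / with_ndarray); // prints "1.20x throughput"
}
```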

Can you point me in the right direction to re-run those benchmarks? I can take a look. It is surprising to me that ndarray is responsible for that speedup.

A compromise between adding a dependency on ndarray and using a vector of vectors may be to use a single flat array sized once up front, and compute the offset at access time, replacing

distances[i][j]

by

distances[i*width + j]

@BurntSushi, you can run cargo +nightly bench in the ndarray branch that I just pushed up and compare it to 0.9.0.

@lovasoa, nice suggestion. I just tried using a single, flat vector (see the flatten-vectors branch), and it's as fast as using ndarray. Looks like that's the way to go, and we can drop ndarray while retaining the performance boost.

I suppose ndarray is doing something similar under the hood, rather than the naive vector-of-vectors approach that I used.
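To illustrate the flat-vector idea discussed above, here is a minimal sketch that stores a distance matrix in one `Vec` and indexes it as `distances[i * width + j]`. It is an assumption-laden illustration, not the code from the flatten-vectors branch: it implements plain Levenshtein rather than Damerau-Levenshtein to stay short, and the name `levenshtein_flat` is made up for this example.

```rust
/// Levenshtein distance using a single flat Vec as a (height x width) matrix.
/// Hypothetical illustration; not strsim's actual implementation.
fn levenshtein_flat(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    let width = b.len() + 1; // number of columns
    let height = a.len() + 1; // number of rows

    // One allocation instead of one Vec per row.
    let mut distances = vec![0usize; width * height];

    // First column: deleting i characters from `a`.
    for i in 0..height {
        distances[i * width] = i;
    }
    // First row: inserting j characters from `b`.
    for j in 0..width {
        distances[j] = j;
    }

    for i in 1..height {
        for j in 1..width {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            let deletion = distances[(i - 1) * width + j] + 1;
            let insertion = distances[i * width + (j - 1)] + 1;
            let substitution = distances[(i - 1) * width + (j - 1)] + cost;
            // Cell (i, j) lives at offset i * width + j.
            distances[i * width + j] = deletion.min(insertion).min(substitution);
        }
    }

    // Bottom-right cell holds the final distance.
    distances[(height - 1) * width + (width - 1)]
}

fn main() {
    println!("{}", levenshtein_flat("kitten", "sitting")); // prints "3"
}
```

Compared with a vector of vectors, this keeps every row in one contiguous allocation and avoids a second pointer dereference per access. That is a plausible reason for the speedup reported in this thread, though the thread itself does not confirm the cause.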

OK, I've published v0.9.2, which removes hashbrown and replaces ndarray with the single-vector approach. Thanks everyone!

Thank you! :-)