vickumar1981 / stringdistance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more..

Home Page:https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

longestCommonSeq implementation has exponential complexity

anschlechter opened this issue · comments

longestCommonSeq is not efficient, it takes a very long time if the strings to be compared have a large distance.

The implemented algorithm has exponential complexity O(2^(n+m)).

Expected Result

predictable performance. algorithm should have O(n*m) complexity as is possible.

c.f. https://www.techiedelight.com/longest-common-subsequence/

--> translated to SCALA:

def LCSLength(X: String, Y: String): Int = {
val m = X.length
val n = Y.length
// lookup table stores solution to already computed subproblems,
// i.e., T[i][j] stores the length of LCS of substring
// X[0…i-1] and Y[0…j-1]
val T = Array.ofDim[Int](m + 1, n + 1)
// fill the lookup table in a bottom-up manner
for (i <- 1 to m) {
for (j <- 1 to n) {
if (X.charAt(i - 1) == Y.charAt(j - 1)) {
// if the current character of X and Y matches
T(i)(j) = T(i - 1)(j - 1) + 1
}
else {
// otherwise, if the current character of X and Y don't match
T(i)(j) = Integer.max(T(i - 1)(j), T(i)(j - 1))
}
}
}
// LCS will be the last entry in the lookup table
T(m)(n)
}

Actual Result

"unpredictable" runtime

Reproduction Steps

Compare "RUE NELSON MANDELA" with "RUE ALBERT UNGEHEUER" -> it just takes way to long to find out that these are not the same.

@anschlechter thanks for reporting the issue.

the implementation used in the library for longestCommonSeq is here:
https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/impl/LongestCommonSeqImpl.scala#L4
(fixed)

Will add that test case and compare implementations. 👍

@anschlechter published a 1.2.5-SNAPSHOT with your changes included here: #65

Sonyatype snapshot repo: https://oss.sonatype.org/content/repositories/snapshots/com/github/vickumar1981/

Testing on my end, it looks to be much faster :). Thanks for the great test cases and the implementation fix!

If you could/would possibly pull down or use the 1.2.5-SNAPSHOT and confirm that this fixes the issue, then I can merge these changes and include them in the 1.2.5 release on maven.

Thanks again! cheers.

Hi, I tried the 1.2.5-SNAPSHOT and it looks good to me. Thank you.

Addressed by: #66

Fixed in version 1.2.5. Published v1.2.5 artifact to sonatype and should be available on maven central shortly, once it syncs.