vickumar1981 / stringdistance

A fuzzy matching string distance library for Scala and Java that includes Levenshtein distance, Jaro distance, Jaro-Winkler distance, Dice coefficient, N-Gram similarity, Cosine similarity, Jaccard similarity, Longest common subsequence, Hamming distance, and more.

Home Page: https://vickumar1981.github.io/stringdistance/api/com/github/vickumar1981/stringdistance/index.html

Bug in CommonStringDistanceAlgo.getCommonChars

brainliu81 opened this issue:

def getCommonChars(s1: String, s2: String, halfLen: Int): String = {
  val commonChars = new StringBuilder()
  val strCopy = new StringBuilder(s2)
  var n = s1.length
  val m = s2.length
  s1.zipWithIndex.foreach {
    case (ch, chIndex) => {
      var foundIt = false
      var j = math.max(0, chIndex - halfLen)
      while (!foundIt && j <= Math.min(chIndex + halfLen, m - 1)) {
        if (strCopy(j) == ch) {
          foundIt = true
          commonChars.append(ch)
          strCopy.setCharAt(j, '\0')
        }
        j += 1
      }
    }
  }
  commonChars.toString
}

@brainliu81 Do you have a test case that I can use to debug/fix this? Thanks, I will take a look into this. Apologies for the late response.

This function is used in the jaro and jaroWinkler implementations.

https://github.com/vickumar1981/stringdistance/blob/master/src/main/scala/com/github/vickumar1981/stringdistance/impl/JaroImpl.scala#L16

Are those implementations not providing you a correct score for a pair of known values?

For example, given the two strings "MARTHA" and "MARHTA", the Jaro score should be 0.944 and the Jaro-Winkler score ought to be 0.961. I double-checked most of my test cases using this site: https://asecuritysite.com/forensics/simstring
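
To make the "MARTHA" / "MARHTA" check reproducible, here is a minimal standalone sketch of the usual Jaro and Jaro-Winkler formulas built on a common-characters helper. It is not the library's code: the `JaroSketch` object, the `halfLen = max(len) / 2 - 1` window, the 0.1 prefix scale, and the 4-character prefix cap are assumptions taken from the textbook definitions, so treat it as a reference for comparing scores rather than a fix.

```scala
// Hypothetical reference sketch, not the stringdistance library's implementation.
object JaroSketch {
  // Characters of s1 that also occur in s2 within halfLen positions, in s1 order.
  // Each position of s2 can be matched at most once.
  def getCommonChars(s1: String, s2: String, halfLen: Int): String = {
    val common = new StringBuilder()
    val used = new Array[Boolean](s2.length)
    for (i <- s1.indices) {
      val lo = math.max(0, i - halfLen)
      val hi = math.min(i + halfLen, s2.length - 1)
      // First unused position in the window holding the same character, if any.
      (lo to hi).find(k => !used(k) && s2(k) == s1(i)).foreach { k =>
        used(k) = true
        common.append(s1(i))
      }
    }
    common.toString
  }

  def jaro(s1: String, s2: String): Double = {
    // Assumed window size: half the longer length, minus one, never negative.
    val halfLen = math.max(0, math.max(s1.length, s2.length) / 2 - 1)
    val c1 = getCommonChars(s1, s2, halfLen)
    val c2 = getCommonChars(s2, s1, halfLen)
    // Defensive guard in this sketch: no matches, or mismatched match counts, score 0.
    if (c1.isEmpty || c1.length != c2.length) 0.0
    else {
      val m = c1.length.toDouble
      // Transpositions: matched characters out of order, counted in pairs.
      val t = c1.zip(c2).count { case (a, b) => a != b } / 2
      (m / s1.length + m / s2.length + (m - t) / m) / 3.0
    }
  }

  def jaroWinkler(s1: String, s2: String, p: Double = 0.1): Double = {
    val j = jaro(s1, s2)
    // Length of the shared prefix, capped at 4 characters.
    val prefix = s1.zip(s2).takeWhile { case (a, b) => a == b }.length.min(4)
    j + prefix * p * (1.0 - j)
  }
}
```

Under these assumptions, `JaroSketch.jaro("MARTHA", "MARHTA")` comes out at about 0.944 and `JaroSketch.jaroWinkler("MARTHA", "MARHTA")` at about 0.961, matching the values quoted above. A concrete input pair where the library's score differs from this sketch would be a useful test case for the issue.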