Severe performance problem
evgeny-pasynkov opened this issue
Describe the bug
The implementation runs 500x slower than the Myers algorithm implementation from jgit.
To Reproduce
Consider the case where the old file has 1 line and the new file has 90k different lines. Here is a sample program in Kotlin that demonstrates the difference:
import com.github.difflib.DiffUtils
import com.github.difflib.algorithm.myers.MyersDiff
import org.eclipse.jgit.diff.DiffAlgorithm
import java.util.function.BiPredicate

fun main() {
    val old = ArrayList<String>().apply {
        add("abcd")
    }
    val new = ArrayList<String>().apply {
        repeat(90_000) { i ->
            add(i.toString())
        }
    }

    run {
        println("Start difflib Myers")
        val start = System.currentTimeMillis()
        val diff = DiffUtils.diff(old, new, MyersDiff(BiPredicate { ai, bi ->
            ai == bi
        }))
        val end = System.currentTimeMillis()
        println("Finished in ${end - start}ms and resulted ${diff.deltas.size} deltas")
    }

    run {
        class Seq(val data: List<String>) : org.eclipse.jgit.diff.Sequence() {
            override fun size(): Int = data.size
        }

        class SeqCmp : org.eclipse.jgit.diff.SequenceComparator<Seq>() {
            override fun equals(a: Seq, ai: Int, b: Seq, bi: Int): Boolean = a.data[ai] == b.data[bi]
            override fun hash(seq: Seq, ptr: Int): Int = seq.data[ptr].hashCode()
        }

        println("Start jgit Myers")
        val start = System.currentTimeMillis()
        val diff = DiffAlgorithm.getAlgorithm(DiffAlgorithm.SupportedAlgorithm.MYERS)
            .diff(SeqCmp(), Seq(old), Seq(new))
        val end = System.currentTimeMillis()
        println("Finished in ${end - start}ms and resulted ${diff.size} deltas")
    }
}
This program prints the following (exact numbers will of course differ from machine to machine):
Start difflib Myers
Finished in 46187ms and resulted 1 deltas
Start jgit Myers
Finished in 115ms and resulted 1 deltas
Interesting. Since I started maintaining this project I have made no changes to the underlying algorithm itself, only some interface changes. I will look into this.
After some digging: java-diff-utils uses the original algorithm from Myers' paper, while jgit uses an optimized version with linear space that searches from both sides (forward and backward). I think the proper way is to reimplement this algorithm (https://blog.robertelder.org/diff-algorithm/). What's really interesting here is that the running time drops from O((len(original) + len(revised)) * D) to O(min(len(original), len(revised)) * D), where D is the number of differences (inserted plus deleted elements). For your example this results in a huge performance gain.
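To make the cost concrete, here is a minimal sketch of the basic greedy forward search from Myers' paper. It is not the java-diff-utils or jgit code; the names `shortestEdit` and `EditCost` are made up for illustration, and it only computes D (the number of inserted plus deleted elements), not the edit script itself. The point is that each round d visits every diagonal from -d to d, so when the two inputs share nothing, the loop count grows roughly with D squared.

```kotlin
// Hypothetical result holder: D plus the number of diagonals visited.
data class EditCost(val d: Int, val steps: Long)

// Basic greedy forward search (illustrative sketch, not library code).
fun <T> shortestEdit(a: List<T>, b: List<T>): EditCost {
    val n = a.size
    val m = b.size
    val max = n + m
    // v[offset + k] is the furthest x reached so far on diagonal k = x - y.
    val offset = max + 1
    val v = IntArray(2 * max + 3)
    var steps = 0L
    for (d in 0..max) {
        var k = -d
        while (k <= d) {
            steps++
            // Step down from diagonal k + 1 (insertion) or right from k - 1 (deletion),
            // whichever starts from the further point.
            var x = if (k == -d || (k != d && v[offset + k - 1] < v[offset + k + 1])) {
                v[offset + k + 1]
            } else {
                v[offset + k - 1] + 1
            }
            var y = x - k
            // Follow the snake: equal elements are consumed at no cost.
            while (x < n && y < m && a[x] == b[y]) {
                x++
                y++
            }
            v[offset + k] = x
            if (x >= n && y >= m) return EditCost(d, steps)
            k += 2
        }
    }
    error("unreachable: an edit script of length n + m always exists")
}

fun main() {
    // Scaled-down version of the reported case: 1 old line, 2,000 unrelated new lines.
    val old = listOf("abcd")
    val new = List(2_000) { it.toString() }
    val cost = shortestEdit(old, new)
    println("D = ${cost.d}, diagonals visited = ${cost.steps}")
}
```

With 1 old line and 90,000 unrelated new lines, D is 90,001, so this sketch would visit on the order of D²/2, about 4 × 10⁹ diagonals. That is consistent with the slowdown reported above, although the actual library does more work per step than this sketch.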
If you look at the branch introduce-optimized-meayers-al ... you will find an optimized version of Myers that gives at least a 400 percent performance boost.
The reason for the massive performance gain with JGit is that it is not using a Myers-type algorithm but a histogram variant of a diff algorithm.