Severe performance problem

evgeny-pasynkov opened this issue 3 years ago

Describe the bug
The implementation runs about 500x slower than the Myers algorithm implementation from jgit.

To Reproduce
Consider an old file with a single line and a new file with 90k different lines. Here is a sample program in Kotlin that demonstrates the difference:

import com.github.difflib.DiffUtils
import com.github.difflib.algorithm.myers.MyersDiff
import org.eclipse.jgit.diff.DiffAlgorithm
import java.util.function.BiPredicate

fun main() {
    val old = ArrayList<String>().apply {
        add("abcd")
    }
    val new = ArrayList<String>().apply {
        repeat(90_000) { i ->
            add(i.toString())
        }
    }


    run {
        println("Start difflib Myers")
        val start = System.currentTimeMillis()
        val diff = DiffUtils.diff(old, new, MyersDiff(BiPredicate { ai, bi ->
            ai == bi
        }))
        val end = System.currentTimeMillis()
        println("Finished in ${end - start}ms and resulted ${diff.deltas.size} deltas")
    }


    run {
        class Seq(val data: List<String>) : org.eclipse.jgit.diff.Sequence() {
            override fun size(): Int = data.size
        }

        class SeqCmp : org.eclipse.jgit.diff.SequenceComparator<Seq>() {
            override fun equals(a: Seq, ai: Int, b: Seq, bi: Int): Boolean = a.data[ai] == b.data[bi]
            override fun hash(seq: Seq, ptr: Int): Int = seq.data[ptr].hashCode()
        }

        println("Start jgit Myers")
        val start = System.currentTimeMillis()
        val diff = DiffAlgorithm.getAlgorithm(DiffAlgorithm.SupportedAlgorithm.MYERS)
            .diff(SeqCmp(), Seq(old), Seq(new))
        val end = System.currentTimeMillis()
        println("Finished in ${end - start}ms and resulted ${diff.size} deltas")
    }

}

This program prints the following (the exact numbers will of course differ from machine to machine):

Start difflib Myers
Finished in 46187ms and resulted 1 deltas
Start jgit Myers
Finished in 115ms and resulted 1 deltas

Tobias · Answer 1 · Thu Jun 03 2021 13:59:59 GMT+0800 (China Standard Time)

Interesting. Since I started maintaining this project, I have made no changes to the underlying algorithm itself,
only some interface changes. I will look into this.

Tobias · Answer 2 · Thu Jun 03 2021 15:28:20 GMT+0800 (China Standard Time)

After some digging: java-diff-utils uses the original algorithm from Myers' paper, while jgit uses an optimized version that runs in linear space and searches from both ends (forward and backward). I think the proper fix is to reimplement the algorithm along those lines (https://blog.robertelder.org/diff-algorithm/). What's really interesting here is that the running time drops from O((len(original) + len(revised)) * D) to O(min(len(original), len(revised)) * D), where D is the size of the edit script (the number of inserted plus deleted lines). For your example this results in a huge performance gain.
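To make the cost of the original algorithm concrete, here is a minimal sketch of the classic greedy forward search from Myers' paper. It is an illustration only, not the java-diff-utils or jgit code, and the names classicMyersSearch, SearchResult and diagonalVisits are made up for this example. It returns only the edit distance D plus a count of how many diagonals the search visited, which is a rough proxy for the work done.

```kotlin
// Illustrative sketch with hypothetical names; not taken from java-diff-utils or jgit.
data class SearchResult(val distance: Int, val diagonalVisits: Long)

// Classic greedy forward search: for each edit count d, extend the
// furthest-reaching path on every diagonal k in -d..d (step 2).
fun <T> classicMyersSearch(original: List<T>, revised: List<T>): SearchResult {
    val n = original.size
    val m = revised.size
    val max = n + m
    val offset = max                          // shifts diagonal k into a non-negative index
    val furthestX = IntArray(2 * max + 2)     // furthestX[offset + k] = furthest x on diagonal k
    var visits = 0L

    for (d in 0..max) {
        var k = -d
        while (k <= d) {
            visits++
            // Choose whether this diagonal is reached by an insertion (from k + 1)
            // or by a deletion (from k - 1), whichever gets further.
            val fromAbove = k == -d ||
                (k != d && furthestX[offset + k - 1] < furthestX[offset + k + 1])
            var x = if (fromAbove) furthestX[offset + k + 1] else furthestX[offset + k - 1] + 1
            var y = x - k
            // Follow the "snake": consume equal elements at no cost.
            while (x < n && y < m && original[x] == revised[y]) {
                x++
                y++
            }
            furthestX[offset + k] = x
            if (x >= n && y >= m) return SearchResult(d, visits)
            k += 2
        }
    }
    // Not reached for valid input: the end is always found within n + m edits.
    return SearchResult(max, visits)
}
```

Round d of the outer loop visits d + 1 diagonals, so the total work grows roughly with the square of D when the two inputs share nothing. For the input in this issue (1 old line, 90,000 new lines, nothing in common) D is 90,001, which puts the number of diagonal visits on the order of 4 billion. Calling classicMyersSearch(listOf("abcd"), listOf("0", "1", "2")) is a scaled-down version of the same situation and is cheap enough to step through by hand.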

Tobias · Answer 3 · Sun Aug 15 2021 05:23:14 GMT+0800 (China Standard Time)

If you look into the branch introduce-optimized-meayers-al ... you will find an optimized version of Myers that gives at least a 400 percent performance boost.

The reason for the massive performance gain with JGit is that it is not using a Myers-type algorithm but a histogram variant of a diff algorithm.