java-diff-utils / java-diff-utils

Diff Utils library is an OpenSource library for performing the comparison / diff operations between texts or some kind of data: computing diffs, applying patches, generating unified diffs or parsing them, generating diff output for easy future displaying (like side-by-side view) and so on.

Home Page: https://java-diff-utils.github.io/java-diff-utils/

Severe performance problem

evgeny-pasynkov opened this issue

Describe the bug
The implementation runs about 500x slower than the Myers algorithm implementation from jgit.

To Reproduce
Consider an old file with a single line and a new file with 90k different lines. Here is a sample program in Kotlin that demonstrates the difference:

import com.github.difflib.DiffUtils
import com.github.difflib.algorithm.myers.MyersDiff
import org.eclipse.jgit.diff.DiffAlgorithm
import java.util.function.BiPredicate

fun main() {
    val old = ArrayList<String>().apply {
        add("abcd")
    }
    val new = ArrayList<String>().apply {
        repeat(90_000) { i ->
            add(i.toString())
        }
    }


    run {
        println("Start difflib Myers")
        val start = System.currentTimeMillis()
        val diff = DiffUtils.diff(old, new, MyersDiff(BiPredicate { ai, bi ->
            ai == bi
        }))
        val end = System.currentTimeMillis()
        println("Finished in ${end - start}ms and resulted ${diff.deltas.size} deltas")
    }


    run {
        class Seq(val data: List<String>) : org.eclipse.jgit.diff.Sequence() {
            override fun size(): Int = data.size
        }

        class SeqCmp : org.eclipse.jgit.diff.SequenceComparator<Seq>() {
            override fun equals(a: Seq, ai: Int, b: Seq, bi: Int): Boolean = a.data[ai] == b.data[bi]
            override fun hash(seq: Seq, ptr: Int): Int = seq.data[ptr].hashCode()
        }

        println("Start jgit Myers")
        val start = System.currentTimeMillis()
        val diff = DiffAlgorithm.getAlgorithm(DiffAlgorithm.SupportedAlgorithm.MYERS)
            .diff(SeqCmp(), Seq(old), Seq(new))
        val end = System.currentTimeMillis()
        println("Finished in ${end - start}ms and resulted ${diff.size} deltas")
    }

}

This program prints the following (the exact numbers will of course differ from machine to machine):

Start difflib Myers
Finished in 46187ms and resulted 1 deltas
Start jgit Myers
Finished in 115ms and resulted 1 deltas

Interesting. Since I started maintaining this project I have made no changes to the underlying algorithm itself, only some interface changes. I will look into this.

After some digging: java-diff-utils uses the original algorithm from Myers' paper, while jgit uses an optimized version that runs in linear space and searches from both ends (forward and backward). I think the proper fix is to reimplement the algorithm along those lines (https://blog.robertelder.org/diff-algorithm/). What's really interesting here is that the running time drops from O((len(original) + len(revised)) * D) to O(min(len(original), len(revised)) * D), where D is the size of the shortest edit script (inserted plus deleted elements). For your example this results in a huge performance gain.
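To make the cost of the original algorithm concrete, here is a minimal sketch of the basic forward-only greedy search from Myers' paper. It is not the java-diff-utils or jgit code, and the function name `shortestEditLength` is made up for illustration. It only computes the length D of the shortest edit script (insertions plus deletions) and does not build the deltas.

```kotlin
// Minimal sketch of the basic (forward-only) Myers greedy search.
// Hypothetical helper, not part of java-diff-utils or jgit.
// Returns D, the number of insertions plus deletions needed to turn a into b.
fun <T> shortestEditLength(a: List<T>, b: List<T>): Int {
    val n = a.size
    val m = b.size
    val max = n + m
    if (max == 0) return 0
    // v[k + max] holds the furthest x reached so far on diagonal k = x - y.
    val v = IntArray(2 * max + 2)
    for (d in 0..max) {
        // In round d only diagonals -d, -d + 2, ..., d can be reached.
        var k = -d
        while (k <= d) {
            // Extend from the neighbouring diagonal that got furthest:
            // from k + 1 by stepping down (insertion), or
            // from k - 1 by stepping right (deletion).
            var x = if (k == -d || (k != d && v[k - 1 + max] < v[k + 1 + max])) {
                v[k + 1 + max]
            } else {
                v[k - 1 + max] + 1
            }
            var y = x - k
            // Follow the diagonal for free while the elements are equal.
            while (x < n && y < m && a[x] == b[y]) {
                x++
                y++
            }
            v[k + max] = x
            if (x >= n && y >= m) return d
            k += 2
        }
    }
    error("unreachable: an edit script of length n + m always exists")
}
```

Round d visits d + 1 diagonals, so the search visits on the order of D²/2 diagonals in total. For the input in this issue (1 old line, 90,000 new lines, nothing in common) D is 90,001, which gives roughly 4 × 10^9 diagonal visits even though the result is a single delta.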

In the branch introduce-optimized-meayers-al ... you will find an optimized version of Myers that gives at least a 400 percent performance boost.

The reason for the massive performance gain with JGit is that it is not using a Myers-type algorithm but a histogram variant of a diff algorithm.