java-diff-utils / java-diff-utils

Diff Utils is an open-source library for performing comparison / diff operations between texts or other kinds of data: computing diffs, applying patches, generating unified diffs or parsing them, generating diff output for easy display (like a side-by-side view), and so on.

Home Page: https://java-diff-utils.github.io/java-diff-utils/

Potential optimizations

Guillaume789 opened this issue · comments

Hi,

I have moved from the initial Google diff project to this one as I prefer the package naming convention and the class refactoring.
Also thank you for maintaining this library.

Myers's algorithm is considered the standard for diff calculation; however, it has O(ND) complexity (N being the combined input length, D the size of the edit script) and, in naive implementations, relies on matrices to evaluate all the possibilities.

It is not well suited to large files and has performance issues.

  1. A possible first improvement could be to reduce the size of the two input strings by removing the common sequences at the start and at the end.

Example:
Identify the initial common prefix:
org.apache.commons.lang.StringUtils.getCommonPrefix(…)
Identify the final common suffix (reverse the strings and take the common prefix):
org.apache.commons.lang.StringUtils.reverse(…)
org.apache.commons.lang.StringUtils.getCommonPrefix(…)

Once the Deltas are calculated, you can restore the initial and final context. (The positions of the computed Deltas need to be shifted by the length of the common prefix.)
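The prefix/suffix trimming described above could be sketched with the plain JDK (no Commons Lang needed). `TrimCommon` is a hypothetical helper name; it computes how many lines two files share at the start and at the end, so the diff only has to run on the middle section:

```java
import java.util.List;

// Hypothetical helper, not part of java-diff-utils: computes the
// lengths of the common prefix and common suffix of two line lists.
public class TrimCommon {

    static int commonPrefix(List<String> a, List<String> b) {
        int i = 0;
        int max = Math.min(a.size(), b.size());
        while (i < max && a.get(i).equals(b.get(i))) {
            i++;
        }
        return i;
    }

    // 'prefix' is passed in so the suffix never overlaps the prefix.
    static int commonSuffix(List<String> a, List<String> b, int prefix) {
        int n = 0;
        while (n < Math.min(a.size(), b.size()) - prefix
                && a.get(a.size() - 1 - n).equals(b.get(b.size() - 1 - n))) {
            n++;
        }
        return n;
    }
}
```

The diff then runs only on `a.subList(prefix, a.size() - suffix)` against the corresponding sublist of `b`.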

  2. Another possible improvement could be to identify the longest common subsequence (e.g. org.apache.commons.text.similarity.LongestCommonSubsequence) and to split the strings around it.
    Then calculate the Deltas on the split strings and aggregate them at the end, adjusting the positions and adding back the common part.

Optimizations 1 and 2 can be applied at the same time.
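A variant of optimization 2 could be sketched as follows (my assumption, not library code): instead of a full longest common subsequence, find the longest *contiguous* run of lines the two files share, then diff the parts before and after it independently. A simple O(n·m) dynamic program with a rolling array:

```java
import java.util.List;

// Sketch: find the longest contiguous run of lines shared by two files,
// returned as {startInA, startInB, length}. The caller can then diff
// a[0..startInA) vs b[0..startInB) and the tails after the run.
public class LongestCommonRun {

    static int[] find(List<String> a, List<String> b) {
        int bestLen = 0, bestA = 0, bestB = 0;
        // dp[j] = length of the common run ending at a[i-1] / b[j-1]
        int[] dp = new int[b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            int prev = 0; // holds dp[i-1][j-1]
            for (int j = 1; j <= b.size(); j++) {
                int cur = dp[j];
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[j] = prev + 1;
                    if (dp[j] > bestLen) {
                        bestLen = dp[j];
                        bestA = i - bestLen;
                        bestB = j - bestLen;
                    }
                } else {
                    dp[j] = 0;
                }
                prev = cur;
            }
        }
        return new int[] { bestA, bestB, bestLen };
    }
}
```

After splitting, the deltas from the right-hand diff need their positions shifted by the start of the common run plus its length, the same adjustment as in optimization 1.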

Many thanks,
Guillaume

Below is an interesting post on potential improvements:
https://denisbider.blogspot.com/2019/11/a-better-sliding-matrix-diff-algorithm.html

The above optimizations should address the performance issue reported by evgeny-pasynkov (#124).

Hi,
Below is an example of code that reduces the lists before processing them.

```java
import java.util.List;

import com.github.difflib.DiffUtils;
import com.github.difflib.patch.AbstractDelta;
import com.github.difflib.patch.ChangeDelta;
import com.github.difflib.patch.Chunk;
import com.github.difflib.patch.DeleteDelta;
import com.github.difflib.patch.DeltaType;
import com.github.difflib.patch.EqualDelta;
import com.github.difflib.patch.InsertDelta;
import com.github.difflib.patch.Patch;

public class DiffUtils2 {

    /**
     * Computes a patch, diffing only the region between the common
     * prefix and the common suffix of the two inputs.
     */
    public static Patch<String> diff(List<String> source, List<String> target) {
        int lcp = getIndexLCP(source, target);
        int lcs = getIndexLCS(source, target, lcp);

        Patch<String> patchAdj = new Patch<>();

        if (lcp == source.size() && lcp == target.size()) {
            // inputs are identical: no difference found
            return patchAdj;
        }

        List<String> sourceOpt = source.subList(lcp, source.size() - lcs);
        List<String> targetOpt = target.subList(lcp, target.size() - lcs);

        Patch<String> patch = DiffUtils.diff(sourceOpt, targetOpt);

        // shift the delta positions by the length of the trimmed prefix
        // TODO: if character offsets are ever needed, handle Unix and
        // Windows EOL (\n vs \r\n) when summing line lengths
        for (AbstractDelta<String> delta : patch.getDeltas()) {
            Chunk<String> sourceAdj = new Chunk<>(lcp + delta.getSource().getPosition(), delta.getSource().getLines());
            Chunk<String> targetAdj = new Chunk<>(lcp + delta.getTarget().getPosition(), delta.getTarget().getLines());
            patchAdj.addDelta(buildDelta(delta.getType(), sourceAdj, targetAdj));
        }

        return patchAdj;
    }

    /** Rebuilds a delta of the given type from adjusted chunks. */
    private static AbstractDelta<String> buildDelta(DeltaType type, Chunk<String> source, Chunk<String> target) {
        switch (type) {
        case EQUAL:
            return new EqualDelta<>(source, target);
        case INSERT:
            return new InsertDelta<>(source, target);
        case CHANGE:
            return new ChangeDelta<>(source, target);
        case DELETE:
            return new DeleteDelta<>(source, target);
        default:
            return null;
        }
    }

    /** Length of the longest common prefix of the two lists. */
    private static int getIndexLCP(List<String> source, List<String> target) {
        int i = 0;
        while (i < source.size() && i < target.size() && source.get(i).equals(target.get(i))) {
            i++;
        }
        return i;
    }

    /** Length of the longest common suffix, not overlapping the prefix. */
    private static int getIndexLCS(List<String> source, List<String> target, int lcp) {
        int lcs = 0;
        int i = source.size() - 1;
        int j = target.size() - 1;
        while (i >= lcp && j >= lcp && source.get(i).equals(target.get(j))) {
            i--;
            j--;
            lcs++;
        }
        return lcs;
    }
}
```

According to my tests, this updated class actually gives worse performance!
I will continue investigating...

Looking into branch introduce-optimized-meayers-al ... you will find an optimized version of Myers which gives at least a 400 percent performance boost.

@wumpz When is the optimization going to get merged into master and released to Maven Central?

Sorry for not informing you: 2b02951e8989eb74b132753feb21f1269cbfbf63. Since GitHub links issues from "fixes" and "closes" commit messages, I guessed this merge would somehow get linked to your issue as well. My bet.

So this change is already included in master and part of the latest snapshot. Look as well into the new factory for the default algorithm; there you can completely replace the algorithm that is used by default. However, I have not yet made this new algorithm the default, because the diff results will be different and there is a lot of code out there that depends on the "old" behavior.

So could you check it?

Looking into JGit, it seems that for extreme diffs, like 9000 lines against one line, it will not even run the diff itself but just construct a single complete change (remove all lines, insert the one line). That gives, at least in those cases, a drastic performance gain.

@wumpz

> Looking into JGit, it seems that for extreme diffs, like 9000 lines against one line, it will not even run the diff itself but just construct a single complete change (remove all lines, insert the one line). That gives, at least in those cases, a drastic performance gain.

That sounds bad to me. I would make this behavior configurable.
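A configurable version of that shortcut could be sketched as a simple pre-check (an assumption, not java-diff-utils API): when one input is vastly larger than the other, skip the diff algorithm entirely and report one whole-file change.

```java
import java.util.List;

// Sketch of a configurable JGit-style shortcut, not an existing API:
// callers pick the ratio at which the diff is skipped in favor of a
// single "replace everything" change; ratio = 0 disables the shortcut.
public class LopsidedShortcut {

    // true if one side is at least 'ratio' times larger than the other
    static boolean isLopsided(List<String> source, List<String> target, int ratio) {
        int small = Math.min(source.size(), target.size());
        int large = Math.max(source.size(), target.size());
        return ratio > 0 && small > 0 && large / small >= ratio;
    }
}
```

With a caller-supplied ratio the behavior stays opt-in, addressing the concern above: existing users keep exact diffs unless they ask for the fast path.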

On a related note: I am diffing a String that consists of a single line 600,000 characters long (the result of Arrays.deepToString() on a float[][][]), and as you can imagine this is extremely slow. To make matters worse, I am breaking the String down into codepoints and feeding in a List<Integer>, so per the profiler the majority of the time is spent invoking Integer.equals().

Out of the 600,000 characters, only a tiny minority have changed. I don't know how the diffing algorithm works, but I am wondering whether it would be quicker to diff two Strings of this length than two List<Integer>s. If so, it might be possible to divide the Strings into sub-sections, do a coarse-grained String.equals(), and then only do a codepoint comparison in sections that contain a difference. This might be quicker than invoking Integer.equals() hundreds of thousands of times.

Thoughts?
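The coarse-grained pre-filter suggested above could be sketched like this (my assumption, not an existing API): compare fixed-size blocks of the two strings with String.equals first, and only fall back to a fine-grained codepoint diff inside blocks that actually differ.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a block-level pre-filter: returns the [start, end)
// character ranges of blocks where the two strings differ, so a
// fine-grained diff only runs inside those ranges.
public class BlockPrefilter {

    static List<int[]> differingBlocks(String a, String b, int blockSize) {
        List<int[]> dirty = new ArrayList<>();
        int len = Math.max(a.length(), b.length());
        for (int start = 0; start < len; start += blockSize) {
            int end = Math.min(start + blockSize, len);
            // clamp the range to each string's actual length
            String blockA = a.substring(Math.min(start, a.length()), Math.min(end, a.length()));
            String blockB = b.substring(Math.min(start, b.length()), Math.min(end, b.length()));
            if (!blockA.equals(blockB)) {
                dirty.add(new int[] { start, end });
            }
        }
        return dirty;
    }
}
```

One caveat: this only pays off for in-place changes. A single insertion or deletion shifts every later character, marking all subsequent blocks dirty, so it would not help when the change alters the overall length early in the string.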