Diff Calculating wrong results for certain cases

Question

AjithAj2125 opened this issue a year ago

I've been working with and testing this algorithm (C#). It works as expected most of the time, but there are a few cases where the algorithm doesn't return correct results.

Example 1:

The expected output for the above case would be Insert "o", Delete "n", Insert "n", Delete "o", but the output that the algorithm gives is wrong.

Example 2:

The same issue arises here too, even though the numbers are treated as strings. The expected output would be Insert "3", Delete "0", Insert "1", Delete "3", Equal ".00".

While these are just some random test cases, I think the common pattern is that the issue arises when the same characters appear in both strings but have been inserted and deleted, or occur in a different order.

Has anyone observed this, and has it been addressed before? Can I get some guidance on how to handle these cases, as I'm trying to use this algorithm?

Or am I not using it the right way? Any help would be appreciated.

Dennis Snell · Answer 1 · Wed Jul 12 2023 21:36:04 GMT+0800 (China Standard Time)

@AjithAj2125 I'm having trouble making sense of your expected outputs. It may help to review what the output of diff_main is providing: a way to move from oldstr to newstr.

In the case of "no" to "on" we can examine the output and reconstruct the diff.

start with "no", index is at the start, at 0
delete the "n", string is now "o" and index is still 0
equal is "o", string is still "o", but the index advanced by the length of the equal segment, so it's 1
insert "n", string is now "on", and since we inserted, the index into the original string stays the same.

now we're done and the transformed string is "on", which is the newstr - exactly what we provided.

in other words the output is an edit script - a list of operations to apply to oldstr so that we end up with newstr after applying them.
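To make the idea of applying an edit script concrete, here is a minimal Python sketch. It does not use diff-match-patch; the names `apply_edit_script`, `DELETE`, `EQUAL` and `INSERT` are hypothetical, chosen for illustration. It assumes a script is a list of (operation, text) pairs like the output discussed above. Unlike the walk-through, which edits the string in place, this sketch reads from the unmodified oldstr, so its index advances over deleted text as well.

```python
# Hypothetical operation tags for this sketch (not the library's constants).
DELETE, EQUAL, INSERT = "delete", "equal", "insert"


def apply_edit_script(oldstr, diffs):
    """Apply a list of (operation, text) pairs to oldstr and return the result."""
    result = []
    index = 0  # position in the unmodified oldstr
    for op, text in diffs:
        if op in (DELETE, EQUAL):
            # Both operations consume text from oldstr, so it must match.
            if oldstr[index:index + len(text)] != text:
                raise ValueError(f"script does not match oldstr at index {index}")
            if op == EQUAL:
                result.append(text)  # kept text is copied to the output
            index += len(text)
        elif op == INSERT:
            result.append(text)  # inserted text consumes nothing from oldstr
        else:
            raise ValueError(f"unknown operation: {op!r}")
    if index != len(oldstr):
        raise ValueError("script did not consume all of oldstr")
    return "".join(result)


script = [(DELETE, "n"), (EQUAL, "o"), (INSERT, "n")]
print(apply_edit_script("no", script))  # on
```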

the operations should always result in newstr, but it may be possible to find shorter edit scripts. for example, another legitimate output could be Diff(DELETE, "no"), Diff(INSERT, "on"). diff-match-patch attempts to quickly find an edit script that's small and also human-readable.
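One way to check that two different scripts are both legitimate is to rebuild both texts from each script. The sketch below is plain Python, not library code; `source_text`, `target_text` and `is_valid_script` are hypothetical helper names, and a script is assumed to be a list of (operation, text) pairs.

```python
# Hypothetical operation tags for this sketch (not the library's constants).
DELETE, EQUAL, INSERT = "delete", "equal", "insert"


def source_text(diffs):
    """Rebuild the old text: everything except insertions."""
    return "".join(text for op, text in diffs if op != INSERT)


def target_text(diffs):
    """Rebuild the new text: everything except deletions."""
    return "".join(text for op, text in diffs if op != DELETE)


def is_valid_script(oldstr, newstr, diffs):
    """A script is valid if it starts from oldstr and ends at newstr."""
    return source_text(diffs) == oldstr and target_text(diffs) == newstr


fine_grained = [(DELETE, "n"), (EQUAL, "o"), (INSERT, "n")]
replace_all = [(DELETE, "no"), (INSERT, "on")]

for name, script in [("fine_grained", fine_grained), ("replace_all", replace_all)]:
    print(name, is_valid_script("no", "on", script), len(script))
```

Both scripts pass the check; they differ only in how many operations they use and how much text each one touches.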

does this help clarify what the library is doing?

Ajith D · Answer 2 · Thu Jul 13 2023 03:01:00 GMT+0800 (China Standard Time)

@dmsnell thank you for the swift response. That clears up my doubt.

I had misunderstood the whole thing. I thought the algorithm reported, position by position, the edit that takes place in each of the strings, which explains the output I was expecting.

Just to explain my "expected output":

Consider newstr = "on" and oldstr = "no", where each edit in newstr happens in place of the character at the same position in oldstr.

At 0th position
- o is inserted in the newstr
- n is deleted in the oldstr

Hence the output for the first edit would be Insert o, Delete n.

Moving to 1st position
- n is inserted in the newstr
- o is deleted in the oldstr

Hence the output for second edit would be Insert n, Delete o.

Combining the outputs from each step gives my final expected output.

And as you mentioned earlier, Diff(DELETE, "no"), Diff(INSERT, "on") is another legitimate output.

I know what I'm expecting can be achieved with for loops that traverse both strings simultaneously and find the edits, but I wasn't sure how optimal that would be, given that I'm parsing huge chunks of text. That's when I came across this algorithm, which was readily available.
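For reference, the position-by-position comparison described above could be sketched roughly like this in Python. This is only an illustration of that idea, not what diff-match-patch does; `positional_diff` is a hypothetical name, and how to handle strings of different lengths (emit a lone Insert or Delete for the leftover characters) is an assumption of the sketch.

```python
from itertools import zip_longest


def positional_diff(oldstr, newstr):
    """Compare two strings position by position.

    Differing positions yield Insert (new char) then Delete (old char);
    runs of matching positions are merged into a single Equal.
    """
    ops = []
    for old_ch, new_ch in zip_longest(oldstr, newstr):
        if old_ch == new_ch:
            if ops and ops[-1][0] == "Equal":
                # Extend the current run of equal characters.
                ops[-1] = ("Equal", ops[-1][1] + old_ch)
            else:
                ops.append(("Equal", old_ch))
        else:
            # Assumption: past the end of one string, only one side is emitted.
            if new_ch is not None:
                ops.append(("Insert", new_ch))
            if old_ch is not None:
                ops.append(("Delete", old_ch))
    return ops


print(positional_diff("no", "on"))
# [('Insert', 'o'), ('Delete', 'n'), ('Insert', 'n'), ('Delete', 'o')]
```

A single pass like this is linear in the length of the strings, but it has no notion of shifted text: one character inserted at the front makes every later position compare as different.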

Dennis Snell · Answer 3 · Thu Jul 13 2023 03:39:49 GMT+0800 (China Standard Time)

"how optimal it could have turned out"

There's a complicated algorithm to find the minimum edit script and it has some catastrophic edge cases. Libraries like diff-match-patch take some shortcuts and give up getting the most "optimal" set of changes (which is another way of saying, the minimum number of operations to apply to the old string in order to reach the new string). What they get in return is more predictable performance and a reasonably-small edit script.
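To give a rough sense of what "minimum" means here, the sketch below counts the fewest single-character inserts and deletes needed to turn one string into another, using the textbook longest-common-subsequence table. This is a swapped-in illustration, not the algorithm diff-match-patch uses, and `min_insert_delete_count` is a hypothetical name. The table has one cell per pair of positions, so its cost grows with the product of the two lengths, which hints at why exact answers get expensive on large texts.

```python
def min_insert_delete_count(oldstr, newstr):
    """Fewest single-character inserts plus deletes turning oldstr into newstr."""
    rows, cols = len(oldstr) + 1, len(newstr) + 1
    # lcs[i][j] = length of the longest common subsequence
    # of oldstr[:i] and newstr[:j]
    lcs = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if oldstr[i - 1] == newstr[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    common = lcs[-1][-1]
    # Every character outside the common subsequence is deleted or inserted.
    return (len(oldstr) - common) + (len(newstr) - common)


print(min_insert_delete_count("no", "on"))  # 2
```

For "no" to "on" the count is 2 (one character deleted, one inserted), which matches the amount of text touched by the Delete "n", Equal "o", Insert "n" script discussed earlier in the thread.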

diff-match-patch actually de-optimizes some edits in order to make them more closely match the operations a human is doing when changing a text. That is, some outputs could be more "optimal" but obscure what was actually done to change the texts.