Diff Calculating wrong results for certain cases
AjithAj2125 opened this issue · comments
I've been working with and testing this algorithm (C#). It works as expected most of the time, but there are a few cases where the algorithm doesn't return correct results.
Example 1 :
The expected output for the above case would be Insert "o", Delete "n", Insert "n", Delete "o"
but the output that the algo gives is wrong.
The same issue arises here too, even though numbers are treated as strings here. The expected output would be Insert "3", Delete "0", Insert "1", Delete "3", Equal ".00"
While these are just some random test cases, I think the common pattern is that the issue arises when the same characters appear in both strings but have been inserted and deleted, or occur in a different order.
Has this been observed by anyone and has this been addressed before? Can I get some guidance on how to handle these cases as I'm trying to utilize this algorithm.
Or am I not using it the right way? Any help would be appreciated.
@AjithAj2125 I'm having trouble making sense of your expected outputs. It may help to review what the output of `diff_main` is providing: a way to move from `oldstr` to `newstr`.
In the case of `no` to `on` we can examine the output and reconstruct the diff.
- start with `no`; the index is at the start, at 0
- delete the `n`; the string is now `o` and the index is still 0
- equal is `o`; the string is still `o`, but the index advanced by the length of the equal segment, so it's 1
- insert `n`; the string is now `on`, and since we inserted, the index into the original string is the same.

now we're done and the transformed string is `on`, which is the `newstr` - exactly what we provided.
in other words the output is an edit script - a list of operations to apply to `oldstr` so that we end up with `newstr` after applying them.
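As a rough illustration of that idea, here is a minimal Python sketch that replays an edit script against a string. It is not part of diff-match-patch: `apply_edit_script` is a hypothetical helper, and the script is written out by hand as `(op, text)` tuples using -1 for delete, 0 for equal and 1 for insert, which mirrors how the Python port represents diffs.

```python
DELETE, EQUAL, INSERT = -1, 0, 1


def apply_edit_script(oldstr, script):
    """Replay (op, text) operations on oldstr and return the result.

    Hypothetical helper for illustration only.
    """
    text = oldstr
    index = 0  # cursor into the string being transformed
    for op, segment in script:
        end = index + len(segment)
        if op == DELETE:
            if text[index:end] != segment:
                raise ValueError(f"cannot delete {segment!r} at index {index}")
            # Remove the segment; the cursor stays where it is.
            text = text[:index] + text[end:]
        elif op == EQUAL:
            if text[index:end] != segment:
                raise ValueError(f"expected {segment!r} at index {index}")
            # Nothing changes; just step over the unchanged segment.
            index = end
        elif op == INSERT:
            # Add the segment and step past it, so that any later
            # operation applies after the inserted text.
            text = text[:index] + segment + text[index:]
            index = end
        else:
            raise ValueError(f"unknown operation {op!r}")
    return text


script = [(DELETE, "n"), (EQUAL, "o"), (INSERT, "n")]
print(apply_edit_script("no", script))  # prints: on
```

Any script that is valid for the same pair of strings should produce the same result when replayed this way, regardless of how many operations it contains.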
the operations should always result in `newstr`, but it may be possible to find shorter edit scripts. for example, another legitimate output could be `Diff(DELETE, "no"), Diff(INSERT, "on")`. `diff-match-patch` attempts to quickly find an edit script that's small and also human-readable.
does this help clarify what the library is doing?
@dmsnell thank you for the swift response. And that clears my doubt.
I had gotten the whole thing wrong. I had understood it as the algorithm giving an output for each edit that takes place for each of the strings positionally, which should explain what I was expecting as an output.
Just to make sense out of my "expected output".
Consider `newstr = on` and `oldstr = no`, where each edit in `newstr` happens in place of another character at the same position in `oldstr`.

- At the 0th position, `o` is inserted in the `newstr` and `n` is deleted in the `oldstr`. Hence the output for the first edit would be Insert `o`, Delete `n`.
- Moving to the 1st position, `n` is inserted in the `newstr` and `o` is deleted in the `oldstr`. Hence the output for the second edit would be Insert `n`, Delete `o`.
Combining the outputs of both steps would result in my final output.
And as you mentioned earlier, `Diff(DELETE, "no")` and `Diff(INSERT, "on")` is also another legitimate output.
I know what I'm expecting can be achieved by using for loops, traversing both strings simultaneously and finding the edits, but I wasn't sure about how optimal it could have turned out, given that I'm parsing huge chunks of text. That's when I came across this algorithm, which was readily available.
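For reference, the position-by-position comparison described here could look roughly like the Python sketch below. `positional_edits` is a hypothetical name, and the "Insert"/"Delete"/"Equal" labels simply follow the wording of the expected output in this thread; this is not something diff-match-patch computes.

```python
from itertools import zip_longest


def positional_edits(oldstr, newstr):
    """List edits by comparing both strings one position at a time.

    Hypothetical helper: a naive index-by-index comparison.
    """
    edits = []
    # zip_longest pads the shorter string with None.
    for old_ch, new_ch in zip_longest(oldstr, newstr):
        if old_ch == new_ch:
            # Merge consecutive unchanged characters into one Equal run.
            if edits and edits[-1][0] == "Equal":
                edits[-1] = ("Equal", edits[-1][1] + old_ch)
            else:
                edits.append(("Equal", old_ch))
            continue
        if new_ch is not None:
            edits.append(("Insert", new_ch))
        if old_ch is not None:
            edits.append(("Delete", old_ch))
    return edits


print(positional_edits("no", "on"))
# prints: [('Insert', 'o'), ('Delete', 'n'), ('Insert', 'n'), ('Delete', 'o')]
```

This runs in a single pass, but it only lines up characters that share an index. A single character added at the start of `newstr` shifts everything after it, so every later position that no longer lines up is reported as changed.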
> how optimal it could have turned out
There's a complicated algorithm to find the minimum edit script and it has some catastrophic edge cases. Libraries like `diff-match-patch` take some shortcuts and give up getting the most "optimal" set of changes (which is another way of saying, the minimum number of operations to apply to the old string in order to reach the new string). What they get in return is more predictable performance and a reasonably-small edit script.
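To make "minimum edit script" concrete, here is a small Python sketch that builds one from a textbook longest-common-subsequence table. It is an assumption-laden illustration, not the algorithm diff-match-patch uses: `minimal_edit_script` is a hypothetical name, it emits one operation per character, and it needs time and memory proportional to `len(oldstr) * len(newstr)`, so it only suits short inputs.

```python
DELETE, EQUAL, INSERT = -1, 0, 1


def minimal_edit_script(oldstr, newstr):
    """Build a per-character edit script with the fewest inserts and deletes.

    Hypothetical helper based on a longest-common-subsequence table.
    """
    n, m = len(oldstr), len(newstr)
    # lcs[i][j] = length of the longest common subsequence
    # of oldstr[i:] and newstr[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if oldstr[i] == newstr[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start, keeping as many characters as possible.
    script = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and oldstr[i] == newstr[j]:
            script.append((EQUAL, oldstr[i]))
            i += 1
            j += 1
        elif j == m or (i < n and lcs[i + 1][j] >= lcs[i][j + 1]):
            script.append((DELETE, oldstr[i]))
            i += 1
        else:
            script.append((INSERT, newstr[j]))
            j += 1
    return script


print(minimal_edit_script("no", "on"))
# prints: [(-1, 'n'), (0, 'o'), (1, 'n')]
```

For `no` and `on` this happens to produce the same three operations discussed earlier in the thread: one delete, one equal and one insert.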
`diff-match-patch` actually de-optimizes some edits in order to make them more closely match the operations a human is doing when changing a text. That is, some outputs could be more "optimal" but obscure what was actually done to change the texts.