library doing multiple UpdateAttrib and one DeleteNode instead of just one DeleteNode

Question

nicevibesplus opened this issue 4 years ago

Hey, I ran into this behaviour and need a way to work around it.

I am using the xmldiff Python library to compare the following XML files:

left.xml

<root>
<node id="1"/>
<node id="2"/>
<node id="3"/>
<node id="4"/>
<node id="5"/>
</root>

right.xml

<root>
<node id="1"/>
<node id="2"/>
<node id="4"/>
<node id="5"/>
</root>

This produces the following differences:

UpdateAttrib(node='/root/node[3]', name='id', value='4')
UpdateAttrib(node='/root/node[4]', name='id', value='5')
DeleteNode(node='/root/node[5]')

Why does the library produce these unnecessary steps instead of something like this:

DeleteNode(node='/root/node[3]')

I am comparing with these parameters:

diffs = main.diff_trees(leftXML, rightXML, diff_options={"F":0.1, "ratio_mode":"accurate"})

Information and help appreciated.

nv+

Robert Clewley · Answer 1 · Thu Jul 08 2021 22:36:00 GMT+0800 (China Standard Time)

I am seeing something very similar when I compare a tree with a tree that's identical except one sub-branch has been removed. There should only be DeleteNode entries in the diff but there are unnecessary MoveNode or Update* entries. This happens for any of the ratio modes.

Lennart Regebro · Answer 2 · Wed Jan 11 2023 22:08:56 GMT+0800 (China Standard Time)

The nodes are so similar that the library decides node 3 and node 4 are likely the same node: they are extremely similar and in the same position. If you specify that id uniquely identifies the nodes, it behaves as you expect.

You also get the same effect if you increase -F to 0.8, in this case.
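To see why a threshold of 0.8 could make a difference, here is a small standard-library illustration. It is not xmldiff's actual code: it assumes the similarity score behaves like a `difflib` ratio computed over a text rendering of each node such as `id:3`, and `is_match` is a hypothetical helper that treats two nodes as the same when the score exceeds F.

```python
from difflib import SequenceMatcher

# ASSUMPTION: each node is rendered as "attribute:value" text before comparing.
left_text = "id:3"
right_text = "id:4"

# ratio() is 2 * matching characters / total characters = 2 * 3 / 8.
ratio = SequenceMatcher(None, left_text, right_text).ratio()
print(ratio)  # 0.75


def is_match(similarity, threshold):
    """Hypothetical helper: two nodes count as a match above the threshold F."""
    return similarity > threshold


print(is_match(ratio, 0.1))  # True: with F=0.1 the two nodes are paired up
print(is_match(ratio, 0.8))  # False: with F=0.8 they are kept apart
```

Under these assumptions, two nodes that differ only in a one-character id score 0.75, which clears F=0.1 easily but falls short of F=0.8.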

It would be possible to avoid this issue by first matching every node against every other node, to make sure we have identified the best possible match for each one, but that would be unacceptably slow on large files. Such a mode could be implemented as an option, though; contributions are welcome.

Lennart Regebro · Answer 3 · Thu Jan 12 2023 18:04:25 GMT+0800 (China Standard Time)

Yesterday I had an idea for a more careful matching algorithm that isn't exponentially slower. It seems to run at approximately half the speed of the normal matching algorithm.

I released version 2.6b1 with a --best-match parameter, which solves the issue.