Shoobx / xmldiff

A library and command line utility for diffing xml

library doing multiple UpdateAttrib and one DeleteNode instead of just one DeleteNode

nicevibesplus opened this issue

Hey, I ran into this and need a way to work around it.

I am using the xmldiff Python library to compare the following XML files:

left.xml

<root>
<node id="1"/>
<node id="2"/>
<node id="3"/>
<node id="4"/>
<node id="5"/>
</root>

right.xml

<root>
<node id="1"/>
<node id="2"/>
<node id="4"/>
<node id="5"/>
</root>

This produces the following differences:

UpdateAttrib(node='/root/node[3]', name='id', value='4')
UpdateAttrib(node='/root/node[4]', name='id', value='5')
DeleteNode(node='/root/node[5]')

Why does the library emit these unnecessary steps instead of producing something like this:

DeleteNode(node='/root/node[3]')

I am comparing with these parameters:

diffs = main.diff_trees(leftXML, rightXML, diff_options={"F":0.1, "ratio_mode":"accurate"})

Information and help appreciated.

nv+

I am seeing something very similar when I compare a tree with a tree that's identical except one sub-branch has been removed. There should only be DeleteNode entries in the diff but there are unnecessary MoveNode or Update* entries. This happens for any of the ratio modes.

This happens because the nodes are so similar that the library decides node 3 and node 4 are likely the same node: they are extremely similar and in the same position. If you specify that id uniquely identifies the nodes, it behaves as you expect.

You also get the same effect if you increase -F to 0.8, in this case.
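
The two workarounds above can be sketched as follows. This is a minimal sketch, not a verified reproduction: it assumes the `uniqueattrs` and `F` keys of `diff_options` are forwarded to the differ, and it uses `main.diff_texts` on inline strings instead of the `diff_trees` call from the report. The helper names `diff_with_unique_id` and `diff_with_higher_threshold` are made up for illustration.

```python
LEFT_XML = """<root>
  <node id="1"/>
  <node id="2"/>
  <node id="3"/>
  <node id="4"/>
  <node id="5"/>
</root>"""

RIGHT_XML = """<root>
  <node id="1"/>
  <node id="2"/>
  <node id="4"/>
  <node id="5"/>
</root>"""

# Option 1: declare that the "id" attribute uniquely identifies a node.
UNIQUE_ID_OPTIONS = {"uniqueattrs": ["id"]}

# Option 2: raise the similarity threshold, as suggested for this input.
HIGHER_F_OPTIONS = {"F": 0.8}


def diff_with_unique_id(left=LEFT_XML, right=RIGHT_XML):
    """Diff two XML strings, treating `id` as a unique node identifier."""
    # Third-party dependency (pip install xmldiff), imported here so the
    # option dictionaries above can be inspected without it.
    from xmldiff import main

    return main.diff_texts(left, right, diff_options=UNIQUE_ID_OPTIONS)


def diff_with_higher_threshold(left=LEFT_XML, right=RIGHT_XML):
    """Diff two XML strings with a stricter match threshold (F=0.8)."""
    from xmldiff import main

    return main.diff_texts(left, right, diff_options=HIGHER_F_OPTIONS)
```

Printing each action returned by `diff_with_unique_id()` lets you check whether the edit script shrinks to the single delete you expected. The right `F` value depends on how similar your nodes are, so 0.8 is specific to this input.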

It would be possible to avoid this issue if we first matched every node against every other node, to make sure we have identified the best possible match for each one, but that would make it unacceptably slow on large files. Such a mode could be implemented as an option, though; contributions are welcome.
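
To make the trade-off concrete, here is a toy, standard-library-only sketch. It is not xmldiff's actual algorithm: nodes are plain strings, similarity comes from `difflib.SequenceMatcher`, and the names `first_fit`, `all_pairs_best` and the `THRESHOLD` value are all illustrative assumptions. It contrasts taking the first "good enough" match with scoring every pair before committing.

```python
from difflib import SequenceMatcher

LEFT = [f'<node id="{i}"/>' for i in (1, 2, 3, 4, 5)]
RIGHT = [f'<node id="{i}"/>' for i in (1, 2, 4, 5)]
THRESHOLD = 0.5  # arbitrary cut-off for this toy; not on xmldiff's F scale


def similarity(a, b):
    """Return a 0..1 similarity score between two serialized nodes."""
    return SequenceMatcher(None, a, b).ratio()


def first_fit(left, right, threshold=THRESHOLD):
    """Pair each left node with the first unused right node that is similar enough."""
    used, pairs = set(), []
    for l_node in left:
        for j, r_node in enumerate(right):
            if j not in used and similarity(l_node, r_node) >= threshold:
                used.add(j)
                pairs.append((l_node, r_node))
                break
    return pairs


def all_pairs_best(left, right, threshold=THRESHOLD):
    """Score every left/right pair first, then accept pairs from most to least similar."""
    scored = [
        (similarity(l_node, r_node), i, j)
        for i, l_node in enumerate(left)
        for j, r_node in enumerate(right)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    used_left, used_right, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break
        if i not in used_left and j not in used_right:
            used_left.add(i)
            used_right.add(j)
            pairs.append((left[i], right[j]))
    return pairs
```

With these inputs, `first_fit` pairs id 3 with id 4 and id 4 with id 5, leaving id 5 unmatched, which mirrors the two updates plus one delete from the report. `all_pairs_best` pairs only identical nodes and leaves id 3 unmatched, at the cost of scoring all len(left) × len(right) pairs.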

Yesterday I had an idea for a more careful matching algorithm that isn't exponentially slower. It seems to run at approximately half the speed of the normal matching algorithm.

I released version 2.6b1 with a --best-match parameter, which solves the issue.
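
From Python, the equivalent of the command-line flag would presumably look like the sketch below. It assumes that 2.6b1 and later accept a `best_match` key in `diff_options`; that spelling is inferred from the flag name rather than stated in this thread, so check it against your installed version. The helper name `diff_best_match` is made up for illustration.

```python
LEFT_XML = """<root>
  <node id="1"/>
  <node id="2"/>
  <node id="3"/>
  <node id="4"/>
  <node id="5"/>
</root>"""

RIGHT_XML = """<root>
  <node id="1"/>
  <node id="2"/>
  <node id="4"/>
  <node id="5"/>
</root>"""

# Assumed Python-side spelling of the --best-match command-line flag.
BEST_MATCH_OPTIONS = {"best_match": True}


def diff_best_match(left=LEFT_XML, right=RIGHT_XML):
    """Diff two XML strings using the more careful (slower) matching mode."""
    # Third-party dependency; needs a release that includes best-match support.
    from xmldiff import main

    return main.diff_texts(left, right, diff_options=BEST_MATCH_OPTIONS)
```

Older releases that predate this option would be expected to reject the unknown key, so a `TypeError` here most likely means the installed version is too old.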