Shoobx / xmldiff

A library and command line utility for diffing xml

Store original element (attributes?) on the change objects

altendky opened this issue

It would be helpful to keep the original element, or at least its direct attributes, with the change objects. The diffing I just needed to do included ignoring various differences that are allowable. In this case just ignoring some of the generated diff items is sufficient for now (only UpdateAttrib so no issue even if interpreting moves etc in the modified diff). This saves me from considering how to embed this flexibility into the actual diff engine.

The change objects do include the node xpath for the left side but I suspect that can't be directly used since it would presumably be wrong if a previous action in the diff required a move or delete etc that affected the index of the present change node.

Would a PR be considered for this? I do see that the elements are mutated while building the diff but I didn't dig past that to see how I might capture the original element, or at least its attributes.

In case I'm heading down the wrong path, below are a couple of example elements whose resulting diff I would like to end up either matching or ignoring. The logic is that if the type has the value "42" then a modification of access from "r" to "rw" is allowed.

<point access="r" type="42">
<point access="rw" type="42">
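The rule above could be sketched as a filter over change objects, assuming each change carried a reference to the original element. That reference is the feature being requested here, not something xmldiff provides: `UpdateAttribWithOrigin`, its `original` field and `is_allowed` are hypothetical names, and the standard library's `xml.etree.ElementTree` stands in for the real trees.

```python
import collections
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a change object that also carries the original
# element. xmldiff's own UpdateAttrib only has node, name and value.
UpdateAttribWithOrigin = collections.namedtuple(
    'UpdateAttribWithOrigin', 'node name value original'
)


def is_allowed(change):
    """Return True if the change matches the 'type 42 may go r -> rw' rule."""
    original = change.original
    return (
        original is not None
        and change.name == 'access'
        and change.value == 'rw'
        and original.get('type') == '42'
        and original.get('access') == 'r'
    )


left = ET.fromstring(
    '<root><point access="r" type="42"/><point access="r" type="7"/></root>'
)
first, second = left  # the two <point> children

# Hand-written changes; a real diff would produce these.
changes = [
    UpdateAttribWithOrigin('/root/point[1]', 'access', 'rw', first),
    UpdateAttribWithOrigin('/root/point[2]', 'access', 'rw', second),
]

# Keep only the changes that the rule does not excuse.
unexpected = [c for c in changes if not is_allowed(c)]
print([c.node for c in unexpected])
```

Only the change on the element whose type is not "42" remains, which is the kind of filtering described in the opening comment.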

What about attrs instead of namedtuples? Aside from generally liking attrs it came to my mind as I thought about adding the extra field but probably not wanting to change the repr. I'm used to attrs but I wondered what it would look like with namedtuple so I went ahead and made up a comparison. Obviously I've already got the code for namedtuple now so... whatever, but I figured I'd ask anyways.

Cheers,
-kyle

`namedtuple` and attrs repr example
import collections

import attr


# ------ namedtuple

X = collections.namedtuple('X', 'this that')
def custom_repr(self):
    # Build "X(field=value, ...)" while leaving the 'that' field out.
    return '{}({})'.format(
        type(self).__name__,
        ', '.join(
            '{}={}'.format(
                field,
                getattr(self, field)
            )
            for field in self._fields
            if field not in ('that',)
        )
    )
X.__repr__ = custom_repr

x = X(1, 2)
print(x)


# ------ attrs

@attr.s
class Y:
    this = attr.ib()
    that = attr.ib(repr=False)

y = Y(1, 2)
print(y)

Storing the node wouldn't work, as the node is being modified while diffing. Storing a copy would increase memory requirements a lot.

In the case of UpdateAttribute, storing the original value is possible, but it becomes a strange special case. I'll have to think about that.

The NamedTuples at the moment act only as "storage", they don't have and do not need any additional functionality, so using attrs doesn't give us any benefit, except longer code and another dependency. ;-)

I do understand that I am working on a special case and we wouldn't want to infect anything with a custom mess. Though I wonder if diff-with-exceptions should actually be uncommon.

Yes, I ran into the modification issue. As to memory, this feature could be optional, otherwise just store None. I mentioned attrs because once you go defining a custom repr to not show this extra reference info the code no longer is shorter. :]

But, perhaps the solution lies more in #10? If while walking the list of changes I could apply them to (a copy of) my original left then I could reliably use the .node xpath to get the 'original' element. Albeit maybe with modified attributes, though I could skip attribute modification easily enough when applying the diff.

What about a map from the being-modified tree elements back to the actual original element in the passed-in tree? That wouldn't cost much. Just a reference and a namedtuple element in the change and the mapping dict can't be big compared to two entire trees in memory. Though I haven't looked into how xmldiff actually does the copying and modifications (beyond the one line assigning to attributes).
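A minimal sketch of such a mapping, assuming the working copy is made with `copy.deepcopy` and has the same shape as the original at the moment the map is built. How xmldiff actually makes its copy is not shown in this thread, so this uses the standard library's `xml.etree.ElementTree` and the name `to_original` is illustrative only.

```python
import copy
import xml.etree.ElementTree as ET

original = ET.fromstring('<root><point access="r" type="42"/></root>')

# Working copy that a diff engine would be free to mutate.
working = copy.deepcopy(original)

# Right after the copy both trees have the same shape, so walking them in
# document order pairs every working element with its original.
to_original = dict(zip(working.iter(), original.iter()))

point = working.find('point')
point.set('access', 'rw')  # simulate the diff mutating its copy

# The mapping still leads back to the untouched original element.
print(to_original[point].attrib)
```

The dict holds one reference per element, so its cost grows with the number of elements rather than with the size of their content.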

But maybe I misread and the entire thing passed in is mutated and the caller must have made a copy if they didn't want the diff to destroy their data...

No, xmldiff will make a copy.

We can store the original value on UpdateAttrib, UpdateTextIn and UpdateTextAfter. Original names for RenameNode and RenameAttrib also make sense, I guess.

Thanks for the clarification. In my case I need to check a different attribute, not the one described by the UpdateAttrib change. I was thinking of storing a reference to the actual original element on which the attribute being updated exists. I have some discomfort about creating this massive pile of references to an 'external' element tree, which leaves me still wondering if this should be an optional feature, but I'm not certain my discomfort is well founded.

Ah, yes, you are right, I get what you mean. I wonder if some sort of pre-stage where the data is massaged could be an option?

I don't want to make copies of every element when diffing, and if there was such a copy, then the next case is that a change is allowed if a sibling-element is an element set, etc. :-)

So the generic way to support this is indeed to make it easier to iterate over the changes with a current tree, i.e. issue #10 as you say.
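Iterating over the changes with a current tree could look roughly like the sketch below. It assumes, as suspected earlier in the thread, that each change's xpath refers to the left tree as it stands after the preceding changes. The namedtuples only mimic the shape of xmldiff's actions so the sketch stays stdlib-only, and `resolve`, `parent_of` and `walk` are hypothetical helpers that handle just these two change types.

```python
import collections
import copy
import xml.etree.ElementTree as ET

# Stand-ins shaped like diff actions; not imported from xmldiff.
DeleteNode = collections.namedtuple('DeleteNode', 'node')
UpdateAttrib = collections.namedtuple('UpdateAttrib', 'node name value')


def resolve(root, xpath):
    """Resolve a simple absolute path such as /root/point[1]."""
    _, _, rest = xpath.strip('/').partition('/')
    return root.find('./' + rest) if rest else root


def parent_of(root, element):
    """Find the parent of element by scanning the tree (fine for a sketch)."""
    for candidate in root.iter():
        if any(child is element for child in candidate):
            return candidate
    return None


def walk(left, changes):
    """Yield (change, element) pairs, then apply each change to a copy."""
    current = copy.deepcopy(left)
    for change in changes:
        element = resolve(current, change.node)
        yield change, element  # element as it is just before the change
        if isinstance(change, DeleteNode):
            parent_of(current, element).remove(element)
        elif isinstance(change, UpdateAttrib):
            element.set(change.name, change.value)


left = ET.fromstring(
    '<root><point access="r" type="7"/><point access="r" type="42"/></root>'
)
changes = [
    DeleteNode('/root/point[1]'),
    UpdateAttrib('/root/point[1]', 'access', 'rw'),
]

seen = [(type(c).__name__, e.get('type')) for c, e in walk(left, changes)]
print(seen)
```

Both changes use the same xpath, yet the second one resolves to the element with type "42" because the first change has already been applied to the working copy.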

I don't think what I suggested requires any more copying than normal, assuming I understood correctly that the two trees passed in to xmldiff are not mutated. If so, they are the original and you hold a reference to the related element(s?) in them. The only additional storage is the attribute on the change object holding the reference. The actual element already exists. Doing this gives full context on the original left side from the point of the change.

But there may be something I'm missing about the diff process. It may be needed to make a copied-element-to-be-modified-by-diff -> original-element mapping to make setting the proper original element on the change easy.

I'm willing to try a PR on this, but I'd like to have a sense that I'm going in a plausible direction first.

The xmldiff.main.diff_files() parameters are not mutated? If true then there's no need to copy anything. Just reference the original elements in the original passed in tree(s)? It's an extra attribute on each change object, no extra objects.

I can plan to do a PR for this, but I would like to make sure I'm heading in a sensible direction.

I think a possible way to do this could be with events. You wouldn't get a reference to the original element, but to the current element in the left tree, that would be doable.

Perhaps I just need to dig into the code. It seems I'm missing something important about the implementation. Thanks for all the discussion, I'll see what I can learn.

Since you mentioned issue #10, I just wanted to note that I have added a tool to patch trees. Not sure it would help in this case, though. I'm leaning more towards this being a case for massaging the data ahead of the diff.
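Massaging the data ahead of the diff might look like the following sketch, which rewrites the allowable difference to a single canonical value on copies of both trees before they are compared. The function `normalize` is a hypothetical helper and the standard library's `xml.etree.ElementTree` stands in for the trees that would be handed to the diff.

```python
import copy
import xml.etree.ElementTree as ET


def normalize(tree):
    """Return a copy in which the allowable difference has been erased."""
    result = copy.deepcopy(tree)
    for point in result.iter('point'):
        # Rule from this thread: for type "42", access "r" may become "rw".
        if point.get('type') == '42' and point.get('access') in ('r', 'rw'):
            point.set('access', 'rw')
    return result


left = ET.fromstring('<root><point access="r" type="42"/></root>')
right = ET.fromstring('<root><point access="rw" type="42"/></root>')

# After normalization the two trees serialize identically, so a diff of the
# normalized copies would report nothing for these elements.
same = ET.tostring(normalize(left)) == ET.tostring(normalize(right))
print(same)
```

Normalizing both sides treats "r" and "rw" as interchangeable in either direction, which is looser than the one-way rule in the original request. Keeping the direction would need the per-change filtering discussed above.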