Change Core.ComparisonSource.GetPathIndex() to return the Index inside ChildNodes instead of Children

Question

edxlhornung opened this issue 2 years ago · comments

edxlhornung commented 2 years ago

New Feature Proposal

Description

The path property on the Diff nodes returned by HtmlDiffer.compare() does not give a usable index for TextNode nodes.

Background

We are using this library to diff two HTML elements and style them so that users can see what changed between the two. Since the nodes inside the diffs returned by HtmlDiffer don't refer to the original HtmlDocument passed to the compare method, I need to use the path property to traverse the original document and add a style (a red strikethrough) to the nodes that the diff marks as MissingNodeDiff.

The path property works fine for all HtmlElement nodes, as it returns the correct index inside the list returned by Children. However, since TextNode instances are not present in the Children property, I must use ChildNodes to access my text nodes. In that case, the index in the path is not correct for all of them.

Example

Here is the path returned by the diff node. I want to access the text(8) element inside the p(0) node. Accessing the p(0) element is not a problem.

Here is what is returned by the Children property inside the p(0) node. We can see that no text nodes are present.

Here is what is returned by the ChildNodes property inside the p(0) node. We can see that there are text nodes. However, although there is a TextNode at index 8, it is not the correct node. The correct node is at index 10 (I compared the content of the node returned by the diff with the content of all nodes inside ChildNodes).

Suggestion

I would suggest changing the current implementation of ComparisonSource.GetPathIndex() to:

private static int GetPathIndex(INode node)
{
    var parent = node.Parent;
    if (parent is not null)
    {
        // Index within ChildNodes, which includes text and comment nodes,
        // not only elements.
        var childNodes = parent.ChildNodes;
        for (int index = 0; index < childNodes.Length; index++)
        {
            if (ReferenceEquals(childNodes[index], node))
                return index;
        }
    }
    throw new InvalidOperationException("Unexpected node tree state. The node was not found in its parent's child nodes collection.");
}
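
If the path indexes referred to ChildNodes as proposed, resolving a path against the original document would come down to indexing into the child nodes at each level. A minimal sketch of that walk is below. PathResolver and its childNodes accessor are hypothetical names for illustration, not part of AngleSharp or AngleSharp.Diffing, and the sketch is generic so that it does not depend on either library.

```csharp
using System;
using System.Collections.Generic;

public static class PathResolver
{
    // Walks down from root, taking child number indexes[i] at depth i.
    // childNodes is a caller-supplied accessor (hypothetical); it should return
    // ALL child nodes of a node, including text and comment nodes.
    public static T Resolve<T>(
        T root,
        IEnumerable<int> indexes,
        Func<T, IReadOnlyList<T>> childNodes)
    {
        var current = root;
        foreach (var index in indexes)
        {
            var children = childNodes(current);
            if (index < 0 || index >= children.Count)
            {
                throw new InvalidOperationException(
                    $"Index {index} is out of range; the node has {children.Count} child nodes.");
            }
            current = children[index];
        }
        return current;
    }
}
```

With AngleSharp, the accessor could probably be something like n => n.ChildNodes.ToList() (with System.Linq imported). Which node to start the walk from depends on how the compared markup was parsed, so that part is left to the caller.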

Also, not sure how it should be implemented, but the path property is a little hard to work with as it requires extracting the indexes from the path string with a regex.
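
For reference, this is roughly the kind of regex extraction I mean. It is only a sketch: DiffPathParser is a hypothetical helper, and it assumes the path is a sequence of name(index) segments such as p(0) and text(8), as in the example above. The separator between segments is ignored, so the exact path format does not matter as long as each segment has that shape.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class DiffPathParser
{
    // Matches one "name(index)" segment, e.g. "p(0)" or "text(8)".
    // The name may contain any characters except whitespace, '>', '(' and ')'.
    private static readonly Regex Segment =
        new Regex(@"(?<name>[^\s>()]+)\((?<index>\d+)\)", RegexOptions.Compiled);

    // Returns the segments in the order they appear in the path string.
    public static List<(string Name, int Index)> Parse(string path)
    {
        var result = new List<(string Name, int Index)>();
        foreach (Match match in Segment.Matches(path))
        {
            var name = match.Groups["name"].Value;
            var index = int.Parse(match.Groups["index"].Value, CultureInfo.InvariantCulture);
            result.Add((name, index));
        }
        return result;
    }
}
```

A structured representation of the path (for example a list of name/index pairs exposed next to the string) would make this step unnecessary.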

Egil Hansen · Answer 1 · Thu Feb 24 2022 02:10:13 GMT+0800 (China Standard Time)

Thank you for the suggestion. It's been a while since I was knee deep in the lib, so I cannot remember if there is a reason I did not use ChildNodes. Maybe @FlorianRappl has some insights here.

Either way, unfortunately I am very busy ATM at work, but if you want to experiment with this change yourself and make the suggested change, the current test suite should catch any regressions, as it's rather comprehensive.

Otherwise I'll be able to look at this at a later time.

edxlhornung · Answer 2 · Thu Feb 24 2022 03:51:10 GMT+0800 (China Standard Time)

Hi,

Thanks for the quick response. I'll definitely work on that and follow with a PR.

However, I will have to change the test suite for that method, as it is designed to skip elements without children (i.e., text nodes and paragraphs). I'll add a couple of test cases with TextNodes.

Florian Rappl · Answer 3 · Thu Feb 24 2022 06:23:24 GMT+0800 (China Standard Time)

ChildNodes contains all nodes (incl. comments, text nodes etc.) while Children only contains elements.

In a fixed match scenario you'd want to use ChildNodes, but if it's about equivalence, then Children plus something like InnerText for text nodes may be better. The reason is simple: comments etc. don't matter, and two text nodes may contain the same content as one text node. Furthermore, an equality comparison of two text nodes may be false even though their actual output is the same (e.g., due to how the spacing and special characters are processed).
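
To illustrate the equivalence point with a deliberately simplified sketch: the helper below compares text by content instead of node by node. TextEquivalence is a hypothetical name, the inputs are plain strings standing in for text node contents, and collapsing whitespace runs into a single space is only a rough approximation of how HTML processes spacing. It is not how AngleSharp.Diffing compares text.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TextEquivalence
{
    // Concatenates the text of adjacent text nodes, collapses every run of
    // whitespace into one space, and trims the ends.
    public static string Normalize(IEnumerable<string> textNodes)
    {
        var combined = string.Concat(textNodes);
        return Regex.Replace(combined, @"\s+", " ").Trim();
    }

    // Two sequences of text nodes are treated as equivalent when their
    // normalized content is identical, regardless of how many nodes there are.
    public static bool AreEquivalent(IEnumerable<string> left, IEnumerable<string> right)
    {
        return Normalize(left) == Normalize(right);
    }
}
```

Under this comparison, the two nodes "Hello " and "world" are equivalent to the single node "Hello   world", although a node-by-node equality check would report a difference.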