CADbloke / daisydiff

Automatically exported from code.google.com/p/daisydiff

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Invalid HTML output

GoogleCodeExporter opened this issue · comments

What steps will reproduce the problem?
1. Create 2 identic HTML file that contain either of:
-- a list (ordered or unordered)
-- a table
2. delete a word from that element and change the style of the word's 
immediate parent in the "new" file
3. watch the result

E.g.:
<ul>
  <li>Some check</li>
</ul>
transformed into
<ul>
  <li style="text-decoration:underline">check</li>
</ul>
will result in the html where "Some" is placed between <ul> and <li>

What is the expected output? 
valid HTML
What do you see instead?
immediate text content in tags that do not permit it 
(<tr>, <table>, <ul>, <ol>, <dl>)
causes the deleted text to be displayed outside of the parent (and common 
parent) HTML element.

What version of the product are you using? 
1.0
On what operating system?
Window XP


Please provide any additional information below.

I have a question - what was the idea with the splitting? 
Was it meant as followed?
>>>
Because the parent of the deleted text node doesn't exist anymore (though 
it might not be true), but the surrounding text nodes did have common 
ancestor, we should add the "missing" child that contained the deleted 
node previously
<<<

I will try to fix this, but I would like to know the answer to the 
question above, so I won't break anything accidentally.

The place where the bug is happening:
org.outerj.daisy.diff.html.TextNodeComparator.markAsDeleted(...)
the code goes like the following:
...
           if (prevResult.getLastCommonParentDepth() > 
               nextResult.getLastCommonParentDepth()) {

                // Inserting at the front
                if (prevResult.isSplittingNeeded()) {
                    prevLeaf.getParent().splitUntill(
                            prevResult.getLastCommonParent(), 
                            prevLeaf,
                            true);
                }
                prevLeaf = deletedNodes.remove(0).copyTree();
                //this is the bug, as not all common parents allowed
                //to have text content. This causes the deleted text to 
                //be displayed outside of the structure. 
                //The splitting is never used here.
                prevLeaf.setParent(prevResult.getLastCommonParent());
                prevResult.getLastCommonParent().addChild(
                        prevResult.getIndexInLastCommonParent() + 1,
                        prevLeaf);

            } else if (prevResult.getLastCommonParentDepth() < 
                       nextResult.getLastCommonParentDepth()) {
 ...
-------------------------
attached are the samples of the bug (notice samples validate as strict 
html: http://validator.w3.org/)

Original issue reported on code.google.com by anastass...@businesswire.com on 14 Apr 2009 at 6:21

Attachments:

uhm, this bug happens if the element that had his style changed and some text 
deleted was a FIRST child or its parent.

That is if this is a first cell in a table row, or a first list item in a list.

In the case where it is not a first child, the deleted text sometimes is placed 
in 
the previous child (if the previous child attributes are the same as they were 
for 
the child with the deleted text in the "old version")

Original comment by anastass...@businesswire.com on 14 Apr 2009 at 6:51

  • Added labels: ****
  • Removed labels: ****
The reason for this splitting is that "Some" never was in an underlined <li>, 
so it's
not honest to render it underlined. Ofcourse this is what causes all the 
problems.

The reason why "Some" isn't in a <li> is that the <li> it stems from is not
completely copied to the new document. DaisyDiff then assumes that the <ul>, 
the next
best thing, is the most accurate place to insert it.

Some people have suggested to underline the "Some" anyway, maybe annotating it.
Alternatively, we could try to make the splitting smarter, by checking if the 
parent
tag can have the child or by forcing in an <li> every time we want to add text 
to a <ul>.

Original comment by guy...@gmail.com on 17 Apr 2009 at 10:37

  • Changed state: Accepted
  • Added labels: ****
  • Removed labels: ****
I was trying to concentrate on table diff enhancement, but this issue really 
gets in 
the way. So I will try to fix this first.
The splitting in the structures like ordered list, tables and definition list 
can 
really confuse the user (like an ordered list has changed its attributes, but 
retained its 6 list items numbered 1-6 , however some piece of text was 
deleted. The 
splitting even if you put it in a <li> will add another item somewhere in the 
middle 
and all unchanged list items below it will be re-numbered differently - not 
good).

Other dangerous moments with the splitting:
If the style of the split element uses positioning attributes, the 2 split 
pieces 
will be on top of each other, and one might not be visible at all.

On the other hand splitting allows you to put the deleted text exactly in the 
spot 
where it have been (text-wise).

Full rejection of the splitting will cause the deleted text to be "away" from 
the 
place it was deleted from, which is also kind of confusing.

What do we do? :)

One way is to "avoid" dangerous splitting. On the way from the child to the 
common 
ancestor only "safe" structures are split. "Unsafe" structures are kept intact 
and 
the deleted text is placed right after the unsplit structures. This means from 
the 
common ancestor to the first "unsafe" structure we split and not past that 
(however 
we don't process styles at this moment). So at first we concentrate just on the 
invalid HTML structure (all kind of lists and tables). Since with the 
positioning 
danger we still get valid HTML (though not very usable), we can file it as a 
separate issue and fix later.
Example1: 
one list of ancestors (changed are showed in angle brackets):
body, div, <div>, <ul> <li> <--  <deleted text>
another list:
body, div, <div>, <ul> <li> <-- retained text
<div> will be split, <ul> and <li> will not be split. deleted text will be 
place 
right after </ul> (in another list probably?)

Example2:
body, <paragraph> <-- <deleted text>
body, <table>, <tr>, <td> <-- retained text (was moved into the table)

nothing will be split, deleted text will be place right after the table.

Example3:
body, ul, li, <div> <p> <-- <deleted text>
body, ul, li, <div> <p> <-- retained text

everything will be split.

I can't come up with a good annotation. If you have suggestions - please share.
Also, please let me know if you have any problem with this fix.

Thank you!

Attached is the table madness sample. This fix should make the result much 
prettier.

Original comment by anastass...@businesswire.com on 25 Apr 2009 at 12:21

  • Added labels: Priority-High
  • Removed labels: Priority-Medium

Attachments:

Proposed fix - see the attached source changes and the new result for the 
samples 
above

Has side effects (see group discussion)

The TableStatics.java file will be used for proposed table diff enhancement 
later.
Now it only provides few constants for this current fix.

Original comment by anastass...@businesswire.com on 29 Apr 2009 at 9:41

  • Changed state: Started
  • Added labels: ****
  • Removed labels: ****

Attachments: