htacg / tidy-html5-tests

Regression testing files and tools for HTML Tidy.

Need to check line ending for tests 500236 and 661606

geoffmcl opened this issue

These two tests are ok in windows, but for some strange reason when cloned into linux, the expected files retain the windows line ending?

This means the diff compare fails in unix, and probably mac, unless an option is added to ignore whitespace, like -w, which should not be required.

For the moment these tests have been moved to the cases/specials folder, and removed from the tests manifest cases\testbase\_manifest.txt.

We need to discover why git does this, and somehow fix it.

A less liked alternative would be to instruct tidy, through its config, to output a matching CRLF - see newline - but this should not be required!

Needless to say, at some point they must be recovered from special and added back to the testbase manifest.

BAH! I just added back test 500236 in a test1 branch, but again when I pulled it in linux the html still had windows line endings!!! How can this happen???

So in linux I used dos2unix to convert it. And now in unix the test passes cleanly. Committed and pushed.

But when I pull it in windows, it now has mixed line endings!!! Well, there is one CR, but most lines now have the linux LF line ending...

So would have expected a diff, but it seems my WIN32 port of GNU diff.exe does not see this difference, so at least it passes. But it shouldn't!

Really need help to understand what is happening here.

This does not seem to happen with hundreds of other files. Normally a pull in windows and they have CRLF, and a pull in linux and they will have linux line endings, LF.

My windows git config -l shows core.autocrlf=true... while in linux it is false...

What is happening on this particular file???

NEED HELP

I wonder if git is getting confused by the &#13;&#10; string in the file? If I download the .zip on Windows, I can see that it's stored in the repo with newline characters. I'll have to try git on my own Windows machine tonight to double-check its behavior.

By the way, that test is in the XML tests, too.

@balthisar I have not yet experimented with the XML tests, but will get to that...

Just doing some checking of the repo clone versus the repo zip... to see if there are any clues there...

Windows - repo branch test1 clone

I now note in cases\testbase-expects, of the 445 files, 443 have windows CRLF. There are two with unix LF, case-500236.html and case-647255.html. So why are these two different?

And also checking cases\testbase, of the 309 files, 307 have windows CRLF. And two other files, case-427633.html and case-647255.html, have unix LF! Of course we are not doing a compare using this folder, so that is not particularly a problem. But again, why are these 2 different?

Windows - download tidy-html5-tests.zip

Now the ZIP shows a different picture altogether!

Of the 445 files in cases\testbase-expects, only 6 have windows CRLF, and the balance 439 have UNIX LF!

Linux - download tidy-html5-tests.zip

Of the 445 files in cases\testbase-expects, only 6 have windows CRLF, and the balance 439 have UNIX LF!

WOW, ok that is interesting. That is the same as the windows zip!

Linux - repo branch test1 clone

cases\testbase-expects - Of 445, 439 have unix LF, 6 have windows CRLF, again the same as the zips!
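For anyone wanting to repeat this survey, here is a minimal sketch of how the counts above could be gathered. It reads raw bytes so nothing translates the line endings on the way in. The function names are made up for illustration, and it assumes single-byte encodings: a UTF-16 file would be misreported because its CR and LF are separated by NUL bytes.

```python
from collections import Counter
from pathlib import Path


def classify_eol(data: bytes) -> str:
    """Return 'crlf', 'lf', 'cr', 'mixed' or 'none' for a raw byte string."""
    crlf = data.count(b"\r\n")
    lone_lf = data.count(b"\n") - crlf   # LF not preceded by CR
    lone_cr = data.count(b"\r") - crlf   # CR not followed by LF
    kinds = [name for name, n in (("crlf", crlf), ("lf", lone_lf), ("cr", lone_cr)) if n]
    if not kinds:
        return "none"
    return kinds[0] if len(kinds) == 1 else "mixed"


def survey(folder: str) -> Counter:
    """Count the files directly inside `folder` by line-ending style."""
    return Counter(
        classify_eol(p.read_bytes()) for p in Path(folder).iterdir() if p.is_file()
    )
```

Calling `survey("cases/testbase-expects")` in a clone and in an unzipped download should give numbers that can be compared directly.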

Summary

It is my understanding, and someone please correct me if I am wrong, that for text files, repositories do not keep the original line endings, just the lines themselves. And when you clone a repository these text files will be given a line ending to suit your git configuration. In my case they will get windows CRLF in windows, and linux LF in linux. Or conversely the repo only keeps unix line endings...

This latter may be the case since I seem to remember reading somewhere that with core.autocrlf=true in windows, git will upload files to the repo using only the unix LF line ends, and download with windows CRLF line endings... This is further borne out by the fact that if I create in windows a file with unix LF only, which only some editors can do, on commit git will make mention of that file's line endings not being CRLF. And the next time I load that file in the MSVC IDE editor, it will also immediately offer to 'fix' this file... So in windows it is necessary to have all text files with native windows CRLF.

Now this is completely different if the file is marked as binary. A binary file has no lines. It is just a raw blob. And when you change just one byte of that raw blob, the whole file is replaced in the repo. There can be no line-by-line compare to keep just a diff.

Now some assumptions. When you ask for a ZIP, you get the unix default line endings. This is why the zip in windows shows the same mixture of line ending as the zip in linux, and why the linux clone exactly mirrors the zip contents. This assumes of course you are using an unzip that does not mess with line endings. Like in unix you do not add -a to auto-convert any text files!

So based on the ZIP, and only dealing with cases\testbase-expects, since this is the compare, of the 445, there seem to be 6 files that have windows CRLF - case-1067112.txt case-434940.html case-473490.txt case-543262.txt case-586555.txt case-649812.html - Are these the ones that are marked binary? Why these 6?

But all that neat reasoning breaks down when I look at my windows clone! Of the 445, only 2 do not have CRLF - case-500236.html and case-647255.html - which are not in the 6!

And I read more that this can be changed using the text attribute in the .gitattributes file. And in this tests repo we have one that already deals with the line endings of bat and cmd files. Is there more that can be done with this? But somehow this does not feel like a good solution... and have not seen any problem with bat/cmd files...

The very important aim is that all tests pass in all OS! So that any diff can be treated as a problem to be looked at carefully, with the aim of updating the appropriate expects file. Otherwise mainstream testing becomes murky...

If there are exceptions, like the one known test 431895, that contains gnu-emacs: yes, which will always have different path separators unix/win32, and some other known OS/encoding differences, then these must be the subject of a different type of test... Or an appropriate fix in Tidy code of course...

But this issue is about getting these two, and maybe some others, into the mainstream testing, if at all possible...

Anyway, still seek full understanding... and a solution...

You've made me realize a problem with my Windows testing strategy: I test in a Windows VM from files shared from my Mac, meaning that they always have LF, even in Windows. I'll have to check out the repo in Windows from now on. Incidentally that's why .gitattributes forces CRLF; it's so I can edit batch files in my Mac editor instead of taking up screen space inside the VM with my favorite Windows editor. I will have to stop that practice! (I think I can add a second monitor to the VM.)

Ah, but our line ending bugs! For three files you mentioned, I think two of them are explainable, and one is still a complete mystery. EDIT: Actually I think all three are explained.

647255

This is a UTF-16LE file, so automatically git treats this as binary. Confirming:

jderry$ git diff ad4d --numstat HEAD -- case-647255.html 
-       -       cases/testbase-expects/case-647255.html

For a text file, the two columns are the number of lines added and the number of lines deleted between the two commits. When both show a dash, it indicates Git thinks the file is binary.
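If this check has to be run over many files, the numstat output is easy to read mechanically. The following is only a sketch with a made-up function name: it takes one line of `git diff --numstat` output and reports whether Git showed dashes in both count columns.

```python
def parse_numstat_line(line: str):
    """Parse one line of `git diff --numstat` output.

    Returns (added, deleted, path, is_binary). Git prints '-' in both
    count columns when it treats the file as binary.
    """
    # Real output is tab-separated; splitting on any whitespace also
    # accepts the space-padded form that appears when it is pasted.
    added, deleted, path = line.strip().split(None, 2)
    if added == "-" and deleted == "-":
        return None, None, path, True
    return int(added), int(deleted), path, False
```

Feeding it the two kinds of line seen in this thread gives `is_binary` as True for case-647255.html and False for case-427633.html.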

I think this is explained, and we can comb the net for a solution.

427633

This one is interesting. First does Git treat it as binary?

$ git diff ad4d --numstat HEAD -- case-427633.html 
10      0       cases/testbase-expects/case-427633.html

No; Git knows it's text. However I'm guessing that Git doesn't know what to do with the line endings because there is extra garbage in the results file starting at position 0xD3: there's a LF SP SP (0x0A2020) sequence between "a" and "DOS."

Looking at the source file, it might be mangled, too. There's a nice CR right where it's supposed to be at 0xAE, a nice LF where it's supposed to be at 0xC9, but the DOS line ending at 0xE3 has CRLFLF (0x0D0A0A).

Incidentally if I look at the original file, it has the spurious extra LF in the DOS ending, too.
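The offsets quoted above came from looking at a hex dump. A small sketch like the one below, with a hypothetical function name, lists every line terminator in a file together with its byte offset, which makes a stray CR or a doubled LF easy to spot. It works on raw bytes and assumes a single-byte encoding.

```python
def eol_offsets(data: bytes):
    """Yield (offset, kind) for each terminator; kind is 'CRLF', 'CR' or 'LF'."""
    i = 0
    while i < len(data):
        if data[i:i + 2] == b"\r\n":
            yield i, "CRLF"
            i += 2          # skip both bytes of the pair
        elif data[i] == 0x0D:
            yield i, "CR"   # CR that is not followed by LF
            i += 1
        elif data[i] == 0x0A:
            yield i, "LF"   # LF that is not preceded by CR
            i += 1
        else:
            i += 1


# Usage sketch (the path is only an example):
#   from pathlib import Path
#   for offset, kind in eol_offsets(Path("case-427633.html").read_bytes()):
#       print(hex(offset), kind)
```

A DOS ending followed by an extra LF, as described for the source file, would show up as a CRLF entry immediately followed by an LF entry two bytes later.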

This causes me to ask two more questions:

  • Is the extra LF in the source file a condition that we're supposed to fix?
  • If it is a condition we're supposed to fix, then is Tidy fixing it correctly?

I think Tidy has a bug, and maybe the input file has a bug, too. Git simply doesn't know what to do when there are extra characters.

500236

$ git diff ad4d --numstat HEAD -- case-500236.html 
5       0       cases/testbase-expects/case-500236.html

Indicates 5 lines added, 0 deleted between ad4d and HEAD, so a text file.

However look at character position 0xFA… why is there a CR/x0D here? And why is it followed by two spaces?

Going back to the source which I mistakenly looked at yesterday, the entities &#13; and &#10; look like the first is being output literally and the second is dropped or turned into a space.

This is probably a Tidy bug, and it's probably causing git to not know what to do with the file.

So what I've done is pushed result-cleaned--500236.html and result-cleaned-427633.html (oops, one of the names is mangled) with the questionable bytes deleted. I've pushed these to your test1 branch in the cases/ directory.

No need to keep these around, but if I were a betting man, I'd wager a lot that they're going to have correct line endings on Windows when you pull the repo.

I've also added a testutf branch with a slight change to .gitattributes; let's see if it takes care of that UTF-16 file or not. In this case it's targeted towards the one specific UTF-16 file we know about; it may be safe to apply this to all html, xhtml, and xml files, but let's see if this works first.

I'm not sure how well gnu diff will work with UTF16, though.

I can't install Git on my current machine, but will clone and test these repos tonight and see if there's any effect.

@balthisar thanks for digging into this mystery ;=))

Wow, you introduce another level of complexity - cross platform testing - like comparing the expected in one OS with the results in another, as you seem to have done using a VM - OS X to WIN32. Well, if this is likely to be done by others maybe we should add something to the readme like - "cross platform compare testing can yield ambiguous results, mainly due to different line endings!"

So first to look at and deal with the 2 new test numbers individually, only really suggested by examining the zip, but are they a problem?

647255 - UTF-16LE

Thanks for showing me a method to test if git considers a file as binary or not. And it seems the presence of a BOM can cause git to consider it binary. Good to know...

But this test passes in Windows and Linux. It seems both my WIN32 port of diff 2.7, and my linux diff 3.3, have no problem dealing with this BOM, with or without the -a parameter. The input, expects and results have this BOM.

Is there some problem in OS X?

If not then we can forget this one. And thus forget your new testutf branch, which I did not try...

If yes, then we need to exclude it for now, until we find a solution, maybe offered by your testutf branch. Advise.

427633 - Mixed EOL

Yes, this was inherited with mixed line endings, specifically as a test of tidy handling of such a mixture, so should not be cleaned, sanitised...

But again this test passes in Windows and Linux.

Is there some problem in OS X?

Likewise, if not then we can forget this one.

If yes, then we need to exclude it for now, until we find a solution, but that should not be by cleaning the file since that is the reason for the test.

And if there continues to be a problem with either of these two tests, in OS X, then maybe they can be a separate issue(s)? Advise.

So that gets me to the two tests, the subject of this issue

  • 500236 another mixed, and
  • 661606 with shiftjis

500236 - Mixed EOL

Since EOL is not the subject of the test, I would have no problem with fixing this corruption. As far as I can tell back in SF cvs it was not so corrupted!

And that is what I really tried to do in the test1 branch, but it failed!

But I now think maybe git keeps a file memory, and thus even though I thought I had fixed it, git still generated a mess, pulling expects as a LF in windows! Yuk! You showed git does not seem to consider it binary... but once given different line ending, somehow, git sort of holds on to them...

But as you also showed, and predicted, giving it a new name, result-cleaned--500236.html, means it correctly pulls as CRLF in windows, and as LF in unix, as it should...

So I propose adding a new test 500236-1, adding your cleaned version, adjusting the manifest, and this should be fixed. And we just completely delete the current 500236. This could be directly done in master.

What do you think?

661606 - shiftjis

As we know this uses a config of char-encoding: shiftjis, and contains some 'strange' characters, so if possible I would like to get it back into the mainstream... it is the only example we have of its kind...

I will find time to examine this further, and hope you can also cast an eye on it... thanks...

@geoffmcl, I wasn't clear enough on the tests I reported on. I was referring to the output files; I think they're bad and reflect bugs in Tidy. The reason the renamed files work in the repository now is because I removed the "bad" bytes from the output files. I suspect that Git sees the spurious CR's and then refuses to change the line endings. By editing them out manually, Git knows what to do with them.

A quick recap:

647255

I've read that different versions of Gnu diff have trouble with UTF16. It looks like it's working for both of us on Windows, Linux, and Mac, so if that's the case then no need to change the .gitattributes.

427633

Have another look at my description. I think this is a Tidy bug. Why is Tidy introducing a line feed between the letters "a" and "DOS" in the output file? The original input only has a single space. And given this was a Windows generated file, there shouldn't be any LF in the output, but especially not in the location it's at.

I suspect that Git doesn't convert the line endings because it sees this CR and refuses to touch the line endings. The reason the "renamed" file converts line endings is not because I renamed it, but because I edited out the stray CR.

500236

Really, the same as above. The output has stray CRs that I feel shouldn't be there. I think this is a Tidy bug.

Sure, we could accept these as part of the regression test suite simply because Tidy is acting the same on all three platforms.

661606

If I have time I'll take a look at it today. My time has been split lately!

@balthisar sorry if you think I did not read your comments correctly... my bad?

And we do appreciate that this is only when you have the time to donate to tidy... As always RL must come first!

Tidy Bugs

While I agree the tests might expose a tidy bug, just trying to suggest that they should be filed in tidy-html5 issues.

For example, if I add a simple test, say in-329-4.html, with the content -

<p>A<em>
B</em></p>

What should we add as the testbase-expects output?

With tidy 5.1.45 we would have to add a bad output of <p>A<em>B</em></p>, ie no space after the A, for the test to pass, because tidy presently has a bug that loses the newline after the <em>!

Now when that bug is fixed in say 5.1.46++ the test would fail. But this would indicate not a regression, but an improvement, an enhancement, a fix, and we have to update the expects output to pass -

<p>A <em>B</em></p>

And I have a tidy5-5.1.44issue329.exe, from the issue-329 branch, that will do this, but may have other problems, so needs more testing, fixing...

All I am saying is that if tidy has a bug, then in this testing repo, if we choose to add that test, then we need to also provide an output that will not be different to what that version of tidy outputs!

And at the same time file a tidy-html5 issue looking at this bug.

I hope we are on the same page here?

647255 - UTF-16

ok, if a person finds a diff that has trouble with UTF-16 encoding, then I hope they will speak up about it, raise an issue here, and we should then address the problem.

Since this test seems to pass in all our tests, admittedly only across some 6 systems, it seems we agree there is no need to change anything at this time.

427633 - mixed EOL input

Has mixed EOL input, but correct output as far as I can see...

Why is Tidy introducing a line feed between the letters "a" and "DOS"

Well, it would be a bug if it didn't! This is normal line wrap at about 68 characters...

The file testbase-results\case-427633.html, generated a little while ago, using 5.1.45-Exp2, which should be equivalent to 5.1.45 in this respect, in my windows system the output has only CRLF, followed by 2 indent spaces, included between letters "a" and "DOS", and I suspect this is due to line wrap.

Yes, adding -w 0 and I get one line <p> to </p>, so I am afraid I do not yet see a tidy bug here!

And in linux, likewise have only a LF 20 20, due to line wrap... Where is the bug?
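To picture why a newline plus two spaces appears in the middle of a sentence, here is a rough analogy using Python's textwrap. This is not Tidy's wrapping algorithm: the sample paragraph is invented, and the 68 column width and two space continuation indent are simply taken from the description above.

```python
import textwrap

# Invented sample text, long enough to need wrapping at 68 columns.
SAMPLE = ("<p>This hypothetical paragraph is long enough that it must "
          "wrap somewhere before a DOS line ending.</p>")


def wrap_paragraph(text: str, width: int = 68, indent: str = "  ") -> str:
    """Wrap `text` at `width` columns, indenting continuation lines.

    A width of 0 means no wrapping, mirroring what `-w 0` is described
    as doing above.
    """
    if width <= 0:
        return text
    return textwrap.fill(text, width=width, subsequent_indent=indent)
```

With the default width the result contains a line break followed by two spaces, which in a hex dump is the 0A 20 20 sequence (or 0D 0A 20 20 on Windows). With `width=0` the paragraph stays on one line.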

As suggested, I do want to keep the mixed nature of the input file, in testbase, since that seems the reason for the test.

As usual maybe I am still missing something?

500236 - entities

Ok, my previous blathering was wrong! Sorry. Right from the beginning this test has had two entities &#13;&#10; in the input source, which you mentioned earlier, and I had been overlooking... Now I think they cause the problem.

Now with that input tidy will rightly, or wrongly!, output a single CR, as -xml or html, word-2000 or not. That is, I do not know if that is a bug, or not, but that for sure messes up git.

As you state, on seeing that CR git will fail to translate the line ending. If I push the windows output with all CRLF, plus that single CR, then that is what I will get in linux. And, as of now, if I push a linux output with all LF, plus that single CR, then that is what I will get in windows. That is the mess we have now...
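The behaviour described here can be modelled in a few lines. This is a simplified sketch of how automatic line-ending detection appears to behave in this thread, not Git's actual code, and the function names are made up. The assumptions are that a NUL byte or a lone CR makes Git leave the file alone, and that on checkout a file that already contains any CR is not converted.

```python
def _stats(data: bytes):
    """Return (crlf, lone_cr, lone_lf, has_nul) counts for raw bytes."""
    crlf = data.count(b"\r\n")
    return crlf, data.count(b"\r") - crlf, data.count(b"\n") - crlf, b"\0" in data


def commit_converts(worktree: bytes) -> bool:
    """Model: would CRLF be turned into LF when committing with autocrlf=true?"""
    crlf, lone_cr, _lone_lf, has_nul = _stats(worktree)
    if has_nul or lone_cr:      # assumed to be guessed as binary
        return False
    return crlf > 0


def checkout_converts(blob: bytes) -> bool:
    """Model: would LF be turned into CRLF on checkout with autocrlf=true?"""
    crlf, lone_cr, lone_lf, has_nul = _stats(blob)
    if has_nul or lone_cr or crlf:   # any CR present: left untouched
        return False
    return lone_lf > 0
```

Under this model a Windows result with CRLF everywhere plus one stray CR is stored as-is on commit, and a Linux result with LF everywhere plus one stray CR is checked out as-is on Windows, which matches both outcomes reported above.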

But looking at the test carefully I do not think these two entities are really part of the test. The comments in the file suggest the fix being tested was the suppression of a spurious error Error: missing quotemark for attribute value.

And going back to a tidy circa 2000 - HTML Tidy release date: 4th August 2000 - does indeed issue this error. But by tidy circa 2004 - HTML Tidy for Windows released on 1st July 2004 - this error has been removed. Tidy has been fixed.

I think this is the original purpose of the test. The entities can safely be removed from the input source.

So if you agree, I will push a 500236-1 test, without these entities, fix the manifest, and expects, and I think we will have no further problem with this test.

661606

I too will try to find the time to look deeper into this...

Are we good?

431895 - gnu-emacs

Of course, even though I have reduced the file paths back to relative in the scripts, there are still the different path separators unix/win32. This must be the subject of a different type of test, OR an appropriate fix in Tidy code of course...

500236 - entities

As suggested above, added a 500236-1 test, only in the special folder, without the &#13;&#10; entities, and it now passes in Windows and Ubuntu... still to test in other environments...

661606 - shiftjis

No doubt due to the shiftjis content git considers this file binary, and will thus not do auto endline translations. Since I added the expects using Windows, it retains the CRLF in unix.

So I added newline: CRLF to its case-661606.conf config file, and now it passes in Windows and Linux... still to test other environments...

Just ran testspl.sh in RPI Raspbian, and all pass, except the expected 431895-emacs!

So now, barring failure in other OSes, most notably OS X, these other two, 500236-1 and 661606, and it looks like 431958, could be moved out of 'special' and added back to 'testbase'. Agreed?

What should we do about this 431895 test? I have offered a small patch to fix the emacs file name to use unix path separators. This would not be a frequently used, popular option in windows, and any intelligent editor that accepts ..\path\to\file.html:25 would also probably accept ../path/to/file.html:25!

While some windows commands, like dir, will absolutely not accept a unix forward slash, since that is an option switch, most windows editors, and other programs, do not have a problem. So, a small modification to tidy.c code could fix this.
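The idea behind that modification can be sketched as follows. This is only an illustration in Python with a hypothetical function name; the real change would be in Tidy's C code, and the file:line shape is taken from the example paths above.

```python
def emacs_location(path: str, line: int) -> str:
    """Format 'file:line' using forward slashes on every platform."""
    return f"{path.replace(chr(92), '/')}:{line}"   # chr(92) is the backslash
```

For instance, `emacs_location(r"..\path\to\file.html", 25)` gives `../path/to/file.html:25`, so the same expects file would match on unix and win32.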

Seek comments on this last outstanding 431895 problem.

Recent tests, with say the custom_tags branch, seem to indicate this is all solved, closed...