aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml

Home Page:https://aantron.github.io/lambdasoup

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Doctypes other than HTML5 aren't emitted correctly

dmbaturin opened this issue · comments

I'm sorry for not giving #32 proper testing!

Your implementation only works for the trivial HTML5 doctype. This makes it impossible to keep the real original doctype for any legacy pages.

Example:

$ cat templates/main.html 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html lang="en">
  <head>
...

$ utop
utop # #require "lambdasoup";;
utop # Soup.read_file "templates/main.html" |> Soup.parse |> Soup.pretty_print;;
- : string =
"<!DOCTYPE html><html lang=\"en\">\n <head>\n  

This is because Markup's HTML serializer follows the spec here, which allows these doctypes only:

  • <!DOCTYPE html>
  • <!DOCTYPE html SYSTEM "about:legacy-compat">

That's obviously too strict. I think this is best solved by actually stripping off the doctype if it is present, rather than sending it to Markup, and outputting it directly from Lambda Soup. If someone later wants legacy doctype support in HTML in Markup, we can move the support there and make Lambda Soup use that, as an internal change.

Also note that XHTML should probably be parsed using the XML parser in Markup, and serialized with the XML serializer. That should have precluded most of these issues, which arise from parsing XML with an HTML5 parser and serializing it with an HTML5 serializer.

I understand that the XML parser might not be as convenient, because you would have to use Soup.from_signals and Soup.signals to plug it into Lambda Soup.

Given that, how would you like to proceed?

Are you facing a mixture of documents, some of which are HTML5 and some XML, and need some kind of easy way of Markup/Lambda Soup automatically figuring out which parser and serializer to use? I haven't considered that before, but it might be the right thing to do.

It's an interesting question. It's not a problem I'm having personally, so it's mostly a theoretical issue. I suppose maybe we should leave it as is until a person who actually has that problem comes along.

Can you test again with opam pin add markup --dev-repo?

I modified Markup's HTML writer to output whatever doctype is there, instead of trying to output only HTML5 doctypes. This seems reasonable, given that there is Markup.html5 already available for the user to force an HTML5 doctype, so we should trust whatever doctype is present in the signal stream.

I also added a test to Lambda Soup for round-tripping the doctype from this issue. It passes with Markup master and fails with Markup 1.0.0 release.

See the two commits linked above if curious about the code.

If this works for you, I will release it in Markup 1.1.0. Lambda Soup won't need a release.

Note that Markup master now requires Dune 2.7.0, due to its dune files using the syntax for the new Bisect_ppx integration. If that's too high a constraint, I can downgrade it. I was hoping that the next release of Markup wouldn't be so soon, and there would be time for Dune 2.7.0 to become pretty old and established :)

I've just tested it and it seems to work fine. Thanks for the fix!

I actually have both markup and lambdasoup dependencies specified explicitly in soupault.opam due to some previous situation, so I don't mind.
I don't mind dune 2.7.0 either.

I'll give it a day in case anything else comes to mind, as the doctype is proving to be a rich source of problems for Lambda Soup and Markup.ml :P

Well, I'm working on some other improvements for soupault 2.1.0 too, I wasn't planning to release it right today either. ;)

One visual oddity: traditionally people put a newline between the doctype and the <html> tag when writing HTML by hand, but markup.ml doesn't. That's purely cosmetic of course.

Can you open an issue on Markup.ml about this and any other such things? I think we can fix this in the pretty printer (I am assuming based on the code at the beginning of this issue, that you are using it).

And ping me if/when you would like a release. I'll wait a bit to hopefully include several changes, and then do it.

aantron/markup.ml#58

I need to sit with the Markup.ml code to actually understand it so that I can contriubute in the future...

@aantron I've completed my work on changes I want in soupault 2.1.0, so I'm ready to release it whenever the new Markup release is available.

I plan to do aantron/markup.ml#59 and retag 1.0.0 and release it in the first half of the coming week.

Yeah, I've seen that issue. Whenever the re-tagged markup 1.0.0 is available from the OPAM repos, I'll set the dependency to markup { >= 1.0.0 } and tag the release.