aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml

Home Page: https://aantron.github.io/lambdasoup

<frame> markup seems not to be detected

yannham opened this issue · comments

I ran into a strange problem while writing a small scraping library with lambdasoup. On my simple HTML test file, lambdasoup doesn't seem to be able to select the frame element. The page should at least be valid XML (though it may violate some HTML rules about where elements are allowed).

let page = Soup.read_file "index.html" |> Soup.parse;;
page $ "frame";;

gives, in utop:
Exception: Failure "Soup.($): cannot select 'frame'".

whereas selecting anything else (img, form, frameS, ul, li, div, etc.) works fine.
I'm using OCaml 4.03.0 with lambdasoup 0.6.1.
You can find my test page here: yago.gb2n.org/test-lambdasoup.html

Markup.ml drops frame elements in body. This behavior is compliant with the HTML5 specification (search for text "frame", including quotes, in that section – sorry, the link is the closest anchor I could find):

8.2.5.4.7 The "in body" insertion mode
-> A start tag whose tag name is one of: "caption", "col", "colgroup", "frame", "head",
   "tbody", "td", "tfoot", "th", "thead", "tr"
     Parse error. Ignore the token.

I inspected the DOM in Chrome, and Chrome likewise dropped the frame elements from both the body and the table.

Do you mean iframe? frame is only allowed inside frameset. AFAIK there is also no frames tag in HTML.
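
To make the difference concrete, here is a minimal sketch that parses a few inline strings instead of a file. The helper name matches and the sample markup are made up for illustration; only Soup.parse and Soup.($?) come from lambdasoup. Using $? instead of $ returns an option rather than raising Failure when nothing matches.

```ocaml
open Soup

(* Hypothetical helper (not part of lambdasoup): does [selector] match
   anything in [html] after parsing it with the HTML5 rules? *)
let matches html selector =
  match parse html $? selector with
  | Some _ -> true
  | None -> false

(* A frame directly inside body: the case from the report above. *)
let frame_in_body =
  matches "<html><body><frame src='a.html'></body></html>" "frame"

(* An iframe is allowed in body. *)
let iframe_in_body =
  matches "<html><body><iframe src='a.html'></iframe></body></html>" "iframe"

(* A frame inside a frameset, the only place the spec allows it. *)
let frame_in_frameset =
  matches "<html><frameset><frame src='a.html'></frameset></html>" "frame"

let () =
  Printf.printf "frame in body: %b\niframe in body: %b\nframe in frameset: %b\n"
    frame_in_body iframe_in_body frame_in_frameset
```

The first case should come out false, for the reason quoted from the spec above, and the iframe case should come out true.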

If you will be parsing bad HTML, you might want to do this for easier debugging:

let report location error =
  prerr_endline (Markup.Error.to_string ~location error) in
let page =
  Markup.(file "index.html" |> fst |> parse_html ~report |> signals)
  |> Soup.from_signals
in
ignore (page $ "frame")
(* ... *)

This shows the errors:

line 45, column 9: misnested tag: 'frame' in 'table'
line 50, column 5: misnested tag: 'frame' in 'body'

Indeed, showing Markup.ml errors reveals several other problems with the markup:

  • The title element should be inside head (outside head, it silently creates a head element, and then there is an explicit head element, which is an error).
  • There shouldn't be (IIRC) an img or frames tag at the top level of a table.
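
If you would rather inspect the errors from code than read them on stderr, the same ~report hook can collect them into a list. This is only a sketch: html_errors is a hypothetical helper name, and the inline markup is an invented stand-in for the test page. Markup.ml streams are lazy, so the stream has to be consumed (here with Markup.drain) before any errors are reported.

```ocaml
(* Hypothetical helper: parse [html] in HTML mode and return every error
   Markup.ml reports, formatted as strings, in the order reported. *)
let html_errors html =
  let errors = ref [] in
  let report location error =
    errors := Markup.Error.to_string ~location error :: !errors
  in
  (* Nothing is parsed until the signal stream is consumed. *)
  Markup.string html
  |> Markup.parse_html ~report
  |> Markup.signals
  |> Markup.drain;
  List.rev !errors

let () =
  (* Invented sample input with a frame at the top level of a table. *)
  html_errors "<html><body><table><frame src='a.html'></table></body></html>"
  |> List.iter prerr_endline
```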

By the way, if you want to disregard HTML rules, you may be able to get by with parsing as XML:

let page =
  Markup.(file "index.html" |> fst |> parse_xml |> signals)
  |> Soup.from_signals
in
(* ... *)

But in HTML mode, what gets parsed corresponds to what browsers accept and users actually see (modulo any bugs lurking in Markup.ml).

I see, it makes sense now! I don't know why I assumed it would be parsed as XML by default. Thanks for the answer.

Sure. Amendment: to parse HTML as XML you should probably translate HTML entities as well:

Markup.(parse_xml ~entity:xhtml_entity)
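
Put together, an XML-mode pipeline might look like the sketch below. The names parse_as_xml and frame_sources and the inline document are invented for illustration, and the input has to be well-formed XML (note the self-closing frame). List.filter_map needs OCaml 4.08 or later.

```ocaml
open Soup

(* Hypothetical helper: parse [markup] with the XML parser, so none of the
   HTML nesting rules apply; HTML entities are translated via xhtml_entity. *)
let parse_as_xml markup =
  Markup.string markup
  |> Markup.parse_xml ~entity:Markup.xhtml_entity
  |> Markup.signals
  |> from_signals

(* Collect the src attribute of every frame element in an invented sample. *)
let frame_sources =
  parse_as_xml
    "<html><body><frame src=\"a.html\"/><p>caf&eacute;</p></body></html>"
  $$ "frame"
  |> to_list
  |> List.filter_map (attribute "src")

let () = List.iter print_endline frame_sources
```

The trade-off is the one mentioned above: in this mode the tree no longer corresponds to what a browser would build from the same page.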

Oh, and if you want a simple command-line tool for showing Markup.ml errors (i.e. everything in the syntax section of the HTML spec), @fxfactorial made valentine for this purpose a while back :)

Seems cool, I'll take a look!