aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml

Home Page: https://aantron.github.io/lambdasoup

How to use `$`?

vitalydolgov opened this issue

This is not really an issue, but a question about using Lambda Soup. For some reason I cannot use the $ selector (here $?) after taking a node by index (the second statement after the binding). It works if I convert the node to a string and parse it again, or if I explicitly convert the node to an element.

Is this intentional behavior? In the source code I see no restriction on the node type, so I'm a bit confused...

# #require "lambdasoup";;

# open Soup;;

# let s = "<p class=\"txtRed\">AA * A<span class=\"txtNormal\">B</span> * A<span class=\"txtNormal\">C</span></p>";;
val s : string = ...

# s |> parse $ "p" |> children |> R.nth 2 |> to_string;;
- : string = "<span class=\"txtNormal\">B</span>"

# s |> parse $ "p" |> children |> R.nth 2 $? "span";;
- : element node option = None

# s |> parse $ "p" |> children |> R.nth 2 |> R.element |> name;;
- : string = "span"

# s |> parse $ "p" |> children |> R.nth 2 |> to_string |> parse $ "span" |> to_string;;
- : string = "<span class=\"txtNormal\">B</span>"

In

# s |> parse $ "p" |> children |> R.nth 2 $? "span";;

$? selects from the descendants of the given node, not from the node itself. Here the given node is the span, so the search runs over the DOM inside it, which is just the text B, and there are no elements at all to find there.
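A minimal sketch of the same situation, assuming the lambdasoup package is installed. The binding names (p, second, inside, from_parent, narrowed) are only illustrative. It shows the descendant search returning nothing from the span, and two ways to reach the span without re-parsing: selecting from the parent, or narrowing the general node with element.

```ocaml
(* Requires the lambdasoup package. *)
open Soup

let s =
  "<p class=\"txtRed\">AA * A<span class=\"txtNormal\">B</span> * A<span class=\"txtNormal\">C</span></p>"

let p = parse s $ "p"

(* The second child of <p> is the first <span>. *)
let second = p |> children |> R.nth 2

(* Searching the descendants of the span finds no <span>: it only contains text. *)
let inside = second $? "span"

(* Searching the descendants of <p> finds the span. *)
let from_parent = p $? "span"

(* Narrow the general node to an element node instead of re-parsing it. *)
let narrowed = element second

let () =
  print_endline (if Option.is_none inside then "none" else "some");
  Option.iter (fun e -> print_endline (to_string e)) from_parent;
  Option.iter (fun e -> print_endline (name e)) narrowed
```

If the goal is simply "the first span inside p", a selector such as "p span" passed to $ from the parsed document avoids indexing into children at all.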

This might be confusing because the top-level node returned by parse is not the <p> element, but a "soup" (document) node which contains the <p> element as its child. That is why $ "p" works on the result of parse: the <p> is a descendant of the document. It is done that way because, in general, the string you pass to parse may contain multiple elements, and indeed multiple nodes, since it might contain text at the top level.

Likewise, when you convert your span DOM to a string and pass it back to parse, you get back a DOM consisting of a document whose child is the span element. I guess it's pretty annoying and non-algebraic that trying to round-trip an element through the parser doesn't give back an element, but a document containing that element.
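The round trip described above can be sketched as follows, again assuming the lambdasoup package. The names span_html, doc, via_select and via_children are illustrative. The point is that parse yields a document node, and the span has to be fetched from it, either with a selector or by walking the document's children.

```ocaml
(* Requires the lambdasoup package. *)
open Soup

let span_html = "<span class=\"txtNormal\">B</span>"

(* parse returns a document ("soup") node, not the <span> itself. *)
let doc = parse span_html

(* The <span> is a descendant of the document, so selecting from doc finds it. *)
let via_select = doc $ "span"

(* The same element, reached through the document's children. *)
let via_children = doc |> children |> elements |> R.first

let () =
  print_endline (name via_select);
  print_endline (to_string via_children)
```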

@aantron thank you for the quick answer, now I get it. That's not a problem, the library is very convenient to use 😊