aantron / lambdasoup

Functional HTML scraping and rewriting with CSS in OCaml

Home Page: https://aantron.github.io/lambdasoup

Should texts function include script tags?

mooreryan opened this issue

Hi aantron, I was parsing some HTML and got a result I thought was interesting. The contents of <script> tags are included in the output of the texts function. I can see why, since a script is text after all, but I was wondering if this is the intended behavior.

Just to make sure I didn't make any mistakes (and to show you what I mean), I wrote these little tests, which pass when added to the lambdasoup test suite.

( "texts-just-script-tags" >:: fun _ ->
  let soup = "<script>1 + 1</script>" |> parse in
  assert_equal (texts soup) [ "1 + 1" ] );

( "texts-script-tags" >:: fun _ ->
  let soup =
    "<article><div><p>hi</p></div><script>1 + 1</script></article>"
    |> parse
  in
  assert_equal (texts soup) [ "hi"; "1 + 1" ] );

Anyway, just wondering if this is the intended behavior. If so, I suppose the easiest approach would be to filter out <script> tags before calling texts? Thanks!

This is the intended behavior so far. Perhaps you can delete the <script> tags before calling texts, like this:

Soup.iter Soup.delete (soup $$ "script")
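For anyone landing here later, the suggestion above can be wrapped up roughly as follows. This is only a sketch: `texts_without_scripts` is a made-up helper name, not part of Lambda Soup, and collecting the matches into a list before deleting them is just a cautious choice on my part, not something the library is known to require.

```ocaml
(* Requires the lambdasoup library, e.g. (libraries lambdasoup) in dune. *)

(* Hypothetical helper, not part of Lambda Soup: parse [html], delete every
   <script> element, then return the remaining text nodes. *)
let texts_without_scripts (html : string) : string list =
  let soup = Soup.parse html in
  (* Collect the matching elements into a list first, so that deleting
     nodes cannot interfere with traversing the selection. *)
  soup |> Soup.select "script" |> Soup.to_list |> List.iter Soup.delete;
  Soup.texts soup
```

With the second test input from the original post, this should leave only "hi" in the result. Note that `Soup.delete` mutates the parsed document, so if the scripts are still needed elsewhere, parse the HTML a second time.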

Sounds good, thanks! I will close the issue now.