Mojo::DOM misparses <script> elements
mauke opened this issue
- Mojolicious version: 9.30
- Perl version: v5.36.0
- Operating system: Ubuntu 22.04.1 LTS
Steps to reproduce the behavior
#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do { local $/; scalar readline DATA });
say for $dom->find('div')->each;
__DATA__
<!DOCTYPE html>
<h1>Welcome to HTML</h1>
<script>
console.log('< /script> is safe');
/* <div>XXX this is not a div element</div> */
</script>
Expected behavior
No output, as the document contains no div elements. (document.querySelectorAll('div') in a browser agrees.)
Actual behavior
Output:
<div>XXX this is not a div element</div>
I've not looked at the spec yet, but this would probably be the section to check for the correct behavior.
This one looks relevant: https://html.spec.whatwg.org/multipage/parsing.html#script-data-less-than-sign-state
After seeing a < in a <script> element, the parser looks at the next character. Only ! and / are special. For any other character (including space), the < is parsed literally and scanning continues.
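To make that rule concrete, here is a minimal sketch in plain Perl. It is not the Mojo::DOM code and not the actual fix. split_script_body is a hypothetical helper, and it deliberately ignores the tokenizer's "<!--" escaped states. It only treats "</script" followed by whitespace, "/" or ">" as the end of the script data.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mojo::DOM): split the text that follows a
# <script> start tag into the raw script body and the remaining markup.
# Simplification: the "<!--" escaped states of the tokenizer are ignored.
sub split_script_body {
    my ($html) = @_;

    # An end tag is "</script" followed by whitespace, "/" or ">",
    # matched case-insensitively. "< /script>" does not qualify because
    # the "<" is followed by a space rather than "/".
    if ($html =~ m{</script(?=[\t\n\f\r />])[^>]*>}i) {
        return (substr($html, 0, $-[0]), substr($html, $+[0]));
    }

    # No end tag found: everything up to end of input is script data.
    return ($html, '');
}

my $input = <<'EOF';

console.log('< /script> is safe');
/* <div>XXX this is not a div element</div> */
</script><p>after</p>
EOF

my ($body, $rest) = split_script_body($input);
print "script body:\n$body";
print "rest: $rest";
```

With the input from this report, the "< /script>" inside the string literal and the <div> inside the comment both stay in the script body, and only the final </script> ends it.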
This line probably needs some fixing.
xmllint appears to recognize the <script> block all the way to the final closing </script> (though it seems to have issues with comments):
$ xmllint --html --debug mojo-issue-2014.html
mojo-issue-2014.html:5: HTML parser error : Unexpected end tag : div
/* <div>XXX this is not a div element</div> */
^
HTML DOCUMENT
URL=mojo-issue-2014.html
standalone=true
DTD(html)
ELEMENT html
ELEMENT body
ELEMENT h1
TEXT
content=Welcome to HTML
TEXT
content=
ELEMENT script
CDATA_SECTION
content= console.log('< /script> is safe'); ...
$ xmllint --html --xpath //div mojo-issue-2014.html
mojo-issue-2014.html:5: HTML parser error : Unexpected end tag : div
/* <div>XXX this is not a div element</div> */
^
XPath set is empty
$ xmllint --html --xpath //script mojo-issue-2014.html
mojo-issue-2014.html:5: HTML parser error : Unexpected end tag : div
/* <div>XXX this is not a div element</div> */
^
<script><![CDATA[
console.log('< /script> is safe');
/* <div>XXX this is not a div element */
]]></script>
Also fixed in @mojojs/dom. mojolicious/dom.js@90ad748