Mojo::DOM doesn't recognize end of comment (when it should)
mauke opened this issue · comments
- Mojolicious version: 9.31
- Perl version: v5.36.0
- Operating system: Ubuntu 22.04.1 LTS
Steps to reproduce the behavior
#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;
for my $fragment (
'<!--> <p>OK</p> <!-- -->',
'<!---> <p>OK</p> <!-- -->',
'<!-- --!> <p>OK</p> <!-- -->',
) {
my $dom = Mojo::DOM->new("<!DOCTYPE html>\n$fragment");
is_deeply
$dom->find('p')->map(sub { $_->to_string })->to_array,
['<p>OK</p>'],
"HTML fragment '$fragment' parses as comment + p + comment";
}
done_testing;
Expected behavior
Test passes.
After <!--
we are in the comment start state. If the very next characters are >
or ->
, an abrupt-closing-of-empty-comment error occurs. If the parser doesn't abort entirely, it must treat the comment as closed and switch back to the data state.
Otherwise, if we see --!
inside a comment, we switch to the comment end bang state. If the very next character is a >
, an incorrectly-closed-comment error occurs. If the parser doesn't abort entirely, it must treat the comment as closed and switch back to the data state.
Actual behavior
not ok 1 - HTML fragment '<!--> <p>OK</p> <!-- -->' parses as comment + p + comment
# Failed test 'HTML fragment '<!--> <p>OK</p> <!-- -->' parses as comment + p + comment'
# at mojo-dom-bug-8.pl line 15.
# Structures begin differing at:
# $got->[0] = Does not exist
# $expected->[0] = '<p>OK</p>'
not ok 2 - HTML fragment '<!---> <p>OK</p> <!-- -->' parses as comment + p + comment
# Failed test 'HTML fragment '<!---> <p>OK</p> <!-- -->' parses as comment + p + comment'
# at mojo-dom-bug-8.pl line 15.
# Structures begin differing at:
# $got->[0] = Does not exist
# $expected->[0] = '<p>OK</p>'
not ok 3 - HTML fragment '<!-- --!> <p>OK</p> <!-- -->' parses as comment + p + comment
# Failed test 'HTML fragment '<!-- --!> <p>OK</p> <!-- -->' parses as comment + p + comment'
# at mojo-dom-bug-8.pl line 15.
# Structures begin differing at:
# $got->[0] = Does not exist
# $expected->[0] = '<p>OK</p>'
1..3
# Looks like you failed 3 tests of 3.
The entire block/line is treated like a comment:
perl -Mojo -E 'my $x = x "<!-- --!> <p>OK</p> <!-- -->"; say r $x'
bless( do{\(my $o = bless( {
"tree" => [
"root",
[
"comment",
" --!> <p>OK</p> <!-- ",
${$VAR1}->{"tree"}
]
]
}, 'Mojo::DOM::HTML' ))}, 'Mojo::DOM' )
In Mojo::DOM::HTML
:
In $TOKEN_RE
, although the comment portion is not greedy, it still consumes too much:
--(.*?)--\s*
Very interesting. Although we are using (.*?)
, it appears that there is perhaps some optimization when matching the trailing >
on the line with:
)>
Please take a look at the PR and let me know if that would work for you.
I included the test cases from the description.