mojolicious / mojo

:sparkles: Mojolicious - Perl real-time web framework

Home Page: https://mojolicious.org

Mojo::DOM doesn't recognize end of comment (when it should)

mauke opened this issue · comments

commented
  • Mojolicious version: 9.31
  • Perl version: v5.36.0
  • Operating system: Ubuntu 22.04.1 LTS

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

for my $fragment (
    '<!--> <p>OK</p> <!-- -->',
    '<!---> <p>OK</p> <!-- -->',
    '<!-- --!> <p>OK</p> <!-- -->',
) {
    my $dom = Mojo::DOM->new("<!DOCTYPE html>\n$fragment");

    is_deeply
        $dom->find('p')->map(sub { $_->to_string })->to_array,
        ['<p>OK</p>'],
        "HTML fragment '$fragment' parses as comment + p + comment";
}

done_testing;

Expected behavior

Test passes.

After <!-- we are in the comment start state. If the very next characters are > or ->, an abrupt-closing-of-empty-comment error occurs. If the parser doesn't abort entirely, it must treat the comment as closed and switch back to the data state.

Otherwise, if we see --! inside a comment, we switch to the comment end bang state. If the very next character is a >, an incorrectly-closed-comment error occurs. If the parser doesn't abort entirely, it must treat the comment as closed and switch back to the data state.
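
The state transitions described above can be sketched as a small scanner in plain Perl. This is only an illustration of the tokenizer rules, not Mojo::DOM code: the function name `scan_comment`, the state names, and the return shape are all made up for this example, and the "nested comment" states are left out because they only report errors and do not change where the comment ends.

```perl
use strict;
use warnings;

# Hypothetical helper: $input is the text immediately after "<!--".
# Returns (comment data, remaining input, parse error name or undef).
sub scan_comment {
    my ($input) = @_;
    my ($state, $data, $pos) = ('start', '', 0);

    while ($pos < length $input) {
        my $c    = substr $input, $pos, 1;
        my $rest = substr $input, $pos + 1;

        if ($state eq 'start') {    # comment start state
            if    ($c eq '-') { $state = 'start_dash'; $pos++ }
            elsif ($c eq '>') { return ($data, $rest, 'abrupt-closing-of-empty-comment') }
            else              { $state = 'comment' }    # reconsume
        }
        elsif ($state eq 'start_dash') {    # comment start dash state
            if    ($c eq '-') { $state = 'end'; $pos++ }
            elsif ($c eq '>') { return ($data, $rest, 'abrupt-closing-of-empty-comment') }
            else              { $data .= '-'; $state = 'comment' }
        }
        elsif ($state eq 'comment') {       # comment state
            if ($c eq '-') { $state = 'end_dash' } else { $data .= $c }
            $pos++;
        }
        elsif ($state eq 'end_dash') {      # comment end dash state
            if ($c eq '-') { $state = 'end'; $pos++ }
            else           { $data .= '-'; $state = 'comment' }
        }
        elsif ($state eq 'end') {           # comment end state
            if    ($c eq '>') { return ($data, $rest, undef) }
            elsif ($c eq '!') { $state = 'end_bang'; $pos++ }
            elsif ($c eq '-') { $data .= '-'; $pos++ }
            else              { $data .= '--'; $state = 'comment' }
        }
        else {                              # comment end bang state
            if    ($c eq '-') { $data .= '--!'; $state = 'end_dash'; $pos++ }
            elsif ($c eq '>') { return ($data, $rest, 'incorrectly-closed-comment') }
            else              { $data .= '--!'; $state = 'comment' }
        }
    }
    return ($data, '', 'eof-in-comment');
}

# The three fragments from the test script, with the leading "<!--" removed.
for my $after ('> <p>OK</p> <!-- -->', '-> <p>OK</p> <!-- -->', ' --!> <p>OK</p> <!-- -->') {
    my ($data, $rest, $error) = scan_comment($after);
    printf "data=[%s] rest=[%s] error=%s\n", $data, $rest, $error // 'none';
}
```

In all three cases the scanner stops at the first `>` and leaves ` <p>OK</p> <!-- -->` as remaining input, which is why the `p` element is expected to survive.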

Actual behavior

not ok 1 - HTML fragment '<!--> <p>OK</p> <!-- -->' parses as comment + p + comment
#   Failed test 'HTML fragment '<!--> <p>OK</p> <!-- -->' parses as comment + p + comment'
#   at mojo-dom-bug-8.pl line 15.
#     Structures begin differing at:
#          $got->[0] = Does not exist
#     $expected->[0] = '<p>OK</p>'
not ok 2 - HTML fragment '<!---> <p>OK</p> <!-- -->' parses as comment + p + comment
#   Failed test 'HTML fragment '<!---> <p>OK</p> <!-- -->' parses as comment + p + comment'
#   at mojo-dom-bug-8.pl line 15.
#     Structures begin differing at:
#          $got->[0] = Does not exist
#     $expected->[0] = '<p>OK</p>'
not ok 3 - HTML fragment '<!-- --!> <p>OK</p> <!-- -->' parses as comment + p + comment
#   Failed test 'HTML fragment '<!-- --!> <p>OK</p> <!-- -->' parses as comment + p + comment'
#   at mojo-dom-bug-8.pl line 15.
#     Structures begin differing at:
#          $got->[0] = Does not exist
#     $expected->[0] = '<p>OK</p>'
1..3
# Looks like you failed 3 tests of 3.
commented

The entire block/line is treated like a comment:

perl -Mojo -E 'my $x = x "<!-- --!> <p>OK</p> <!-- -->"; say r $x'
bless( do{\(my $o = bless( {
    "tree" => [
      "root",
      [
        "comment",
        " --!> <p>OK</p> <!-- ",
        ${$VAR1}->{"tree"}
      ]
    ]
  }, 'Mojo::DOM::HTML' ))}, 'Mojo::DOM' )
commented

In Mojo::DOM::HTML:

In $TOKEN_RE, although the comment portion is not greedy, it still consumes too much:

--(.*?)--\s*

commented

Very interesting. Although we are using (.*?), the match still has to satisfy the trailing > that closes the group, so the lazy quantifier keeps expanding until it reaches a -- that is directly followed by optional whitespace and >:
)>
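
A lazy quantifier only prefers the shortest match that still lets the rest of the pattern succeed. The sketch below uses a reduced stand-in for the comment branch (an assumption: just `<!--(.*?)--\s*>`, not the full $TOKEN_RE) to show the effect on the fragments from the report. The second pattern, `$lenient`, and the helper `first_comment` are hypothetical and for illustration only; they are not taken from the PR.

```perl
use strict;
use warnings;

# Reduced stand-in for the comment branch of $TOKEN_RE (assumption:
# the real pattern has many more alternatives around it).
my $current = qr/<!--(.*?)--\s*>/s;

# Hypothetical alternative, for illustration only: also accept the
# abrupt closers "<!-->" and "<!--->" and the "--!>" closer.
my $lenient = qr/<!--(?:-?>|(.*?)--!?>)/s;

# Return the captured comment text of the first match, or undef if
# the pattern does not match at all.
sub first_comment {
    my ($re, $html) = @_;
    return undef unless $html =~ $re;
    return $1 // '';
}

for my $fragment (
    '<!--> <p>OK</p> <!-- -->',
    '<!---> <p>OK</p> <!-- -->',
    '<!-- --!> <p>OK</p> <!-- -->',
) {
    printf "current: [%s]  lenient: [%s]\n",
        first_comment($current, $fragment),
        first_comment($lenient, $fragment);
}
```

With the reduced pattern, the third fragment captures ` --!> <p>OK</p> <!-- `, the same comment text as in the dump earlier in the thread: the first `--` is followed by `!` rather than `>`, so the engine backtracks and extends `.*?` up to the final `-->`.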

commented

Please take a look at the PR and let me know if that would work for you.

I included the test cases from the description.