mojolicious / mojo

:sparkles: Mojolicious - Perl real-time web framework

Home Page:https://mojolicious.org

Mojo::DOM treats "< foo" as start tag, phantom elements ensue

mauke opened this issue · comments

commented
  • Mojolicious version: 9.31
  • Perl version: v5.36.0
  • Operating system: Ubuntu 22.04.1 LTS

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

my $dom = Mojo::DOM->new('if a < script then="<!--"> </script> <p>FAIL</p>-->');

is_deeply
    $dom->find('script, p')->map(sub { $_->to_string })->to_array,
    [],
    'fragment contains no tags, just a comment';

done_testing;

Expected behavior

Test passes.

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error. If the parser doesn't abort, it should treat the < character literally, as if &lt; had been seen.
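To make the rule above concrete, here is a minimal sketch in core Perl of how the tag open state could dispatch on the character after `<`. The function name `tag_open_action` and the returned labels are made up for illustration; this is not how Mojo::DOM::HTML is structured, and it only covers the single-character decision described above.

```perl
use strict;
use warnings;

# Hypothetical helper: decide what the character following "<" means in the
# HTML tag open state. Pass undef (or '') to represent end of input.
sub tag_open_action {
  my ($next) = @_;

  # EOF: eof-before-tag-name, the "<" is emitted as text
  return 'text' if !defined $next || $next eq '';

  return 'markup-declaration' if $next eq '!';              # <!-- ... -->, <!DOCTYPE ...>
  return 'end-tag-open'       if $next eq '/';              # </foo>
  return 'bogus-comment'      if $next eq '?';              # <?foo?> in HTML
  return 'start-tag'          if $next =~ /\A[A-Za-z]\z/;   # ASCII alpha only

  # Anything else, including whitespace: invalid-first-character-of-tag-name,
  # the "<" is treated as if "&lt;" had been seen
  return 'text';
}

# Usage: the space after "<" in "a < script" means no tag is opened
for my $c (' ', 's', '/', '!', '?', '1') {
  printf "%-4s => %s\n", "'$c'", tag_open_action($c);
}
```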

Actual behavior

not ok 1 - fragment contains no tags, just a comment
#   Failed test 'fragment contains no tags, just a comment'
#   at mojo-dom-bug-9.pl line 10.
#     Structures begin differing at:
#          $got->[0] = '<script then="&lt;!--"> </script>'
#     $expected->[0] = Does not exist
1..1
# Looks like you failed 1 test of 1.
commented

The following change would implement the correct (at least for HTML) behavior:

diff --git lib/Mojo/DOM/HTML.pm lib/Mojo/DOM/HTML.pm
index e10b1532d..81f54c014 100644
--- lib/Mojo/DOM/HTML.pm
+++ lib/Mojo/DOM/HTML.pm
@@ -36,8 +36,10 @@ my $TOKEN_RE = qr/
     |
       \?(.*?)\?                                                                # Processing Instruction
     |
-      \s*((?:\/\s*)?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)   # Tag
+      (\/?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)             # Tag
     )>
+  |
+    <\/ (?![a-z]) ([^>]*) >                                                    # Invalid-first-character-of-tag-name error (bogus comment)
   |
     (<)                                                                        # Runaway "<"
   )??
@@ -101,12 +103,15 @@ sub parse {
   my $xml     = $self->xml;
   my $current = my $tree = ['root'];
   while ($html =~ /\G$TOKEN_RE/gcso) {
-    my ($text, $doctype, $comment, $cdata, $pi, $tag, $runaway) = ($1, $2, $3, $4, $5, $6, $11);
+    my ($text, $doctype, $comment, $cdata, $pi, $tag, $bogus_comment, $runaway) = ($1, $2, $3, $4, $5, $6, $11, $12);
 
     # Text (and runaway "<")
     $text .= '<'                                 if defined $runaway;
     _node($current, 'text', html_unescape $text) if defined $text;
 
+    # Malformed end tag
+    $comment = $bogus_comment if length $bogus_comment;
+
     # Tag
     if (defined $tag) {
 

That is:

  • < foo> treated as &lt; foo&gt; (invalid-first-character-of-tag-name)
  • </ foo> treated as <!-- foo--> (invalid-first-character-of-tag-name)
  • </> ignored (missing-end-tag-name)
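The three cases above can be sketched as a naive string rewrite in core Perl. `normalize_malformed` is a hypothetical helper, not part of Mojolicious, and it is deliberately context-free: it does not know about attribute values, comments, or script content, so treat it as an illustration of the mapping rather than a parser. The stray `>` after a literal `<` is left as is, which is equivalent to `&gt;` in text.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite malformed "<" sequences into the markup that
# the HTML tokenizer effectively sees.
sub normalize_malformed {
  my ($html) = @_;
  $html =~ s{
      (</>)                        # missing-end-tag-name: dropped
    |
      </ (?![A-Za-z]) ([^>]*) >    # "</" + non-alpha: bogus comment
    |
      < (?![A-Za-z!/?])            # "<" + anything else invalid: literal "<"
  }{
    defined($1) ? '' : defined($2) ? '<!--' . $2 . '-->' : '&lt;'
  }gex;
  return $html;
}

# Usage
print normalize_malformed($_), "\n" for '< foo>', '</ foo>', 'a</>b';
```

Applied to the fragment from the bug report, only the first `<` changes, so everything up to `<!--` becomes plain text and the rest is a comment.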

The problem is all the tests in t/mojo/dom.t that rely on the previous behavior. There are so many that I'm not sure what to do about them.

If tests are wrong then they should be fixed.

commented

The problem is stuff like this:

subtest 'XML name characters' => sub {
  my $dom = Mojo::DOM->new->xml(1)->parse('<Foo><1a>foo</1a></Foo>');
  is $dom->at('Foo')->text, '<1a>foo</1a>',                        'right text';
  is "$dom",                '<Foo>&lt;1a&gt;foo&lt;/1a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.a>foo</.a></Foo>');
  is $dom->at('Foo')->text, '<.a>foo</.a>',                        'right text';
  is "$dom",                '<Foo>&lt;.a&gt;foo&lt;/.a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.>foo</.></Foo>');
  is $dom->at('Foo')->text, '<.>foo</.>',                        'right text';
  is "$dom",                '<Foo>&lt;.&gt;foo&lt;/.&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><-a>foo</-a></Foo>');
  is $dom->at('Foo')->text, '<-a>foo</-a>',                        'right text';
  is "$dom",                '<Foo>&lt;-a&gt;foo&lt;/-a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a1>foo</a1></Foo>');
  is $dom->at('Foo a1')->text, 'foo',                     'right text';
  is "$dom",                   '<Foo><a1>foo</a1></Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a .b -c 1>foo</a></Foo>');
  is $dom->at('Foo')->text, '<a .b -c 1>foo',                  'right text';
  is "$dom",                '<Foo>&lt;a .b -c 1&gt;foo</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<😄 😄="😄">foo</😄>');
  is $dom->at('😄')->text, 'foo',              'right text';
  is "$dom",              '<😄 😄="😄">foo</😄>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<こんにちは こんにちは="こんにちは">foo</こんにちは>');
  is $dom->at('こんにちは')->text, 'foo',                              'right text';
  is "$dom",                  '<こんにちは こんにちは="こんにちは">foo</こんにちは>', 'right result';
};

It specifically tests for "incorrect" (for HTML) behavior of the parser. I don't know enough about XML to say whether this is correct for XML, but if so, you might need different tokenizers for HTML and XML. :-/


PS: In HTML mode, the correct parse for

<😄 😄="😄">foo</😄>

would be

&lt;😄 😄="😄"&gt;foo<!--😄-->

I don't see the relation between < foo and <😄 😄="😄">. And I'm pretty sure the latter is valid XML, since @mojojs/dom is more strict about what code points to allow for names (based on the XML spec). A separate tokenizer is out of the question, but I don't see the harm in being a bit more relaxed with names anyway. Using the same character ranges as @mojojs/dom would of course be an option.

commented

Recall:

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error.

Like space, 😄 is not !, /, ?, or ASCII alpha, so it is a parse error.

Consider this example:

<😄 title="<script>console.log('hi');</script>"></😄>

According to the HTML5 spec, this contains a script element because it parses like

&lt;😄 title=&quot;<script>console.log('hi');</script>&quot;&gt;<!--😄-->

But Mojo::DOM doesn't see the <script>.

This is essentially unfixable: Browsers will always see a different document structure than Mojo::DOM as long as tag names can start with non-ascii-alpha characters.
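A rough way to check that claim without a browser is a small scanner in core Perl that only opens a tag when `<` is followed by an ASCII letter. `start_tag_names` is a hypothetical helper and a heavy simplification of the HTML tokenizer (no script data state, no character references, no DOCTYPE handling), but it is enough to show where the `<script>` comes from in the example above.

```perl
use strict;
use warnings;

# Hypothetical helper: list the start tags a tokenizer would see if tag names
# must begin with an ASCII letter. Simplified sketch, not a full HTML parser.
sub start_tag_names {
  my ($html) = @_;
  my @names;
  pos($html) = 0;
  while (pos($html) < length $html) {
    if ($html =~ m{\G<!--.*?-->}gcs) {
      # Comment: skipped
    }
    elsif ($html =~ m{\G<([A-Za-z][^\s/>]*)(?:"[^"]*"|'[^']*'|[^>"'])*>}gc) {
      push @names, lc $1;    # Start tag, quoted attribute values may contain ">"
    }
    elsif ($html =~ m{\G</[A-Za-z][^>]*>}gc) {
      # End tag: skipped
    }
    elsif ($html =~ m{\G</(?=[^A-Za-z>])[^>]*>}gc) {
      # "</" + non-alpha: bogus comment, skipped
    }
    else {
      $html =~ m{\G.}gcs;    # Text, including a "<" that opens nothing
    }
  }
  return @names;
}

# Usage: \x{1F604} is the emoji from the example
my $html = qq{<\x{1F604} title="<script>console.log('hi');</script>"></\x{1F604}>};
print join(', ', start_tag_names($html)), "\n";
```

Because the emoji cannot start a tag name, the attribute-looking text after it is never treated as an attribute, so the quoted `<script>` is reached as ordinary markup.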

That is certainly an interesting case. 🤔

I think the HTML/XML overlap is where we draw the line with correctness, and this will remain a case we handle like it was XML. But we can still make other cases that do not conflict with XML more strict.