z7zmey / php-parser

PHP parser written in Go

Home Page:https://php-parser.com

Unknown unicode characters in inline HTML cause syntax errors

imuli opened this issue

When parsing

hi 󰀄 bye

I get

==> plane_15.php
syntax error: unexpected $unk at line 1
  | [*node.Root]
  |   "Position": Pos{Line: 1-1 Pos: 1-12};
  |   "Stmts":
  |     [*stmt.InlineHtml]
  |       "Position": Pos{Line: 1-1 Pos: 1-3};
  |       "Value": hi ;
  |     [*stmt.InlineHtml]
  |       "Position": Pos{Line: 1-1 Pos: 9-12};
  |       "Value": bye
;

rather than

==> plane_15.php
  | [*node.Root]
  |   "Position": Pos{Line: 1-1 Pos: 1-12};
  |   "Stmts":
  |     [*stmt.InlineHtml]
  |       "Position": Pos{Line: 1-1 Pos: 1-12};
  |       "Value": hi 󰀄 bye
;

The character in there is U+F0004, in Supplementary Private Use Area-A. Code points in that block are commonly used with custom fonts to render character-like glyphs (icons, for example) in text on the web.

I'll submit a pull request with the fix, which simply separates EOF from other uncategorized characters in the classifier.
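Roughly, the idea looks like the sketch below. This is not the parser's actual code: the function name `classify`, the `eof` sentinel, and the class constants are all assumptions made up for illustration. What matters is that end of input gets its own class, while everything else that is not a letter, digit, or graphic character falls into a separate "other" class that the lexer can treat as ordinary inline HTML content.

```go
package main

import (
	"fmt"
	"unicode"
)

// Hypothetical character classes; names and values are illustrative only.
const (
	classEOF     = -1   // end of input
	classLetter  = 0x80 // non-ASCII letters
	classDigit   = 0x81 // non-ASCII digits
	classGraphic = 0x82 // other non-ASCII graphic characters
	classOther   = 0x83 // anything else, e.g. private-use code points
)

// eof is an assumed sentinel the lexer passes when input is exhausted.
const eof rune = -1

// classify maps a rune to a lexer character class. End of input is
// checked first so it can never be confused with an unclassified rune.
func classify(r rune) int {
	switch {
	case r == eof:
		return classEOF
	case r >= 0 && r < 0x80:
		return int(r) // keep ASCII as-is
	case unicode.IsLetter(r):
		return classLetter
	case unicode.IsDigit(r):
		return classDigit
	case unicode.IsGraphic(r):
		return classGraphic
	default:
		return classOther
	}
}

func main() {
	fmt.Println(classify(0xF0004) == classOther) // private-use rune is "other", not EOF
	fmt.Println(classify(eof) == classEOF)
}
```

With the two cases sharing one class, the lexer has no way to tell U+F0004 apart from the end of the file, which would explain the inline HTML being cut off at that character.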