googlearchive / code-prettify

An embeddable script that makes source-code snippets in HTML prettier.

XML attribute value with ">" breaks syntax highlighting

boghyon opened this issue · comments

This issue is similar to #339, but this time it's about > instead of <.

  • According to the XML specification, the left angle bracket (<) MUST be escaped. (no problem)
  • The right angle bracket (>), however, doesn't need to be.

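    The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.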

Borrowing @amroamroamro's example, you can see here that this document is valid

<?xml version="1.0"?>
<Person AgeCategory=">3" ></Person>

[Image: right angle bracket is a valid character in XML]

With prettify, unfortunately, the highlighting breaks.

[Screenshot: sample broken highlighting. Source: OpenUI5 Walkthrough]

commented

You are right, <Person AgeCategory=">3" ></Person> is valid XML/HTML.
So this is a bug.


FYI, here's the part that handles HTML/XML markup:

https://github.com/google/code-prettify/blob/453bd5f51e61245339b738b1bbdd42d7848722ba/js-modules/prettify.js#L700-L739

And the offending regular expression that matches tags is this one:

['lang-in.tag',  /^(<\/?[a-z][^<>]*>)/i]

The text captured by this pattern is then forwarded to the 'lang-in.tag' handler, which in turn runs on the captured text and decorates the tokens inside it by its own rules, like:

[PR_ATTRIB_VALUE, /^(?:\"[^\"]*\"?|\'[^\']*\'?)/, null, '\"\'']
[PR_TAG,          /^^<\/?[a-z](?:[\w.:-]*\w)?|\/?>$/i]
[PR_ATTRIB_NAME,  /^(?!style[\s=]|on)[a-z](?:[\w:-]*\w)?/i]
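
To make the effect of these rules concrete, here is a minimal sketch that applies them to the text of a single tag. It is a simplified stand-in, not prettify's actual lexer: `tokenizeTag` and the short style names are hypothetical, the loop simply tries each rule at the start of the remaining input, and the whitespace and punctuation rules are assumptions added so that every character gets classified.

```javascript
// Simplified stand-in for the 'lang-in.tag' rules quoted above.
// Style names are plain strings here; prettify uses constants like PR_TAG.
const rules = [
  ['pln', /^\s+/],                                      // assumed whitespace rule
  ['atv', /^(?:\"[^\"]*\"?|\'[^\']*\'?)/],              // attribute value
  ['tag', /^^<\/?[a-z](?:[\w.:-]*\w)?|\/?>$/i],         // tag name and closing bracket
  ['atn', /^(?!style[\s=]|on)[a-z](?:[\w:-]*\w)?/i],    // attribute name
  ['pun', /^[=<>\/]+/],                                 // assumed punctuation rule
];

// Hypothetical helper: split the text of one tag into [style, text] pairs.
function tokenizeTag(source) {
  const tokens = [];
  let rest = source;
  while (rest.length > 0) {
    let style = 'pln';
    let text = rest[0]; // fallback: consume one character as plain text
    for (const [name, pattern] of rules) {
      const m = pattern.exec(rest);
      // Accept only matches at the start of the remaining input; the second
      // branch of the tag pattern (\/?>$) is not anchored with ^.
      if (m && m.index === 0 && m[0].length > 0) {
        style = name;
        text = m[0];
        break;
      }
    }
    tokens.push([style, text]);
    rest = rest.slice(text.length);
  }
  return tokens;
}

console.log(tokenizeTag('<tag name="val">'));
// [['tag','<tag'], ['pln',' '], ['atn','name'], ['pun','='], ['atv','"val"'], ['tag','>']]

// If the handler only receives text that stops right after a ">" following
// an opening quote, the ">" is classified as part of an unterminated value.
console.log(tokenizeTag('<tag name=">'));
// [['tag','<tag'], ['pln',' '], ['atn','name'], ['pun','='], ['atv','">']]
```

Note that the attribute-value rule makes the closing quote optional, so an unterminated value still counts as a value.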

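Given the first regexp above, /^(<\/?[a-z][^<>]*>)/i, you can see how it correctly matches something like <tag name="val"> but breaks for something like <tag name=">val">. Because [^<>]* cannot cross a >, the match ends at the first >, even when that > sits inside a quoted attribute value, so only <tag name="> is captured: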

https://regexr.com/412kc
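
The same behaviour can be reproduced outside regexr with a few lines of JavaScript. The sketch below runs the tag pattern quoted above against both kinds of input. For comparison it also includes `quoteAwareTag`, a hypothetical variant that consumes quoted values as a unit. That variant is only an illustration of one possible direction, not code-prettify's actual fix, and `captureTag` is a helper name made up for this example.

```javascript
// Tag pattern quoted in the comment above (from prettify.js).
const originalTag = /^(<\/?[a-z][^<>]*>)/i;

// Hypothetical quote-aware variant, NOT code-prettify's actual fix:
// quoted attribute values are consumed as a unit, so a ">" inside
// them no longer ends the match.
const quoteAwareTag = /^(<\/?[a-z](?:[^<>"']|"[^"]*"|'[^']*')*>)/i;

// Return the captured tag text, or null when the pattern does not match.
function captureTag(pattern, source) {
  const match = pattern.exec(source);
  return match ? match[1] : null;
}

const plain = '<tag name="val">';
const tricky = '<Person AgeCategory=">3" ></Person>';

console.log(captureTag(originalTag, plain));    // <tag name="val">
console.log(captureTag(originalTag, tricky));   // <Person AgeCategory=">
console.log(captureTag(quoteAwareTag, tricky)); // <Person AgeCategory=">3" >
```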

Hence you have to escape > (as well as <) inside attribute values; in the case of >, that is really for code-prettify's sake, not because the XML spec requires it :)
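
As a workaround along those lines, the escaping can be done mechanically before a snippet is handed to the highlighter. The following is a rough sketch, assuming the input is the raw XML text (before any HTML-encoding needed to embed it in a page). `escapeAnglesInAttributeValues` and both regular expressions are hypothetical helpers written for this example and are not part of code-prettify.

```javascript
// Matches a whole tag, treating quoted attribute values as a unit
// (hypothetical helper pattern, not taken from code-prettify).
const QUOTE_AWARE_TAG = /<\/?[a-z](?:[^<>"']|"[^"]*"|'[^']*')*>/gi;

// Matches one quoted attribute value, including its quotes.
const QUOTED_VALUE = /"[^"]*"|'[^']*'/g;

// Replace "<" and ">" inside quoted attribute values with entities,
// leaving the tag delimiters themselves untouched.
function escapeAnglesInAttributeValues(markup) {
  return markup.replace(QUOTE_AWARE_TAG, (tag) =>
    tag.replace(QUOTED_VALUE, (value) =>
      value.replace(/</g, '&lt;').replace(/>/g, '&gt;')
    )
  );
}

const escaped = escapeAnglesInAttributeValues('<Person AgeCategory=">3" ></Person>');
console.log(escaped); // <Person AgeCategory="&gt;3" ></Person>

// The tag pattern quoted earlier now captures the complete opening tag.
console.log(/^(<\/?[a-z][^<>]*>)/i.exec(escaped)[1]); // <Person AgeCategory="&gt;3" >
```

The escaped form is equivalent XML, so the snippet's meaning is unchanged; only its displayed source differs.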

commented

I feel like this should be mentioned in a FAQ somewhere: code-prettify does not implement a full-blown parser; it simply attempts to do syntax highlighting using regular expressions. I say "attempts" because it cannot correctly highlight every piece of code using only regexps. But on the web, and for the purpose of presenting snippets of code, small highlighting errors are usually acceptable given the gains in speed and size compared to implementing a full parser for every supported language.