XML attribute value with ">" breaks syntax highlighting

Question

boghyon opened this issue 6 years ago

This issue is similar to #339, but this time it's about > instead of <.

According to the XML specification, the left angle bracket (<) MUST be escaped. (no problem)
The right angle bracket (>), however, doesn't need to be.

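The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.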

Borrowing @amroamroamro's example, you can see that this document is valid:

<?xml version="1.0"?>
<Person AgeCategory=">3" ></Person>

With prettify, however, the highlighting unfortunately breaks.

Source: OpenUI5 Walkthrough

Amro · Answer 1 · Thu Oct 11 2018 20:52:56 GMT+0800 (China Standard Time)

You are right, <Person AgeCategory=">3" ></Person> is valid XML/HTML.
So this is a bug.

FYI, here's the part that handles HTML/XML markup:

https://github.com/google/code-prettify/blob/453bd5f51e61245339b738b1bbdd42d7848722ba/js-modules/prettify.js#L700-L739

And the offending regular expression that matches tags is this one:

['lang-in.tag',  /^(<\/?[a-z][^<>]*>)/i]

The text captured by this pattern is then forwarded to the 'lang-in.tag' handler, which in turn runs on that text and decorates the tokens inside it by its own rules, such as:

[PR_ATTRIB_VALUE, /^(?:\"[^\"]*\"?|\'[^\']*\'?)/, null, '\"\'']
[PR_TAG,          /^^<\/?[a-z](?:[\w.:-]*\w)?|\/?>$/i]
[PR_ATTRIB_NAME,  /^(?!style[\s=]|on)[a-z](?:[\w:-]*\w)?/i]

Given the first regexp above, /^(<\/?[a-z][^<>]*>)/i, you can see how it would correctly match something like <tag name="val">, but breaks for something like <tag name=">val">, where [^<>]* stops at the first > and the match ends early at <tag name=">:

https://regexr.com/412kc
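
For anyone who would rather check this locally than follow the link, here is a minimal sketch that runs the tag pattern quoted above against both inputs. Only the regular expression comes from prettify.js; the helper name matchTag is made up for this illustration.

```javascript
// The tag-matching pattern quoted above, copied verbatim from prettify.js.
const TAG_PATTERN = /^(<\/?[a-z][^<>]*>)/i;

// Hypothetical helper (not part of code-prettify): returns the text that the
// pattern treats as one tag, or null if the input does not start with a tag.
function matchTag(source) {
  const m = TAG_PATTERN.exec(source);
  return m ? m[1] : null;
}

const plain = matchTag('<tag name="val">');
const tricky = matchTag('<tag name=">val">');

console.log(plain);  // <tag name="val">
console.log(tricky); // <tag name=">
```

In the second case the rest of the input, val">, is never handed to the 'lang-in.tag' handler at all, which is why the highlighting goes wrong from that point on.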

Hence why you must escape > inside attribute values as &gt; (on top of <, which the spec already requires): really for code-prettify's sake, not the W3C spec's :)
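
One possible direction for a fix, instead of escaping, would be to let the tag pattern step over quoted attribute values as a unit. The sketch below is only an assumption about what such a pattern could look like: QUOTE_AWARE_TAG and firstTag are names invented here, and none of this is taken from code-prettify or tested against its other rules.

```javascript
// Original pattern from prettify.js, kept for comparison.
const ORIGINAL_TAG = /^(<\/?[a-z][^<>]*>)/i;

// Assumed alternative (NOT from code-prettify): inside the tag, consume either
// a complete "..." or '...' attribute value, or any single character that is
// not <, >, or a quote. A > inside a quoted value no longer ends the tag.
const QUOTE_AWARE_TAG = /^(<\/?[a-z](?:"[^"]*"|'[^']*'|[^<>"'])*>)/i;

// Hypothetical helper: returns the leading tag matched by `pattern`, or null.
function firstTag(pattern, source) {
  const m = pattern.exec(source);
  return m ? m[1] : null;
}

const doc = '<Person AgeCategory=">3" ></Person>';
const before = firstTag(ORIGINAL_TAG, doc);
const after = firstTag(QUOTE_AWARE_TAG, doc);

console.log(before); // <Person AgeCategory=">
console.log(after);  // <Person AgeCategory=">3" >
```

The trade-off is that this version refuses to match a tag with an unterminated quote (for example <a href="x>), whereas the PR_ATTRIB_VALUE rule above deliberately tolerates a missing closing quote. It also accepts a raw < inside a quoted value, which XML does not allow.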

Amro · Answer 2 · Thu Oct 11 2018 20:56:06 GMT+0800 (China Standard Time)

I feel like this should be mentioned in a FAQ somewhere: code-prettify does not implement a full-blown parser, it simply attempts to do syntax highlighting using regular expressions. I say "attempts" because it cannot correctly highlight every piece of code using only regexps. But on the web, and for the purpose of presenting snippets of code, small highlighting errors are usually acceptable given the gains in speed and size compared to implementing a full parser for every supported language.