_amps Regex

Question

_amps Regex

GoogleCodeExporter opened this issue 9 years ago · comments

Google Code Exporter commented 9 years ago

private static Regex _amps = new Regex(@"&(?!#?[xX]?([0-9a-fA-F]+|\w+);)", 
RegexOptions.ExplicitCapture | RegexOptions.Compiled);

The idea of this regular expressions seems to be to find all occurences of an 
ampersand that is not 
already part of an html escape sequence.

Looking at the group "([0-9a-fA-F]+|\w+)" it seems that [0-9a-f-A-F] is nothing 
more than a subset 
of \w, this is equivalent to "\w+" - the group cannot be captured anyways due 
to 
RegexOptions.ExplicitCapture.

When looking at the remaining Regex, more becomes obvious: "[xX]?\w" - As far 
as I know, html escape 
characters contain either a 'x' to signify a hex sequence or an '#' to signify 
a decimal sequence, 
therefore allowing us to write "[#xX]?", which is not the same as the original, 
but in fact reduces 
the number of false positives (e.g. &#x77; is not a valid html escape sequence 
if I am not 
mistaken).

Following these considerations, the pattern @"&(?![#xX]?\w+;)" should work at 
least as well as the 
original, while being easier to read and understand in my opinion. Also,

Original issue reported on code.google.com by Shio...@gmail.com on 31 Dec 2009 at 6:32

Google Code Exporter · Answer 1 · Fri Jul 03 2015 15:57:24 GMT+0800 (China Standard Time)

I think there are some misunderstandings here..

> [0-9a-f-A-F] is equivalent to \w+

no, because \w matches _ and A-Za-z

> the group cannot be captured anyways due to RegexOptions.ExplicitCapture

The match still occurs, you just can't get to it by saying $1 or $2 or $3 etc

> w is not a valid html escape sequence

have you tried it? produces "w" in Firefox 3.5, IE8, Chrome3..

I don't think there's any evidence this regex is wrong, as it is part of the 
core
Markdown.pl library. Can you provide actual test cases where it produces 
incorrect
results?

Original comment by wump...@gmail.com on 31 Dec 2009 at 8:04

Google Code Exporter · Answer 2 · Fri Jul 03 2015 15:57:24 GMT+0800 (China Standard Time)

I may be stupid but I do not find the line
> [0-9a-f-A-F] is equivalent to \w+
in the issue at all... Rather I am saying that [0-9a-fA-F] is INCLUDED in \w, 
therefore [0-9a-fA-F]+ is included in \w+ as well.

Assuming this is correct, the union of those sets can be computed without even 
knowing the sets, since the union of a set s with a subset of s is 
always s. This allows us the following transform:

([0-9a-fA-F]+|\w+) => (\w+)

According to Microsoft ( http://msdn.microsoft.com/en-us/library/20bw873z.aspx 
) the \w character class contains "Letter, Uppercase", "Letter, 
Lowercase" and "Number, Decimal Digit" - so I am quite certain that the 
transform is in deed equivalent to the original regular expression.

Concerning the group: I never said anything about the match not occuring, I 
said "the group cannot be captured anyways", which I would consider 
being quite true for a non-capturing group... RegexOptions.ExplicitCapture 
allows the usage of unnamed parentheses as non-capturing groups, see 
there: 
http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regexopti
ons.aspx
Since we are dealing with a unnamed group, it is non-capturing under this 
option. It also does not do any "grouping" with respect to operators in 
this case, since it only contains \w+, so it can simply be left out, dealing 
the transform:

(\w+) => \w+



About the escape sequences:
After looking through the html spec, it seems I was indeed mistaken and 
remembered it exactly the wrong way around:
The HTML 4.01 spec says in section 5.3 ( 
http://www.w3.org/TR/1999/REC-html401-19991224/charset.html ) allows the 
following two representations in 
addition to named entities:
"The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 
decimal character number D.
The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the 
ISO 10646 hexadecimal character number H. Hexadecimal numbers in 
numeric character references are case-insensitive."

XHTML 1.0 ( http://www.w3.org/TR/2002/REC-xhtml1-20020801/ ) notes in section 
4.12:
"SGML and XML both permit references to characters by using hexadecimal values. 
In SGML these references could be made using either &#Xnn; or 
&#xnn;. In XML documents, you must use the lower-case version (i.e. &#xnn;)"

Therefore, in my opinion, the "correct" way to formulate the inner part of the 
regex is:

(#[xX]?)?\w+;

However, it does not really matter anyway, since this essentially means:

{all strings consisting only of word characters} \ {"#;", "#x;", "#X;"}

more interesting is the following version, that follows the specs rather 
closely:

(#[0-9]+|([xX][a-fA-F0-9]))|([a-zA-Z0-9]+)

Named character entities in HTML 4.01 ( 
http://www.w3.org/TR/1999/REC-html401-19991224/sgml/entities.html ) and XHTML 1 
( 
http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#h-A2 ) specs consist of 
english letters (the first is oftentimes uppercase) and in some 
cases numbers (e.g. <!ENTITY frac34 "¾"> <!-- vulgar fraction three quarters = 
fraction three quarters, U+00BE ISOnum --> )



"I don't think there's any evidence this regex is wrong, as it is part of the 
core Markdown.pl library."
The first part is not about it being "wrong" in any kind, just about doing 
equivalent transforms that do not change the meaning, while making it 
more easy to understand just what it does.
The second part however, is about what the regex is supposed to do: Find 
ampersands that are not part of an html escape sequence. The idea seems to 
be to escape any ampersand that is supposed to be one, while still allowing the 
use of escape sequences. The number of false positives will never 
be zero due to the simple fact that the different specs for (x)html differ on 
this point, but it CAN be reduced. If the design intention is to not 
escape all ampersands that are part of something that might be mistaken for an 
escape sequence: well, that is a design decision...
To ensure that I am not mistaken about html escape sequences again, I have 
tested &x77; in all browsers on my system (chrome, firefox, safari, ie 
and opera) and I DO get &x77; if I use that "escape sequence" - and not a "w". 
Also, with the original regex &x77; is not escaped...

Original comment by Shio...@gmail.com on 31 Dec 2009 at 9:28

Google Code Exporter · Answer 3 · Fri Jul 03 2015 15:57:24 GMT+0800 (China Standard Time)

I see your point on ([0-9a-fA-F]+|\w+) and I agree the other form is more 
accurate.
But there is a documentation reason it is written that way..

Two clarifications:

1) You know this is a *negative* lookahead, yes?

&(?!)

meaning, match '&' but ONLY if the rest isn't matched ahead of it.

2) it can also negatively match … not just w and m

I think the |\w+ was an attempt to document the the … case. Though as you
pointed out I doubt any named entities have _ in them, or numbers for that 
matter.

As I've said in other requests to change the regexes, there are only two 
reasons to
do so:

a) performance improves

b) fixes a bug

change for the sake of change is bad because

c) it makes us deviate from the Perl standard reference implementation regexes

which would be OK if a) or b) were true.. but they are not in this case.

Original comment by wump...@gmail.com on 1 Jan 2010 at 12:48

Google Code Exporter · Answer 4 · Fri Jul 03 2015 15:57:24 GMT+0800 (China Standard Time)

Original comment by wump...@gmail.com on 1 Jan 2010 at 12:52

Changed state: WontFix

Google Code Exporter · Answer 5 · Fri Jul 03 2015 15:57:24 GMT+0800 (China Standard Time)

1) Of course.
2) The point being that it DOES negatively match &x77; - although it should 
not. 
Named entities do have numbers, (sadly) there is i.e. frac34...
The amount of false negatives you get it humungous, but I can understand 
implementing 
a regex like (nbsp|cent|pound|copy|brvbar|......) would be most annoying (and 
slow, 
too). However, there are no entities starting with a number, nor are there any 
starting with a hash (except for the two types of numeric ones of course).

Therefore

@"&(?!(#[0-9]+)|(#[xX][a-fA-F0-9])|([a-zA-Z][a-zA-Z0-9]*);)"

drastically reduces the number of false negatives (being false positives of the 
lookahead). Maybe most importantly it catches &x77; or &172; ones, which might 
happen.

Therefor I would indeed consider this to be of kind b) - however, if being 
close to 
the perl implementation is more important than catching these kind of errors, 
well, 
that is a design decision ;)

Original comment by Shio...@gmail.com on 1 Jan 2010 at 1:27

Google Code Exporter · Answer 6 · Fri Jul 03 2015 15:57:25 GMT+0800 (China Standard Time)

I appreciate the contribution.

I will definitely keep this in mind -- the goal at the moment is to match 
markdown.pl
1.0.2b8 very closely. If in the future we start deviating (and we might, since 
John
Gruber seems utterly uninterested in any public work on this since 2004), then I
agree this is a better and more accurate regex.

Still kind of a narrow and extremely minor "bug", though. I'd prefer some help 
with a
bigger and much more severe bug -- the HTML block matching algorithms in 
1.0.2b8 :)

Original comment by wump...@gmail.com on 1 Jan 2010 at 2:01

Google Code Exporter · Answer 7 · Fri Jul 03 2015 15:57:25 GMT+0800 (China Standard Time)

ok, now that we have been forced to absorb parts of PHP Markdown (because it is
better maintained), and thus deviating from the "canon" of the 1.0.2b8 Perl 
version
.. I am open to making this small change. reopening

Original comment by wump...@gmail.com on 6 Jan 2010 at 11:47

Changed state: Accepted

Google Code Exporter · Answer 8 · Fri Jul 03 2015 15:57:25 GMT+0800 (China Standard Time)

checked into r81

Original comment by wump...@gmail.com on 7 Jan 2010 at 12:13

Changed state: Fixed