Escaping meta-characters in regular expressions - not handled.

Question

bFraley opened this issue 10 years ago · comments

I have written quite a lot below, so before I complicate the issue, the problem seems to be:
Escaped characters within regular expression syntax are not being recognized,
obviously they're being escaped - even in brackets or as strings, or when escaping the escape.

I have put in a few hours researching this, but I am inexperienced using RegExp for anything other than simple character or word matching. Also, my prior use has been mainly in a browser environment.

For these reasons, I am making no assumptions or conclusions as to where the heart of this bug is.
After researching and testing many approaches for use of escape characters in a regex, as well as within uses for .split(), I can conclude that when looking for \n or \s, or even when escaping the escape character as in \\n, or within brackets like /[\n]/, or /[\\n]/, no approach will return a correct split at a new line character, or space, or other escaped characters.

String literals work in String.split just fine. "hello\nhello on next line".split("\n") works.
String.match seems to work just fine as well.

The problem appears when a regex object or literal, such as RegExp("/ab\n/"), is used with meta-characters that need escaping. That is what doesn't work properly.

I am too unfamiliar with the regexp.js code and some of the more advanced regex syntax to offer much more on this issue. Although, I have tried many types of syntax for escaping these characters, and they work in a browser environment.

This issue could go beyond edge cases not handled within Higgs, but may even be related to POSIX specific standards for regex.

It seems that handling any and all cases for regex standards is a large and complicated endeavor,
but perhaps we can get to where Higgs runs the regexp benchmark. Extreme cases may just have to be hacked out by the user, as seems to be necessary in many languages and regex engines anyway. Perhaps there is a simpler solution or cause to the problem that is staring me in the face, but I have no idea.

I am leaving this for the pros to disinfect! @maximecb @tach4n

Here are some links that may be of interest:
http://www.regular-expressions.info/posixbrackets.html
http://www.regular-expressions.info/refcharclass.html

Do we 'do' brackets ?

"Generally, only POSIX-compliant regular expression engines have proper and full support for POSIX bracket expressions. Some non-POSIX regex engines support POSIX character classes, but usually don't support collating sequences and character equivalents. Regular expression engines that support Unicode use Unicode properties and scripts to provide functionality similar to POSIX bracket expressions."

Does Higgs handle escaping arithmetic characters in regular expressions ?

"If you want to use any of these characters as a literal in a regex, you need to escape them with a backslash. If you want to match 1+1=2, the correct regex is 1\+1=2. Otherwise, the plus sign has a special meaning."

I will explore this further and offer more code, as my time can permit.

Anyone going in for the kill ?

Brett · Answer 1 · Sat Aug 16 2014 00:23:54 GMT+0800 (China Standard Time)

err! had to escape the escape in the 1+1=2 example for it to display correctly...as it goes
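To make the 1+1=2 example concrete, here is a minimal sketch of how a standards-conforming engine such as V8 treats the escaped and unescaped forms. The variable names are illustrative only and do not come from Higgs.

```javascript
// Unescaped: "+" means "one or more of the preceding character",
// so this pattern matches runs such as "11=2" or "111=2",
// but not the literal text "1+1=2".
var unescaped = /1+1=2/;

// Escaped: "\+" matches a literal plus sign.
var escaped = /1\+1=2/;

var unescapedOnLiteral = unescaped.test("1+1=2"); // false
var escapedOnLiteral = escaped.test("1+1=2");     // true
var unescapedOnRun = unescaped.test("111=2");     // true
var escapedOnRun = escaped.test("111=2");         // false
```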

Maxime Chevalier-Boisvert · Answer 2 · Sat Aug 16 2014 01:39:50 GMT+0800 (China Standard Time)

Did you test this on Google V8?

I suspect the problem is either in the lexer, or in the regex engine itself. Possibly, the string the lexer passes to the regex engine is escaped too early, or not escaped properly. Would need to log what goes into the regexp constructor, maybe try building regexp with RegExp("str") and see if that behaves differently.
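One way to check what actually reaches the regex engine is to compare the source property of a literal with that of a regexp built from a string. The sketch below shows the values a standards-conforming engine reports; the variable names are assumptions for illustration, and a mismatch under Higgs would point at the lexer or the constructor.

```javascript
var literal = /\s/;                   // regex literal
var fromString = new RegExp("\\s");   // string pattern, backslash doubled
var underEscaped = new RegExp("\s");  // "\s" in a string literal is just "s"

// source holds the pattern text the engine received.
var literalSource = literal.source;           // the two characters \s
var fromStringSource = fromString.source;     // also \s
var underEscapedSource = underEscaped.source; // the single character s

// Only the first two patterns match whitespace.
var literalSplit = "hello hello".split(literal);          // ["hello", "hello"]
var underEscapedSplit = "hello hello".split(underEscaped); // ["hello hello"]
```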

If you have time, could you send a PR with some regexps that don't behave properly? We could add that to our regression tests. As usual, thanks for taking the time to investigate things. These shameful bugs need fixing!

Brett · Answer 3 · Mon Aug 25 2014 06:55:48 GMT+0800 (China Standard Time)

I've tested in Chrome and Node. Node uses V8 as-is, doesn't it? There shouldn't be a difference in results.

With the variations in ways to form a regex, Higgs works as expected in many cases.
I believe I've narrowed this bug down to its roots. I have used the \s (space) character here, but the results are the same for other escaped special characters.

While below I've used new RegExp( regex ), if you use a plain statement like:
x = / \s / or y = /\n/ Then all is well, until you check the length of the array returned.
It will return a length of 1, even with using the g or + modifiers like: /\s/g or /\s+/.

var spaces1 = new RegExp(/\s/);
var spaces2 = new RegExp("\\s"); // Should never even work, but see spaces2 notes below
var spaces3 = new RegExp(/\s/g);

Shown below, when using .split() with the test pattern spaces1, Higgs throws an error.
The same error is thrown on spaces3. This is because the g modifier is being ignored, or missed, and it is seen by Higgs as exactly the same as the pattern for spaces1.

For .match(), Higgs returns the expected results for spaces1 and spaces2, with no errors.
However, again when the g modifier is used as in spaces3 - .match() will only return the first match found, ignoring the modifier.

var hello = "hello hello hello";

hello.split(spaces1);

TypeError: toString produced non-primitive value
$rt_throwExc(2EF3) ("/etc/higgs/runtime/runtime.js"@134:1)
$rt_toString(2F4D) ("/etc/higgs/runtime/runtime.js"@424:31)
$rt_add(2FB0) ("/etc/higgs/runtime/runtime.js"@762:27)
string_split(7EA0) ("/etc/higgs/stdlib/string.js"@605:15)
repl(1E2EA) ("repl"@1:1)

hello.split(spaces2);

 /* @spaces2  Does not split at characters, returns array length of 1.
This is because it isn't even searching with a pattern.
This pattern should not work, but it does in V8 because it returns the
 literal regex: / \s / from the new RegExp call. */

hello.split(spaces3)    // ==>  Same error message as spaces1
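For reference, this is a sketch of what the same three patterns produce in a standards-conforming engine such as V8, which could serve as a baseline for regression tests. The result variable names are illustrative and not part of any Higgs test suite.

```javascript
var hello = "hello hello hello";

var spaces1 = new RegExp(/\s/);   // regexp passed to the constructor
var spaces2 = new RegExp("\\s");  // string pattern, backslash doubled
var spaces3 = new RegExp(/\s/g);  // the g flag carries over from the literal

// split() returns three parts for all three patterns.
var split1 = hello.split(spaces1); // ["hello", "hello", "hello"]
var split2 = hello.split(spaces2); // ["hello", "hello", "hello"]
var split3 = hello.split(spaces3); // ["hello", "hello", "hello"]

// match() without g returns only the first match, with its index.
var firstOnly = hello.match(spaces1);  // [" "], firstOnly.index is 5

// match() with g returns every match.
var allMatches = hello.match(spaces3); // [" ", " "]
```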

I am still trying to understand this myself. I've been through a lot of code, trying to simplify the problem. There is ambiguity over the string vs. object representation of regular expression patterns, and over the points where a pattern is converted from one to the other in Higgs, as @maximecb mentioned above.

The regex examples above all work in Chrome / V8 / Node / Firefox.

I will certainly get some working regression test code going.

Brett · Answer 4 · Mon Aug 25 2014 06:57:15 GMT+0800 (China Standard Time)

Everything above can be summed up as:

Figure out:

TypeError: toString produced non-primitive value
Capturing modifiers needs to be fixed. The V8 regexp benchmark halts when it hits the first /g.
The error is: RegExp: global property not defined "g"
Then we can see if splitting at escaped characters will produce correct array lengths.

Maxime Chevalier-Boisvert · Answer 5 · Tue Aug 26 2014 02:04:08 GMT+0800 (China Standard Time)

I'm confused myself, I think there's a few separate issues here. Potentially something wrong in the String.prototype.split implementation, and something wrong with the global modifier.

If you have time, maybe try finding the simplest one-liner expressions that work in V8 but fail in Higgs.

Is it possible that I screwed up the regex literal parsing and am not capturing the modifier properly? That would be in the lexer code.

yawnt · Answer 6 · Sun Sep 21 2014 02:28:22 GMT+0800 (China Standard Time)

okay, so.. first bug is in String.prototype.split (precisely here).
String.prototype.split should accept a) a string b) a RegExp .. but that line coerces the regex to a string which is why

node> "hello hello".split(/\s/)
[ 'hello', 'hello' ]
higgs> "hello hello".split(/\s/)
[ 'hello hello' ]

and

higgs> "hello\\shello".split(/\s/)
[ 'hello', 'hello' ]

(because /\s/ coerced to a string is \s)
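A minimal sketch of the idea, assuming a hypothetical standalone function splitSketch rather than the actual code in Higgs' string.js: check whether the separator is a RegExp before falling back to string coercion. It ignores the limit argument, capture groups, and empty separators, so it is not a full implementation of String.prototype.split.

```javascript
// Hypothetical helper, not the Higgs implementation.
function splitSketch(str, separator) {
    var parts = [];
    var start = 0;

    if (separator instanceof RegExp) {
        // Copy the pattern with the g flag so exec() advances through the string.
        var flags = "g" +
            (separator.ignoreCase ? "i" : "") +
            (separator.multiline ? "m" : "");
        var re = new RegExp(separator.source, flags);
        var m;
        while ((m = re.exec(str)) !== null) {
            if (m[0].length === 0) {
                // Skip empty matches so the loop always makes progress.
                re.lastIndex++;
                continue;
            }
            parts.push(str.slice(start, m.index));
            start = m.index + m[0].length;
        }
    } else {
        // Only non-regexp separators are coerced to a string.
        var sep = String(separator);
        if (sep.length === 0) {
            return [str]; // empty separators are out of scope for this sketch
        }
        var idx;
        while ((idx = str.indexOf(sep, start)) !== -1) {
            parts.push(str.slice(start, idx));
            start = idx + sep.length;
        }
    }

    parts.push(str.slice(start));
    return parts;
}

var byRegExp = splitSketch("hello hello", /\s/);       // ["hello", "hello"]
var byString = splitSketch("hello\\shello", "\\s");    // ["hello", "hello"]
var noWhitespace = splitSketch("hello\\shello", /\s/); // ["hello\\shello"]
```

With the type check in place, the regexp separator splits on whitespace, and the literal backslash-s text is only split when the separator really is that string.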

i'll keep working on the other sides of this bug and prepare a PR

Brett · Answer 7 · Sun Sep 21 2014 02:43:49 GMT+0800 (China Standard Time)

h> hello ="hello hello".split(/\s/);
array
h> hello.length;
1
h> hello = "hello\nhello\nhello".split(/\n/);
array
h> hello.length;
1
h>

Brett · Answer 8 · Sun Sep 21 2014 04:22:20 GMT+0800 (China Standard Time)

@yawnt @maximecb I can confirm that PR #143 fixes string.split, and even + and g modifiers.

EDIT initial example: Shouldn't work anyway.
new RegExp("in quotes isn't valid syntax");

The concern is:
new RegExp( )

h> var reg = new RegExp(/\s/);      
h> hello.split(reg);

heap space exhausted, expanding heap to 32MiB
heap space exhausted, expanding heap to 64MiB
heap space exhausted, expanding heap to 128MiB
^Z
[1]+  Stopped                 higgs

Maxime Chevalier-Boisvert · Answer 9 · Sun Sep 21 2014 04:26:15 GMT+0800 (China Standard Time)

There must be some sort of infinite loop/recursion happening?

The fix for the latter would probably be to return the regexp directly when one is passed to the RegExp constructor.

yawnt · Answer 10 · Sun Sep 21 2014 19:04:19 GMT+0800 (China Standard Time)

it's the opposite.. new RegExp("ohai") is valid, while new RegExp(/ohai/) is not

and new RegExp("string") actually works (remember you have to double escape backslashes)

h> x = "hello hello hello"
h> x.split(new RegExp("\\s"))
["hello","hello","hello"]

Brett · Answer 11 · Mon Sep 22 2014 00:02:42 GMT+0800 (China Standard Time)

Thanks @yawnt. I was mistaken because new RegExp(/\s/) works correctly in browsers, v8, and node.
I did have it backwards. So, new RegExp("/\s/") is literally searching for "/ /" in a string!

Your PR seems to fix this issue entirely then.
The incorrect new RegExp(/not a string/) will need its own minor fix, or an error message?

Maxime Chevalier-Boisvert · Answer 12 · Mon Sep 22 2014 00:07:56 GMT+0800 (China Standard Time)

As I was saying, new RegExp(/not a string/) should return the regexp expression directly. It's a very easy fix.
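A rough sketch of that fix, written as a hypothetical wrapper named makeRegExp rather than the real Higgs constructor: when the argument is already a regexp and no flags are given, hand it back unchanged. Note that returning the same object is a simplification. In standard engines, new RegExp(re) produces a distinct object with the same source and flags, while calling RegExp(re) without new returns re itself.

```javascript
// Hypothetical wrapper, not the Higgs RegExp constructor.
function makeRegExp(pattern, flags) {
    if (pattern instanceof RegExp) {
        if (flags === undefined) {
            // Already a regexp: return it directly, as suggested above.
            return pattern;
        }
        // Explicit flags: rebuild from the existing pattern text.
        return new RegExp(pattern.source, flags);
    }
    // Anything else is treated as a string pattern.
    return new RegExp(String(pattern), flags);
}

var existing = /\s/g;
var sameObject = makeRegExp(existing) === existing;           // true
var parts = "hello hello hello".split(makeRegExp(/\s/));      // ["hello", "hello", "hello"]
var fromString = makeRegExp("\\s");                           // equivalent to /\s/
```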