Wrong regular expressions with regard to unicode codepoints

Question

pemistahl opened this issue 5 years ago

Hi, even though I doubt that you will provide bug fixes and updates after two years of inactivity, I want you to know about the following issues with escaped and unescaped unicode codepoints:

// 1. correct
Input : "I ♥ cake"
Output: /I \u2665 cake/
Proof : "I ♥ cake".match(/I \u2665 cake/)
Result: Array [ "I ♥ cake" ] // OK

// 2. correct
Input : "I \\u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \\u2665 cake".match(/I \\u2665 cake/)
Result: Array [ "I \\u2665 cake" ] // OK

// 3. failure
Input : "I \u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \u2665 cake".match(/I \\u2665 cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 4. failure
Input : "I \u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 5. failure
Input : "I \\u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \\u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \\u\{2665\} cake/
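
Cases 3 and 5 can be reproduced in plain JavaScript without regexgen. The following is a minimal sketch: String.raw stands in for text that contains a literal backslash, and all variable names are illustrative only.

```javascript
// Case 3: in JavaScript source, "\u2665" is already the single character ♥,
// so a pattern looking for a literal backslash followed by "u2665" cannot match.
const heart = "I \u2665 cake";
const literalBackslashPattern = /I \\u2665 cake/;
const case3 = heart.match(literalBackslashPattern); // null

// Case 5: the text contains a literal backslash, then "u{2665}".
const rawBraces = String.raw`I \u{2665} cake`;

// Without the u flag, "u{2665}" means "the letter u repeated 2665 times".
const quantifierPattern = /I \\u{2665} cake/;
const case5 = rawBraces.match(quantifierPattern); // null

// Escaping the braces turns them into literal characters.
const escapedBracesPattern = /I \\u\{2665\} cake/;
const case5Fixed = rawBraces.match(escapedBracesPattern); // one match: the whole string
```

This is why the expected output for case 5 escapes the braces: unescaped, they are read as a quantifier on the preceding "u".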

Is there any chance for you to fix these issues? Thanks in advance.

Gilmore Davidson · Answer 1 · Sun Oct 06 2019 18:31:37 GMT+0800 (China Standard Time)

How are you providing the input? When I try the latest version, I see the correct result: https://runkit.com/embed/klngsd4jwj5m

const r1 = regexgen(["I \u2665 cake"]); // /I \u2665 cake/
console.log("I \u2665 cake".match(r1)); // ["I ♥ cake"]

As an aside, I don't think that opening with a passive-aggressive sentence on someone's spare-time project is the best way to get your open source issues looked at.

Peter M. Stahl · Answer 2 · Mon Oct 07 2019 02:58:25 GMT+0800 (China Standard Time)

@gilmoreorless I forgot to mention that I was using the CLI, which produces the erroneous results above.

$ regexgen "I \u{2665} cake"
/I \\u{2665} cake/

I'm sorry to disappoint you but I did not have any aggressive feelings when I opened this issue. I just uttered an assumption based on the fact that a lot of other open issues have not been dealt with for a long time. That's all, no emotions involved.

Gilmore Davidson · Answer 3 · Mon Oct 07 2019 18:21:03 GMT+0800 (China Standard Time)

Apologies for misreading your intent.

The command line usage makes more sense for this issue. I'd say the problem actually lies in the difference between strings in JavaScript and strings on the command line. When you run regexgen "blah" in the CLI, the "blah" string is first interpreted according to the shell's quoting rules, then passed to the Node process.
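
The difference can be illustrated from the JavaScript side. In this sketch, String.raw is used as a stand-in for an argument that the shell has passed through unchanged; it is an illustration, not how regexgen reads its input.

```javascript
// A JavaScript string literal: the parser turns \u2665 into ♥ before the code runs.
const fromLiteral = "I \u2665 cake";

// What a Node process receives when the shell leaves the text untouched.
// String.raw keeps the backslash, standing in for a process.argv entry here.
const fromArgv = String.raw`I \u2665 cake`;

console.log(fromLiteral.length);       // 8
console.log(fromArgv.length);          // 13
console.log(fromLiteral === fromArgv); // false
```

The two strings look identical when typed, but only the first one contains the heart character.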

Bash and most other shells apply different parsing rules to a string depending on the quoting mechanism. Specifically, for escape sequences such as \u to be interpreted, the string must be within single quotes preceded by a $ character, which Bash calls ANSI-C quoting (reference).

This can be shown by telling node to log out the arguments it receives:

$ node -e "console.log(process.argv)" "I \u2665 cake"
[ '/full/path/to/node',
  'I \\u2665 cake' ]

$ node -e "console.log(process.argv)" 'I \u2665 cake'
[ '/full/path/to/node',
  'I \\u2665 cake' ]

$ node -e "console.log(process.argv)" $"I \u2665 cake"
[ '/full/path/to/node',
  '$I \\u2665 cake' ]

$ node -e "console.log(process.argv)" $'I \u2665 cake'
[ '/full/path/to/node',
  'I ♥ cake' ]

Therefore the input string will have to be escaped in the same way for regexgen to receive it properly:

$ regexgen 'I \u2665 cake'
/I \\u2665 cake/

$ regexgen $'I \u2665 cake'
/I \u2665 cake/

Peter M. Stahl · Answer 4 · Thu Oct 10 2019 20:53:06 GMT+0800 (China Standard Time)

Thanks for the explanation, @gilmoreorless. But this is not nice. The CLI should take care of handling the quoting and escaping rules in the different shells. Is this possible? If so, any chance to fix this?
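
For illustration only, here is one way a CLI wrapper could decode such escapes itself before generating the pattern. This is a sketch under assumptions, not regexgen's actual behavior: decodeUnicodeEscapes is a hypothetical helper, and treating a doubled backslash as an escaped backslash is an assumed convention so that a literal "\u" can still be passed.

```javascript
// Hypothetical helper: decode \uXXXX and \u{X...} escapes in a raw CLI argument.
// A doubled backslash is treated as one literal backslash (assumed convention).
function decodeUnicodeEscapes(arg) {
  const escapePattern = /\\(\\|u\{([0-9a-fA-F]{1,6})\}|u([0-9a-fA-F]{4}))/g;
  return arg.replace(escapePattern, (match, body, braced, plain) => {
    if (body === "\\") return "\\";
    const codePoint = parseInt(braced || plain, 16);
    // Leave out-of-range values untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

// String.raw stands in for what the shell would hand to the Node process.
const decodedPlain = decodeUnicodeEscapes(String.raw`I \u2665 cake`);
const decodedBraced = decodeUnicodeEscapes(String.raw`I \u{2665} cake`);
const keptLiteral = decodeUnicodeEscapes(String.raw`I \\u2665 cake`);
```

The trade-off is that the tool would then apply its own escaping rules on top of the shell's, so users who want a literal backslash followed by "u" would have to double the backslash.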