devongovett / regexgen

Generate regular expressions that match a set of strings

Home Page: https://runkit.com/npm/regexgen

Wrong regular expressions with regard to unicode codepoints

pemistahl opened this issue · comments

Hi, even though I doubt that you will provide bug fixes and updates after two years of inactivity, I want you to know about the following issues with escaped and unescaped unicode codepoints:

// 1. correct
Input : "I ♥ cake"
Output: /I \u2665 cake/
Proof : "I ♥ cake".match(/I \u2665 cake/)
Result: Array [ "I ♥ cake" ] // OK

// 2. correct
Input : "I \\u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \\u2665 cake".match(/I \\u2665 cake/)
Result: Array [ "I \\u2665 cake" ] // OK

// 3. failure
Input : "I \u2665 cake"
Output: /I \\u2665 cake/
Proof : "I \u2665 cake".match(/I \\u2665 cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 4. failure
Input : "I \u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \u2665 cake/

// 5. failure
Input : "I \\u{2665} cake"
Output: /I \\u{2665} cake/
Proof : "I \\u{2665} cake".match(/I \\u{2665} cake/)
Result: null // OOPS! 
Expected Output: /I \\u\{2665\} cake/
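
Case 5 fails for a different reason than cases 3 and 4: in a regular expression without the `u` flag, `u{2665}` is the letter u followed by a quantifier, so the braces have to be escaped to match literally. A small sketch to illustrate this, using only the input and patterns quoted above:

```javascript
// The input contains the literal text \u{2665} (backslash, u, braces).
const input = "I \\u{2665} cake";

// Output reported in case 5: "\\" matches one backslash, then "u{2665}"
// is parsed as "the letter u repeated exactly 2665 times".
const generated = /I \\u{2665} cake/;

// Expected output: the braces are escaped, so they match literally.
const expected = /I \\u\{2665\} cake/;

console.log(generated.test(input)); // false
console.log(expected.test(input));  // true

// What the generated pattern actually matches:
const manyUs = "I \\" + "u".repeat(2665) + " cake";
console.log(generated.test(manyUs)); // true
```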

Is there any chance for you to fix these issues? Thanks in advance.

How are you providing the input? When I try the latest version, I see the correct result: https://runkit.com/embed/klngsd4jwj5m

const r1 = regexgen(["I \u2665 cake"]); // /I \u2665 cake/
console.log("I \u2665 cake".match(r1)); // ["I ♥ cake"]

As an aside, I don't think that opening with a passive-aggressive sentence on someone's spare-time project is the best way to get your open source issues looked at.

@gilmoreorless I forgot to mention that I was using the CLI, which produces the erroneous results above.

$ regexgen "I \u{2665} cake"
/I \\u{2665} cake/

I'm sorry to disappoint you but I did not have any aggressive feelings when I opened this issue. I just uttered an assumption based on the fact that a lot of other open issues have not been dealt with for a long time. That's all, no emotions involved.

Apologies for misreading your intent.

The command line usage makes more sense for this issue. I'd say the problem actually lies in the difference between strings in JavaScript and strings on the command line. When you run regexgen "blah" in the CLI, the "blah" string is first interpreted by the shell according to its own quoting rules, and only the result is passed to the Node process.

Bash and most other shells parse strings differently depending on the quoting mechanism. Specifically, for escape sequences such as \u to be interpreted, the string must use C-style quoting: single quotes preceded by a $ character (reference).

This can be shown by telling node to log out the arguments it receives:

$ node -e "console.log(process.argv)" "I \u2665 cake"
[ '/full/path/to/node',
  'I \\u2665 cake' ]

$ node -e "console.log(process.argv)" 'I \u2665 cake'
[ '/full/path/to/node',
  'I \\u2665 cake' ]

$ node -e "console.log(process.argv)" $"I \u2665 cake"
[ '/full/path/to/node',
  '$I \\u2665 cake' ]

$ node -e "console.log(process.argv)" $'I \u2665 cake'
[ '/full/path/to/node',
  'I ♥ cake' ]
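
The same difference can be reproduced inside JavaScript alone. This is only an illustrative sketch: `fromSource` and `fromShell` are names made up here, and `fromShell` simply writes out the string that the first three `process.argv` logs above show Node receiving.

```javascript
// In JavaScript source, the parser decodes \u2665 into a single character.
const fromSource = "I \u2665 cake";

// Without $'...' quoting, the shell passes the backslash through as-is,
// so Node receives the six characters \ u 2 6 6 5 instead of the heart.
const fromShell = "I \\u2665 cake";

console.log(fromSource.length);         // 8
console.log(fromShell.length);          // 13
console.log(fromSource === "I ♥ cake"); // true
console.log(fromSource === fromShell);  // false
```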

Therefore the input string has to be quoted in the same way for regexgen to receive the intended character:

$ regexgen 'I \u2665 cake'
/I \\u2665 cake/

$ regexgen $'I \u2665 cake'
/I \u2665 cake/

Thanks for the explanation, @gilmoreorless. But this is not nice. The CLI should take care of handling the quoting and escaping rules in the different shells. Is this possible? If so, any chance to fix this?
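
The shell resolves its quoting before the program starts, so a CLI cannot tell which quotes were typed. What it could do is decode escape text that arrives literally. The following is a minimal sketch of that idea, not regexgen's actual behaviour: `decodeUnicodeEscapes` is a hypothetical helper that a CLI might apply to each argument before generating the pattern.

```javascript
// Hypothetical helper (not part of regexgen): decodes literal \uXXXX and
// \u{X...} text that the shell passed through without interpreting it.
function decodeUnicodeEscapes(arg) {
  return arg.replace(
    /\\(\\)|\\u\{([0-9a-fA-F]{1,6})\}|\\u([0-9a-fA-F]{4})/g,
    (match, backslash, braced, fixed) => {
      // A doubled backslash becomes one literal backslash and is not
      // treated as the start of an escape.
      if (backslash) return "\\";
      const codePoint = parseInt(braced || fixed, 16);
      // Leave values outside the Unicode range untouched instead of throwing.
      return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
    }
  );
}

// Simulated arguments as Node would receive them from: regexgen "I \u2665 cake"
console.log(decodeUnicodeEscapes("I \\u2665 cake"));   // I ♥ cake
console.log(decodeUnicodeEscapes("I \\u{2665} cake")); // I ♥ cake
console.log(decodeUnicodeEscapes("I \\\\u2665 cake")); // I \u2665 cake
```

The trade-off of this design is that anyone who wants to match the literal text \u2665 would then have to double the backslash, which changes how existing inputs are interpreted.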