peggyjs / peggy

Peggy: Parser generator for JavaScript

Home Page: https://peggyjs.org/

Invalid character range: \x80-\x10fffd

jorbuedo opened this issue

Hi, I'm trying to parse some CDDL, whose spec includes the wide character range \x80-\x10fffd: https://www.rfc-editor.org/rfc/rfc8610.html

It throws an "Invalid character range" error, apparently because the comparison uses only the first character:
https://github.com/peggyjs/peggy/blob/f3235636b1be6a9944fa8b4088f5ec45966dedef/src/parser.pegjs#L398C10-L398C10

It's like saying 80 > 100 because 8 > 1. Could support be added for ranges beyond one byte?

Peggy matches one UTF-16 code unit at a time, not one byte or one code point at a time. A code point above U+FFFF is two code units (a surrogate pair), so you want something like:

PCHAR 
  = [\x20-\x7E] 
  / [\x80-\ud7ff]
  / [\ue000-\uffff]
  / $([\ud800-\udbff] [\udc00-\udfff]) // Surrogate pair

Or better:

PCHAR 
  = [\x20-\x7E\x80-\ud7ff\ue000-\uffff]
  / $([\ud800-\udbff] [\udc00-\udfff]) // Surrogate pair
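As a rough way to check these ranges outside of Peggy, the same alternation can be written as a plain JavaScript regular expression without the `u` flag, which also works on UTF-16 code units. This is only a sketch: `PCHAR_RE` and `isPchar` are hypothetical names, and the regex mirrors the grammar above by hand rather than being generated by Peggy.

```javascript
// Hypothetical helper mirroring the PCHAR rule; not part of Peggy.
// No `u` flag, so the regex operates on UTF-16 code units.
const PCHAR_RE =
  /^(?:[\x20-\x7E\x80-\ud7ff\ue000-\uffff]|[\ud800-\udbff][\udc00-\udfff])$/;

function isPchar(s) {
  return PCHAR_RE.test(s);
}

console.log(isPchar("A"));          // true
console.log(isPchar("\u00e9"));     // true
console.log(isPchar("\u{1F600}"));  // true  (one code point, two code units)
console.log(isPchar("\x7f"));       // false (DEL is excluded)
console.log(isPchar("\ud800"));     // false (lone high surrogate)
```

The second alternative accepts any high surrogate followed by any low surrogate, so it covers U+10000 through U+10FFFF, slightly more than the spec's upper bound of 10FFFD.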

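Also note that \xHH takes exactly two hex digits, so it can only express code units below 256, and \uHHHH always takes exactly four hex digits. \x10fffd is therefore read as \x10 followed by the literal characters fffd, which makes \x80-\x10 a descending range and triggers the error.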
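JavaScript string literals use the same two-digit `\xHH` and four-digit `\uHHHH` escapes, so the effect can be seen without running Peggy at all. The sketch below also shows how a code point above U+FFFF maps to a surrogate pair; `toSurrogates` is a hypothetical helper written for illustration, not a Peggy API.

```javascript
// "\x10fffd" is the escape \x10 followed by the four literal characters "fffd".
const s = "\x10fffd";
console.log(s.length);                      // 5
console.log(s.charCodeAt(0).toString(16));  // 10

// Hypothetical helper: write a code point in U+10000..U+10FFFF
// as the two \uHHHH escapes of its UTF-16 surrogate pair.
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return "\\u" + high.toString(16) + "\\u" + low.toString(16);
}

console.log(toSurrogates(0x10000));   // \ud800\udc00
console.log(toSurrogates(0x10fffd));  // \udbff\udffd
```

The two results bound the surrogate-pair alternative in the grammar: the high unit runs from \ud800 to \udbff and the low unit stays within \udc00 to \udfff.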

I hope this worked for you. If not, please re-open this issue.