Invalid character range: \x80-\x10fffd
jorbuedo opened this issue
Hi, I'm trying to parse some CDDL, which has this wide range of characters, \x80-\x10fffd, in its spec: https://www.rfc-editor.org/rfc/rfc8610.html
It throws an "Invalid character range" error, since the comparison is just using the first character:
https://github.com/peggyjs/peggy/blob/f3235636b1be6a9944fa8b4088f5ec45966dedef/src/parser.pegjs#L398C10-L398C10
It's like saying 80 > 100 because 8 > 1. Could support be added for ranges outside 1 byte?
Peggy matches one UTF-16 code unit at a time, not a byte or a codepoint at a time. Therefore, you want something like:
PCHAR
  = [\x20-\x7E]
  / [\x80-\ud7ff]
  / [\ue000-\uffff]
  / $([\ud800-\udbff] [\udc00-\udfff]) // Surrogate pair
Or better:
PCHAR
  = [\x20-\x7E\x80-\ud7ff\ue000-\uffff]
  / $([\ud800-\udbff] [\udc00-\udfff]) // Surrogate pair
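To illustrate why the surrogate-pair alternative is needed, here is a minimal sketch in plain JavaScript (not Peggy itself). It mirrors the rule above with a regular expression that has no `u` flag, so it also works one UTF-16 code unit at a time. The names `PCHAR_RE` and `matchPchar` are hypothetical and exist only for this example.

```javascript
// Without the `u` flag, a JavaScript regex matches UTF-16 code units,
// which is the same granularity the character classes above work at.
const PCHAR_RE = /^(?:[\x20-\x7E\x80-\ud7ff\ue000-\uffff]|[\ud800-\udbff][\udc00-\udfff])/;

// Returns the PCHAR matched at the start of `input`, or null if there is none.
function matchPchar(input) {
  const m = PCHAR_RE.exec(input);
  return m === null ? null : m[0];
}

// U+10FFFD, the upper bound from the issue title, takes two code units.
const top = String.fromCodePoint(0x10fffd);
console.log(top.length);              // 2
console.log(matchPchar(top) === top); // true

console.log(matchPchar("A"));      // A
console.log(matchPchar("\x7f"));   // null (DEL is outside \x20-\x7E)
console.log(matchPchar("\ud800")); // null (lone high surrogate)
```

A single character class can never match `top`, because each class consumes exactly one code unit; the two-class sequence is what covers code points above U+FFFF.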
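Also note that \xHH is for code units < 256 with exactly two hex digits, and \uHHHH always has four hex digits. So \x10fffd is read as \x10 followed by the literal characters fffd, which makes the range \x80-\x10 run backwards.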
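JavaScript string literals use the same fixed-width \xHH and \uHHHH escapes, so the effect can be checked outside Peggy. The following is a small sketch with illustrative variable names only; the final comparison imitates the kind of first-code-unit check described in the issue and is not Peggy's actual code.

```javascript
// "\x10fffd" is the two-digit escape \x10 followed by the four literal
// characters "fffd", not the single code point U+10FFFD.
const misread = "\x10fffd";
console.log(misread.length);                     // 5
console.log(misread.charCodeAt(0).toString(16)); // 10

// The intended character needs two UTF-16 code units (a surrogate pair).
const intended = String.fromCodePoint(0x10fffd);
const units = Array.from({ length: intended.length }, (_, i) =>
  intended.charCodeAt(i).toString(16)
);
console.log(units.join(" ")); // dbff dffd

// Comparing the start of the range with the misread end: 0x80 > 0x10,
// so the range looks backwards.
console.log("\x80".charCodeAt(0) > misread.charCodeAt(0)); // true
```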
I hope this worked for you. If not, please re-open this issue.