kswoll / npeg

This parser is an implementation of a Packrat Parser with support for left recursion. The algorithm for left recursion is a modified version of the one described in "Packrat Parsers Can Support Left Recursion".

Problem with Repeat

xanatos opened this issue

I'm trying to create an importer for the Unicode NamesList.txt (the grammar is at http://www.unicode.org/Public/UNIDATA/NamesList.html and the data at https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt). I've been able to do it using your peg 2.0.0 library, but I ran into two things that I had to work around; they are either bugs in the library or features I didn't understand.

The Repeat() extension method seems to "eat" a character even when it fails (it doesn't backtrack).

Source code: KlcImporter.zip

To reproduce: run the program. It will generate an output.txt. Compare it with the original NamesList.txt (for example with WinMerge); the two files should be identical. Now replace

    public virtual Expression Char() => X() + X() + X() + X() + ~(X() + ~X());

with

    public virtual Expression Char() => X().Repeat(4, 6);

Re-run the program. Now the files differ: "Danish, Norwegian, Swedish, Walloon" becomes "anish, Norwegian, Swedish, Walloon". A single character is "eaten" by Repeat() in the ExpandLineContainerElement() expression, and it is not returned to the stream when Repeat() fails. The sequence appears to be: EscChar() fails; Char() begins, reads one character, then fails without backtracking over the character it read (the bug); String() then reads one character too few.
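To make the expected behaviour concrete, here is a minimal, self-contained sketch of a bounded repeat that backtracks on failure. It does not use npeg's API: `RepeatSketch`, `Repeat` and `UpperHexDigit` are hypothetical names, positions are plain `int` offsets into a string, and `X()` is assumed to match a single uppercase hexadecimal digit (which would explain why only the "D" of "Danish" is consumed).

```csharp
using System;

// Hypothetical sketch, not npeg code: matchers take (input, position) and
// return the position after the match, or -1 if they fail.
public static class RepeatSketch
{
    // Tries to match `item` at least `min` and at most `max` times from `pos`.
    // On failure it returns -1 and the caller keeps its original `pos`,
    // so nothing consumed by the partial attempt is lost.
    public static int Repeat(string input, int pos,
                             Func<string, int, int> item, int min, int max)
    {
        int current = pos; // working copy; `pos` itself is never modified
        int count = 0;

        while (count < max)
        {
            int next = item(input, current);
            if (next < 0)
            {
                break; // item failed: stop repeating, keep what matched so far
            }
            current = next;
            count++;
        }

        // Too few repetitions: report failure instead of a partial advance.
        return count >= min ? current : -1;
    }

    // Assumed stand-in for X(): one uppercase hexadecimal digit.
    public static int UpperHexDigit(string input, int pos)
    {
        if (pos >= input.Length)
        {
            return -1;
        }
        char c = input[pos];
        bool isHex = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'F');
        return isHex ? pos + 1 : -1;
    }
}
```

With this sketch, `RepeatSketch.Repeat("Danish", 0, RepeatSketch.UpperHexDigit, 4, 6)` matches only the "D", returns -1, and the next alternative still starts at offset 0, whereas `RepeatSketch.Repeat("00E5\tname", 0, RepeatSketch.UpperHexDigit, 4, 6)` returns 4.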

P.S. I've added an extension overload:

    static CharacterSet To(this char from, char to, Func<char, bool> predicate)

I feel it would be a good addition to the ones you already implement. It is useful for implementing rules like "<sequence of characters in the range U+0020..U+02FF, except controls>".
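For illustration only, a filtered range of this kind could be sketched as below. The real overload lives in the attached project and returns npeg's `CharacterSet`; since that type's construction is not shown here, this hypothetical version (`CharRangeSketch.To`) returns a plain `IEnumerable<char>` instead, and the use of `char.IsControl` as the predicate is an assumption about what "except controls" means.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the overload from the attachment: it yields the
// characters of an inclusive range that satisfy a predicate.
public static class CharRangeSketch
{
    public static IEnumerable<char> To(this char from, char to,
                                       Func<char, bool> predicate)
    {
        // Iterate with an int so the loop also terminates when
        // `to` is char.MaxValue (a char counter would wrap around).
        for (int code = from; code <= to; code++)
        {
            char c = (char)code;
            if (predicate(c))
            {
                yield return c;
            }
        }
    }
}
```

The rule quoted above would then read `' '.To('\u02FF', c => !char.IsControl(c))`, which skips U+007F through U+009F and keeps the rest of the range.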