neogeny / TatSu

竜 TatSu generates Python parsers from grammars in a variation of EBNF

Home Page: https://tatsu.readthedocs.io/

Support repetition qualifiers for closures

mgrazebrook opened this issue · comments

Could you support:
rule = {expression}{7} ;
or
rule = {expression}{2,5} ;

Example from the re syntax:
https://docs.python.org/3/library/re.html#regular-expression-syntax, search for "Repetition qualifiers"

I'm sometimes parsing log files, textified PDFs, scanned docs or other things not designed to be parsed. One of the reasons I like TatSu for this is that you can be sure you really understood the format within a section, and can occasionally explain what you're doing to a non-programmer. In contrast, when I do the same with regular expressions, I sometimes find myself silently skipping bits (and it's very hard to read!). Such formats often have fixed numbers of repetitions, and it's interesting to know whether one's assumption about the number of repetitions always holds.

Also, one sometimes gets cases where exactly a repetitions are followed by up to b repetitions and then by c repetitions, with each group of a different kind. That is possibly a harder case to manage.

rule = {int}{4} {int}{2,4} {int}{2} ;

Of course I can just measure the list length in semantics, but I feel this is more properly part of the grammar. So this is low priority.
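For reference, the semantics-side workaround could look roughly like the sketch below. It assumes the usual TatSu convention that the semantics object has a method named after the grammar rule and receives that rule's AST; the class name `CountCheckSemantics`, the bounds 2 to 5, and the use of a plain `ValueError` are illustrative choices only, not anything TatSu prescribes.

```python
class CountCheckSemantics:
    """Hypothetical semantics object: one method per grammar rule name."""

    MIN_COUNT = 2  # illustrative lower bound
    MAX_COUNT = 5  # illustrative upper bound

    def rule(self, ast):
        # `ast` is assumed to be the list produced by the closure {expression}.
        count = len(ast)
        if not self.MIN_COUNT <= count <= self.MAX_COUNT:
            raise ValueError(
                f"expected {self.MIN_COUNT} to {self.MAX_COUNT} repetitions, got {count}"
            )
        return ast
```

The check works, but the bounds live in Python rather than in the grammar, which is the point of the request.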

I think this is a good idea!

The syntax would have to be different, non regex-like, because TatSu already defines {} (and also () and []). There's already a lot of syntax around {}.

Perhaps it could be:

rule = {int}<4> {int}<2,4> {int}<2> ;

I think that TatSu only allows * after {}, so the new syntax could also be:

rule = int*4 int*2-4 (int string)*2 ;

We need to review the current syntax to choose a new one that makes the intention clear and doesn't collide with current semantics.

We should probably first provide an implementation, and decide about the syntax after.
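One possible first implementation, independent of whichever syntax wins, is to desugar a bounded repetition into syntax that already exists: the required copies written out in sequence, followed by nested optionals for the rest. The sketch below only builds the expanded grammar text; `expand_repetition` is a hypothetical helper, not part of TatSu, and the expansion would likely yield a differently shaped AST than a closure does.

```python
from typing import Optional


def expand_repetition(expr: str, lo: int, hi: Optional[int] = None) -> str:
    """Expand `expr` repeated lo..hi times into sequences and nested optionals.

    Hypothetical helper: it returns grammar text, it does not parse anything.
    """
    if hi is None:
        hi = lo  # a single number means "exactly lo times"
    if lo < 0 or hi < lo:
        raise ValueError(f"invalid repetition bounds: {lo}, {hi}")

    # The required part: `lo` copies in sequence.
    required = " ".join([expr] * lo)

    # The optional part: nested optionals, built from the innermost outwards,
    # e.g. [int [int]] for up to two extra copies.
    optional = ""
    for _ in range(hi - lo):
        optional = f"[{expr} {optional}]" if optional else f"[{expr}]"

    return " ".join(part for part in (required, optional) if part)


print(expand_repetition("int", 4))
print(expand_repetition("int", 2, 4))
print(expand_repetition("(int string)", 2))
```

Under these assumptions, `int` repeated 2 to 4 times becomes `int int [int [int]]`, which keeps the grammar readable for small counts but grows quickly for large ones.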

I just spent half an hour trying to find out what other syntaxes do, and the only one I could find was 're'! To be fair, it's probably the only repetition qualifier most of your users know. And I understand your reason for rejecting it.

It may be necessary to constrain it so that a sequence of repetition qualifiers can only include one range. So:
rule = int*4 int*2:4 int*2:5 int*3
might not be allowed or might be formally determined so the LHS or RHS is greedy.
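To make the concern concrete, here is a small counting model of that rule. It is not a parser and says nothing about how TatSu would behave; it only shows how many items each group would claim when every item is of the same kind. Both function names are hypothetical. The first strategy lets each group take as much as it can with no backtracking; the second holds back enough items for the minimums of the groups to its right.

```python
def greedy_no_backtrack(specs, total):
    """Each (lo, hi) group takes as many items as it can, left to right."""
    remaining = total
    counts = []
    for lo, hi in specs:
        take = min(hi, remaining)
        if take < lo:
            return None  # this group cannot reach its minimum
        counts.append(take)
        remaining -= take
    return counts if remaining == 0 else None


def greedy_with_reserve(specs, total):
    """Left-greedy, but leave enough items for the later groups' minimums."""
    remaining = total
    counts = []
    for index, (lo, hi) in enumerate(specs):
        reserve = sum(later_lo for later_lo, _ in specs[index + 1:])
        take = min(hi, remaining - reserve)
        if take < lo:
            return None
        counts.append(take)
        remaining -= take
    return counts if remaining == 0 else None


# int*4 int*2:4 int*2:5 int*3, written as (min, max) pairs
specs = [(4, 4), (2, 4), (2, 5), (3, 3)]
print(greedy_no_backtrack(specs, 12))  # None
print(greedy_with_reserve(specs, 12))  # [4, 3, 2, 3]
```

With 12 items, plain left-greedy matching starves the last group even though a valid split exists, which is why either a restriction or a formally defined greedy side would be needed.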

Did you notice I experimented with a colon in 2:4? I thought it had a more Pythonic flavour, though repetition isn't much like a slice. Of the two you offer, I mostly like the latter, but the '-' sign grated a little because my mind needs it to be subtraction. Too bad the ellipsis isn't on a standard keyboard.