neogeny / TatSu

竜 TatSu generates Python parsers from grammars in a variation of EBNF

Home Page: https://tatsu.readthedocs.io/

Regex broken after v5.7.3

pressureless opened this issue

The output of the code below with v5.7.3 is:

tatsu: 5.7.3
# SIMPLE PARSE
# AST
(   {'value': 'a1'},
    '+',
    {'value': 'a2'})

# JSON
[
    {
        "value": "a1"
    },
    "+",
    {
        "value": "a2"
    }
]

The output of the same code with v5.8.3 is:

tatsu: 5.8.3
# SIMPLE PARSE
# AST
(   {'value': '1'},
    '+',
    {'value': '2'})

# JSON
[
    {
        "value": "1"
    },
    "+",
    {
        "value": "2"
    }
]

import json
from pprint import pprint
import tatsu
print("tatsu: {}".format(tatsu.__version__))

GRAMMAR=r"""@@grammar::CALC


start
    =
    expression $
    ;


expression
    =
    | expression '+' ~ term
    | expression '-' ~ term
    | term
    ;


term
    =
    | term '*' ~ factor
    | term '/' ~ factor
    | factor
    ;


factor
    =
    | '(' ~ expression ')'
    | number
    ;

number
    = value:/[A-Za-z]([A-Za-z0-9]*)/
    ;"""


def simple_parse():
    grammar = GRAMMAR

    parser = tatsu.compile(grammar)
    ast = parser.parse('a1 + a2')

    print('# SIMPLE PARSE')
    print('# AST')
    pprint(ast, width=20, indent=4)

    print()

    print('# JSON')
    print(json.dumps(ast, indent=4))


if __name__ == '__main__':
    simple_parse()

This is an intentional change; see the 5.8.0 entry in the changelog: https://github.com/neogeny/TatSu/blob/v5.8.3/CHANGELOG.rst#580--2022-03-12

Honor grouping in pattern expressions with the semantics of re.findall(pattern, text)[0]. Now groups that should not be returned when parsing should use the (?:) syntax.
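For reference, here is a minimal sketch of how Python's own re.findall treats groups, since the changelog entry defines the new behaviour in terms of it. It uses only the standard library. The two-group pattern is an illustrative assumption and does not appear in the grammar above, and the thread does not say how TatSu itself handles a pattern with several groups.

```python
import re

text = "a1"

# No groups: findall returns the whole match.
no_group = re.findall(r"[A-Za-z][A-Za-z0-9]*", text)

# One capturing group: findall returns only that group's text.
one_group = re.findall(r"[A-Za-z]([A-Za-z0-9]*)", text)

# Two capturing groups (illustrative pattern): one tuple of groups per match.
two_groups = re.findall(r"([A-Za-z])([A-Za-z0-9]*)", text)

# Non-capturing group: behaves like the no-group case.
non_capturing = re.findall(r"[A-Za-z](?:[A-Za-z0-9]*)", text)

print(no_group)       # ['a1']
print(one_group)      # ['1']
print(two_groups)     # [('a', '1')]
print(non_capturing)  # ['a1']
```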

Following the change log entry:

>>> re.findall('[A-Za-z]([A-Za-z0-9]*)', 'a1')[0]
'1'

You need to modify the regular expression as follows:

>>> re.findall('[A-Za-z](?:[A-Za-z0-9]*)', 'a1')[0]
'a1'

or just drop the grouping, as it does not seem to be needed in this regular expression:

>>> re.findall('[A-Za-z][A-Za-z0-9]*', 'a1')[0]
'a1'
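When upgrading a larger grammar, it may help to find every pattern that contains a capturing group, because those are the ones whose result can change. The sketch below relies on the groups attribute of a compiled pattern from the standard re module. The helper name has_capturing_groups and the patterns dictionary are hypothetical, and pulling the patterns out of a real grammar is left to the reader.

```python
import re


def has_capturing_groups(pattern: str) -> bool:
    """Return True if the pattern contains at least one capturing group."""
    # Pattern.groups counts capturing groups only; (?:...) is not counted.
    return re.compile(pattern).groups > 0


# Hypothetical rule-name -> pattern mapping, copied by hand from a grammar.
patterns = {
    "number": r"[A-Za-z]([A-Za-z0-9]*)",
    "number_non_capturing": r"[A-Za-z](?:[A-Za-z0-9]*)",
    "number_no_group": r"[A-Za-z][A-Za-z0-9]*",
}

# Rules whose patterns still capture and so may return a partial match.
affected = [name for name, pat in patterns.items() if has_capturing_groups(pat)]
print(affected)  # ['number']
```

Only the original pattern is flagged here; both rewritten forms pass the check.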

I saw that change, but didn't get it. Thank you for the clarification!