pyparsing / pyparsing

Python library for creating PEG parsers

RecursionError with Forward syntax

RobinetDenisAcsone opened this issue

Follow-up to #489 (comment)
Summary: A syntax with nested functions can raise a RecursionError, depending on the nesting depth.

@ptmcg As you asked, here is the syntax code: https://github.com/Arelle/Arelle/blob/master/arelle/formula/XPathParser.py#L799

A bit of context of use:
Arelle is software that handles a specification called XBRL, which allows reporting data based on a taxonomy (which defines what kinds of data can be reported) created by an authority.

In one of the latest versions of a taxonomy (Solvency 2.8), there are a few formulas (used to validate the data) that look like this:

iaf:numeric-equal($v1,iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add(iaf:numeric-add($v2,$v3),$v4),$v5),$v6),$v7),$v8),$v9),$v10),$v11),$v12),$v13),$v14),$v15),$v16),$v17),$v18),$v19),$v20),$v21),$v22),$v23),$v24),$v25),$v26),$v27),$v28),$v29),$v30),$v31),$v32),iaf:sum($v33)))

(id: s2md_BV311_1-1-1)

Compiling it results in a RecursionError with the syntax provided above.
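For readers without the Arelle sources at hand, the dependence on nesting depth can be probed with a much smaller grammar. The sketch below is not the Arelle XPath grammar: `expr`, `func_call`, `nested_add` and `try_parse` are hypothetical names, and the grammar only covers nested function calls over `$variables`. The depth at which a RecursionError appears depends on the grammar, the Python recursion limit and whether packrat is enabled, so no threshold is claimed here.

```python
from pyparsing import Forward, Group, Regex, Suppress, ZeroOrMore

# Hypothetical, heavily reduced grammar: nested function calls over $variables.
expr = Forward()
variable = Regex(r"\$\w+")
func_name = Regex(r"[A-Za-z_][\w-]*:[A-Za-z_][\w-]*")  # e.g. iaf:numeric-add
args = expr + ZeroOrMore(Suppress(",") + expr)
func_call = Group(func_name + Suppress("(") + args + Suppress(")"))
expr <<= func_call | variable


def nested_add(depth):
    """Build a formula with `depth` nested iaf:numeric-add calls."""
    text = "$v0"
    for i in range(1, depth + 1):
        text = f"iaf:numeric-add({text},$v{i})"
    return text


def try_parse(depth):
    """Return "ok" if the formula parses, "RecursionError" if the stack overflows."""
    try:
        expr.parse_string(nested_add(depth), parse_all=True)
        return "ok"
    except RecursionError:
        return "RecursionError"


if __name__ == "__main__":
    # Each nesting level costs several Python frames inside pyparsing,
    # so the parse fails well before depth reaches sys.getrecursionlimit().
    for depth in (5, 30, 200):
        print(depth, try_parse(depth))
```

Running the loop with and without `ParserElement.enable_packrat()` gives a rough idea of how many nesting levels a given recursion limit allows.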

If you want to replicate the real use case, I can explain how, but be aware that it takes a few minutes of loading to get to that point.

@ptmcg I tried to keep the explanation as short as possible, but feel free to ask me anything.
The 2 main things to understand are:

  • We can't change what is accepted in the syntax (since it's defined in a specification)
  • The string to parse is also outside our control.

Other people who may be interested in the discussion: @hefischer (original author of the code) @austinmatherne-wk

One of the things I found to really help in my Verilog parser was to change some of the terminal items for numeric values from a combination of expressions:

integer = Word(nums)
signed_integer = Combine(Opt(Literal('+') | Literal('-')) + integer)

to a single Regex expression:

signed_integer = Regex(r"[+-]?\d+")

and so on for the floating point, and the floating point with exponent forms. The most obvious benefit I saw was a much improved parse time, but the reduction in number and nesting of expressions probably helped in reducing the recursion as well.
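As a quick sanity check of the replacement described above, the two forms can be compared on a few inputs. This is a minimal sketch; `combined_int`, `regex_int` and `samples` are names chosen for illustration only.

```python
from pyparsing import Combine, Literal, Opt, Regex, Word, nums

# Original form: three nested expressions wrapped in Combine.
integer = Word(nums)
combined_int = Combine(Opt(Literal("+") | Literal("-")) + integer)

# Replacement: a single Regex terminal, so fewer nested parse calls per match.
regex_int = Regex(r"[+-]?\d+")

samples = ["42", "+7", "-123"]
for text in samples:
    a = combined_int.parse_string(text, parse_all=True).as_list()
    b = regex_int.parse_string(text, parse_all=True).as_list()
    # Both forms return the whole literal as one token.
    assert a == b == [text]
```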

I took a stab at part of your parser that deals with integer and real expressions, not tested but hopefully this might give some initial relief (if not in the recursion department, at least in the speed department):

decimalPoint = Literal('.')
exponentLiteral = CaselessLiteral('e')
plusorminusLiteral = Literal('+') | Literal('-')
digits = Word(nums)
# integerLiteral = Combine(Opt(plusorminusLiteral) + digits)
integerLiteral = Regex(r"[+-]?\d+")
# decimalFractionLiteral = Combine(Opt(plusorminusLiteral) + decimalPoint + digits)
decimalFractionLiteral = Regex(r"[+-]?\.\d+")
# infLiteral = Combine(Opt(plusorminusLiteral) + Literal("INF"))
infLiteral = Regex(r"[+-]?INF")
nanLiteral = Literal("NaN")

# floatLiteral = (
#     Combine(
#         integerLiteral
#         + ((decimalPoint + Opt(digits) + exponentLiteral + integerLiteral) | (exponentLiteral + integerLiteral))
#     )
#     | Combine(decimalFractionLiteral + exponentLiteral + integerLiteral)
#     | infLiteral
#     | nanLiteral
# )
floatLiteral = Regex(
    r"[+-]?\d+"                # integerLiteral
    r"(\.\d*)?"                # decimalPoint + Opt(digits)
    r"[eE][+-]?\d+"            # exponentLiteral + integerLiteral
    r"|"
    r"[+-]?\.\d+[eE][+-]?\d+"  # decimalFractionLiteral + exponentLiteral + integerLiteral
    r"|"
    r"[+-]?INF"
    r"|"
    r"NaN"
)

# decimalLiteral = Combine(integerLiteral + decimalPoint + Opt(digits)) | decimalFractionLiteral
decimalLiteral = Regex(
    r"[+-]?\d\.\d*"
    r"|"
    r"[+-]?\.\d+"
)

The regex change isn't enough, but it's still a nice optimization.
I also tried without enable_packrat, which, based on my first comment on the other issue, reduces the stack depth, and that allows the parsing to pass (I should have tried that after I saw this behavior while writing the minimal script, sorry).

Here are some numbers (measured on ~2,000 different expressions, plus some custom code in the set_parse_action).
Caveat: it's only one run, but the time difference seems big enough not to need real performance testing.

               | With enablePackrat (before, some failures) | Without enablePackrat (after)
Before Regex   | ~108s                                      | ~67s
After Regex    | ~83s                                       | ~57s
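A comparison like this can be scripted along the following lines. This is only a sketch: `build_grammar` and `timed_parse` are hypothetical helpers, and the tiny grammar stands in for the real XPath grammar, so the timings it prints say nothing about the numbers above. It relies on `ParserElement.enable_packrat()` and `ParserElement.disable_memoization()`, which switch packrat on and off globally.

```python
import time

from pyparsing import Forward, Group, ParserElement, Regex, Suppress, ZeroOrMore


def build_grammar():
    """Hypothetical stand-in grammar: nested function calls over $variables."""
    expr = Forward()
    variable = Regex(r"\$\w+")
    func_name = Regex(r"[A-Za-z_][\w-]*:[A-Za-z_][\w-]*")
    args = expr + ZeroOrMore(Suppress(",") + expr)
    expr <<= Group(func_name + Suppress("(") + args + Suppress(")")) | variable
    return expr


def timed_parse(text, use_packrat, repeat=20):
    """Parse `text` `repeat` times; return (tokens, elapsed seconds)."""
    if use_packrat:
        ParserElement.enable_packrat()
    else:
        ParserElement.disable_memoization()
    grammar = build_grammar()
    start = time.perf_counter()
    for _ in range(repeat):
        tokens = grammar.parse_string(text, parse_all=True).as_list()
    elapsed = time.perf_counter() - start
    ParserElement.disable_memoization()  # packrat is off by default; restore that
    return tokens, elapsed


if __name__ == "__main__":
    formula = "iaf:numeric-equal($v1,iaf:numeric-add(iaf:numeric-add($v2,$v3),$v4))"
    for flag in (True, False):
        tokens, elapsed = timed_parse(formula, use_packrat=flag)
        print(f"packrat={flag}: {elapsed:.4f}s")
```

Because the packrat switch is process-wide, each configuration is best measured with a freshly built grammar, as done here.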

I still need to do more regression testing, but there is a small typo in the decimalLiteral: a + is missing. Otherwise everything seems good, thanks!

decimalLiteral = Regex(
    r"[+-]?\d+\.\d*"
    r"|"
    r"[+-]?\.\d+"
)
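The effect of the missing `+` is easy to check with the standard `re` module alone. In this small sketch, `typo` and `fixed` are names chosen for illustration; they hold the two patterns quoted in this thread.

```python
import re

# Pattern as first posted: only one digit allowed before the decimal point.
typo = re.compile(r"[+-]?\d\.\d*|[+-]?\.\d+")
# Corrected pattern: one or more digits before the decimal point.
fixed = re.compile(r"[+-]?\d+\.\d*|[+-]?\.\d+")

for text in ["1.5", "12.5", "-.25", "100."]:
    print(text, bool(typo.fullmatch(text)), bool(fixed.fullmatch(text)))
# 1.5 True True
# 12.5 False True
# -.25 True True
# 100. False True
```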

Hopefully, the following versions of the taxonomy will not add too much nesting on top of the existing ones.

From my point of view, the issue can be closed. As for the packrat optimization, I took a quick look at the _parseCache and everything seems right, so it's probably just not a good fit for this particular case (the cache is probably not useful).

Packrat parsing really comes into its own when grammars use the infix_notation method, which creates some lookaheads (to confirm that a particular binary expression is about to be parsed) that are then followed by the same expression being actually parsed. Since you didn't use that method in this parser, but instead implemented the features on your own (similar to how infix parsing is done in fourFn.py), your parser is already more efficient. So it makes sense to me that packrat is not much help here.
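For contrast, this is the kind of grammar where packrat is usually worth enabling. It is a minimal sketch of an `infix_notation` grammar, unrelated to the Arelle parser; `operand` and `arith` are illustrative names.

```python
from pyparsing import OpAssoc, ParserElement, Word, infix_notation, nums, one_of

# Packrat is a global switch; it is normally enabled once, right after import.
ParserElement.enable_packrat()

operand = Word(nums)
arith = infix_notation(
    operand,
    [
        (one_of("* /"), 2, OpAssoc.LEFT),  # higher precedence
        (one_of("+ -"), 2, OpAssoc.LEFT),  # lower precedence
    ],
)

result = arith.parse_string("1+2*3", parse_all=True).as_list()
print(result)  # [['1', '+', ['2', '*', '3']]]
```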

Closing