performance issue when parsing => grammar not good?

Question

performance issue when parsing => grammar not good?

cons0l3 opened this issue 3 years ago · comments

Dear all,

first of all: nice work! keep it up.

I am currently working on an expression parser (arpeggio 1.10.1) that mimics python expressions:

from arpeggio import *
from arpeggio import RegExMatch as _

class SuppressStrMatch(StrMatch):
    suppress = True

def comment():          return [_("//.*"), _("/\*.*\*/")] 

def expression():       return Optional(disjunction), EOF

def disjunction():      return [(conjunction, OneOrMore( Kwd("OR"), conjunction ) ), conjunction]
def conjunction():      return [(inversion, OneOrMore( Kwd("AND"), inversion) ), inversion]
def inversion():        return [ (Kwd('NOT'), inversion), comparison]
def comparison():       return [ compare_op, plus]
def compare_op():       return [ binaryOp, notin, element_in, isnot, element_is ]
def binaryOp():         return plus, ['==','!=','<=','<','>=','>'], plus
def notin():            return plus, 'not', 'in', SuppressStrMatch("["), disjunction, ZeroOrMore(SuppressStrMatch(","), disjunction), SuppressStrMatch("]")
def element_in():       return plus, 'in', SuppressStrMatch("["), disjunction, ZeroOrMore(SuppressStrMatch(","), disjunction), SuppressStrMatch("]")
def isnot():            return plus, 'is', 'not', plus
def element_is():       return plus, 'is', plus
def plus():             return term, ZeroOrMore(['+', '-'], term)
def term():             return factor, ZeroOrMore(['*', '/'], factor)
def factor():           return Optional(['+','-']), [value, parenthesis]
def value():            return [literal, symbol]

def literal():          return [numericLiteral, stringLiteral, constantLiteral ]
def numericLiteral():   return _(r'\d*\.\d*|\d+') #(E(+|-)?\d+)?
def stringLiteral():    return [_(r'".*?"'), _(r"'.*?'")]
def constantLiteral():  return [_(r"True"), _(r"False"), _(r"NULL")]
def symbol():           return _(r"\w+")
def parenthesis():      return SuppressStrMatch("("), disjunction, SuppressStrMatch(")")

ExpressionParser:ParserPython = ParserPython(expression, comment, debug=False, reduce_tree=True, ignore_case=True)

when I execute a parse:
expression = ExpressionParser.parse("""(p in [37,47] and (c in ["A", "P"] or l==true)) or (p in [75,76,361,390,391] and c in ["A", "P"])""")

it takes about 8 seconds. that is way to long. I would like it to be way below 1 second for parsing such a simple expression.

could anyone assist in explaing or advising how to fix the grammar? (which is directly derived from the peg parser of python itself and translated to arpeggio style)

thanks
carsten

Thanks for pushing me in the right direction

Carsten Steckel · Answer 1 · Wed Apr 14 2021 16:25:35 GMT+0800 (China Standard Time)

oh, and I have really aweful exceptipons when the parsed string has syntax issues. it is always stack exceptions and not the nice exception "hey at position 3 after the "or" something is not right..." kind of deal.

Igor Dejanović · Answer 2 · Wed Apr 14 2021 17:09:13 GMT+0800 (China Standard Time)

For these kinds of grammars where there is a lot of backtracking (expressions grammars are a good example) you should turn on memoization which can yield a massive speed improvements.

Here is a test on my machine:

import time
# Without memoization
ExpressionParser = ParserPython(expression, comment, debug=False, reduce_tree=True,
                                ignore_case=True)
exp_str = """(p in [37,47] and (c in ["A", "P"] or l==true))
             or (p in [75,76,361,390,391] and c in ["A", "P"])"""
start = time.process_time()
exp_nomem = ExpressionParser.parse(exp_str)
print(exp_nomem.tree_str())
print("TIME:", time.process_time() - start, "\n")

# With memoization
ExpressionParser = ParserPython(expression, comment, debug=False, reduce_tree=True,
                                ignore_case=True, memoization=True)
start = time.process_time()
exp_mem = ExpressionParser.parse(exp_str)
print(exp_mem.tree_str())
print("TIME:", time.process_time() - start)

Output:

expression=Sequence [1-110]
  disjunction=OrderedChoice [1-108]
    conjunction=OrderedChoice [1-45]
      element_in=Sequence [1-12]
        symbol=RegExMatch(\w+) [1-2]: p
...
  EOF [110-110]: 
TIME: 9.193519810000002 

expression=Sequence [1-110]
  disjunction=OrderedChoice [1-108]
    conjunction=OrderedChoice [1-45]
      element_in=Sequence [1-12]
...
  EOF [110-110]: 
TIME: 0.004547567999999558

As you can see, without memoization it takes over 9 secs while with memoization it takes only 4 ms.

For handling exception see this part of the docs

Igor Dejanović · Answer 3 · Wed Apr 14 2021 17:14:04 GMT+0800 (China Standard Time)

TLDR; for exception the easy way to get just the error message without stacktrace is to catch NoMatch exception and just print it.

Carsten Steckel · Answer 4 · Wed Apr 14 2021 17:15:42 GMT+0800 (China Standard Time)

thanks igor! it seems i can also improve performance by rewriting a few rules. but memotization already improved significantly.

Carsten Steckel · Answer 5 · Wed Apr 14 2021 17:23:34 GMT+0800 (China Standard Time)

you are my hero of the day!

Carsten Steckel · Answer 6 · Wed Apr 14 2021 19:31:55 GMT+0800 (China Standard Time)

I rewrote the first few rules to be:

def disjunction():      return conjunction, ZeroOrMore( Kwd("OR"), conjunction )
def conjunction():      return inversion, ZeroOrMore( Kwd("AND"), inversion)
def inversion():        return Optional(Kwd('NOT')), comparison
def comparison():       return [ binaryCompareOp, notin, element_in, isnot, element_is, plus ]
def binaryCompareOp():  return plus, ['==','!=','<=','<','>=','>'], plus

and that alone helped considerably plus Igors tipp.

Problem solved.

Igor Dejanović · Answer 7 · Wed Apr 14 2021 20:05:05 GMT+0800 (China Standard Time)

I'm glad it solved your problem. I'll leave the issue open labeled with question to be easier to find.