goodmami / pe

Fastest general-purpose parsing library for Python with a familiar API

More "common" optimizations

goodmami opened this issue

There are currently two "common" optimizations:

  • ![...] . (a negative lookahead on a character class, followed by the any-character dot) becomes a negated character class
  • A sequence of one item becomes the item only (helps when the previous optimization results in the negated character class being the only thing left in its sequence)
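The first rewrite is safe because "not one of these characters, then any character" accepts exactly the same inputs as a negated class. A minimal sketch of that equivalence, using Python's `re` module as a stand-in (this is an illustration only, not how pe represents or compiles grammars; the names `lookahead_then_dot`, `negated_class`, and `same_result` are made up for the example):

```python
import re

# PEG `![abc] .` written as a regex: negative lookahead, then any character.
# re.DOTALL makes `.` match newlines too, like the PEG dot.
lookahead_then_dot = re.compile(r"(?![abc]).", re.DOTALL)

# The optimized form: a single negated character class.
negated_class = re.compile(r"[^abc]")


def same_result(text):
    """Return True if both patterns agree on whether `text` matches at position 0."""
    return bool(lookahead_then_dot.match(text)) == bool(negated_class.match(text))


for sample in ["a", "c", "x", "\n", ""]:
    assert same_result(sample)
```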

Here are some more:

  • [a] -> "a": a single-character non-negated character class becomes a single-character literal
  • "a" "bc" "d" -> "abcd": a sequence of literals becomes a single literal
  • [ab] / "m" / [yz] -> [abmyz]: a choice of non-negated classes or single-character literals becomes a single class containing the union of their characters
  • (![abc] .) / (![cde] .) -> (![c] .): a choice of negated classes becomes a single negated class containing the intersection of their characters (a character is rejected only if every alternative rejects it)
  • [cdeabzab] -> [a-ez]: duplicates in a class (negated or not) are removed; contiguous runs are replaced with ranges
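The character-class rewrites in the list above reduce to plain set operations. The following is a rough sketch under simplifying assumptions, not pe's implementation: classes are modelled as Python strings or sets of characters, the function names are hypothetical, escaping of special characters such as `-`, `]`, and `\` is ignored, and a run of only two characters is left as-is rather than written as a range.

```python
def union_of_classes(*alternatives):
    """Choice of non-negated classes / single-char literals -> one class."""
    chars = set()
    for alternative in alternatives:
        chars |= set(alternative)
    return chars


def intersection_of_negated(*alternatives):
    """Choice of negated classes -> one negated class.

    A character fails the whole choice only if it is excluded by
    every alternative, hence the intersection.
    """
    return set.intersection(*(set(a) for a in alternatives))


def simplify_class(chars):
    """Remove duplicates and collapse contiguous runs into ranges."""
    runs = []
    for code in sorted(set(map(ord, chars))):
        if runs and code == runs[-1][1] + 1:
            runs[-1][1] = code          # extend the current run
        else:
            runs.append([code, code])   # start a new run
    parts = []
    for lo, hi in runs:
        if hi - lo >= 2:
            parts.append(f"{chr(lo)}-{chr(hi)}")
        else:
            parts.append("".join(chr(c) for c in range(lo, hi + 1)))
    return "[" + "".join(parts) + "]"


print(simplify_class("cdeabzab"))                            # [a-ez]
print(simplify_class(union_of_classes("ab", "m", "yz")))     # [abmyz]
print(simplify_class(intersection_of_negated("abc", "cde"))) # [c]
```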

Some of these opportunities only appear after other optimizations have been applied, so a single pass may not fully optimize the grammar.
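One way to handle this is to repeat the passes until the grammar stops changing. The sketch below assumes a toy tuple-based grammar tree (`("lit", str)`, `("cls", str)`, `("seq", list)`) and hypothetical pass names; it is not pe's internal representation or API, only an illustration of why one round can fall short.

```python
def map_tree(fn, node):
    """Apply fn to every node, children first."""
    kind, value = node
    if kind == "seq":
        node = (kind, [map_tree(fn, child) for child in value])
    return fn(node)


def join_literals(node):
    """"a" "bc" "d" -> "abcd" """
    kind, value = node
    if kind != "seq":
        return node
    items = []
    for item in value:
        if items and items[-1][0] == "lit" and item[0] == "lit":
            items[-1] = ("lit", items[-1][1] + item[1])
        else:
            items.append(item)
    return ("seq", items)


def unwrap_single(node):
    """A sequence of one item becomes the item only."""
    kind, value = node
    return value[0] if kind == "seq" and len(value) == 1 else node


def class_to_literal(node):
    """[a] -> "a" """
    kind, value = node
    return ("lit", value) if kind == "cls" and len(value) == 1 else node


PASSES = [join_literals, unwrap_single, class_to_literal]


def run_once(node):
    """Run each pass once over the whole tree, in a fixed order."""
    for optimization in PASSES:
        node = map_tree(optimization, node)
    return node


def optimize(node):
    """Repeat run_once until nothing changes (a fixed point)."""
    while True:
        result = run_once(node)
        if result == node:
            return node
        node = result


# [a] "bc" ("d")
grammar = ("seq", [("cls", "a"), ("lit", "bc"), ("seq", [("lit", "d")])])
print(run_once(grammar))  # ('seq', [('lit', 'a'), ('lit', 'bc'), ('lit', 'd')])
print(optimize(grammar))  # ('lit', 'abcd')
```

With this pass order, the first round only exposes the adjacent literals; the second round joins them and unwraps the resulting one-item sequence.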

I will save the last two (choice of negated character classes, character class simplification) for later. The rest are implemented as of 44ecc0d.