This is a work in progress. Right now it can read a grammar (using an extension of the notation used by pgen2 for CPython's Grammar) and generate a pure Python module that contains a packrat parser.
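As a rough illustration of the "packrat" part (a minimal sketch, not the generated code): a packrat parser memoizes each (rule, position) pair, so backtracking over the same input never re-parses it.

```python
def memoize(method):
    # Cache each (rule name, input position) result so a rule is
    # parsed at most once per position, even after backtracking.
    def wrapper(self):
        key = (method.__name__, self.pos)
        if key in self.cache:
            result, endpos = self.cache[key]
            self.pos = endpos
            return result
        result = method(self)
        self.cache[key] = (result, self.pos)
        return result
    return wrapper


class MemoParser:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.cache = {}
        self.calls = 0  # counts actual (non-cached) rule invocations

    @memoize
    def letter(self):
        self.calls += 1
        if self.pos < len(self.text) and self.text[self.pos].isalpha():
            ch = self.text[self.pos]
            self.pos += 1
            return ch
        return None


p = MemoParser("ab")
p.letter()   # parses 'a', advancing pos to 1
p.pos = 0    # backtrack
p.letter()   # cache hit: pos restored to 1 without re-parsing
print(p.calls)  # 1
```

The cache trades memory for a linear-time guarantee, which is why packrat parsing can afford PEG's unlimited backtracking.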
Note that this deviates from the standard PEG notation in various ways:

- It requires a separate tokenizer (currently tied to `tokenize.py`)
- The notation is different from the standard PEG formalism:
  - Use `:` instead of `<-`
  - Use `|` instead of `/`
- Notation for tokens is the same as in CPython's Grammar
- Handling of operators and reserved words is a bit janky
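For illustration, a rule in this notation might look like the following (a hypothetical fragment, not taken from the actual grammar):

```
start: expr NEWLINE
expr: expr '+' term | term
term: NUMBER | '(' expr ')'
```

Here `:` separates a rule name from its alternatives, `|` separates the alternatives, and uppercase names like `NUMBER` and `NEWLINE` refer to token types from the tokenizer, as in CPython's Grammar.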
Both the generator and the generated parsers require Python 3.8 -- it turns out writing a recursive-descent packrat parser is a really great use case for the walrus operator (`:=`).
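To see why (a toy sketch, not pegen's generated code): the walrus operator lets each alternative of a rule test and bind a subresult inside a single `if` condition, so a failed alternative can simply reset the position and fall through to the next one.

```python
class ToyParser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def mark(self):
        return self.pos

    def reset(self, pos):
        self.pos = pos

    def expect(self, tok):
        # Consume and return the next token if it matches, else None.
        if self.pos < len(self.tokens) and self.tokens[self.pos] == tok:
            self.pos += 1
            return tok
        return None

    def number(self):
        if self.pos < len(self.tokens) and self.tokens[self.pos].isdigit():
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        return None

    def sum(self):
        # sum: number '+' sum | number
        pos = self.mark()
        if (left := self.number()) and self.expect('+') and (right := self.sum()):
            return ('+', left, right)
        self.reset(pos)
        return self.number()


p = ToyParser(['1', '+', '2', '+', '3'])
print(p.sum())  # ('+', '1', ('+', '2', '3'))
```

Without `:=`, each alternative would need separate assignment statements and nested `if`s, obscuring the one-line-per-alternative shape that makes generated parsers readable.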
I've started blogging on Medium about this. I'll probably end up rewriting everything based on the approach from the blogs. Episodes:
- Series overview
- PEG Parsers
- Building a PEG Parser
- Generating a PEG Parser
- Visualizing PEG Parsing
- Left-recursive PEG grammars
- Adding actions to a PEG grammar
- A Meta-Grammar for PEG Parsers
- Implementing PEG Features
- PEG at the Core Developer Sprint
I gave a talk about this at North Bay Python: Writing a PEG parser for fun and profit
I am working on generating C code for a Python extension based on the same grammar notation. This will produce an AST that can be compiled to running code.
It is not yet complete, but a preliminary test shows that it can parse
a file of 100,000 lines containing simple expressions (`data/xxl.txt`)
in ~0.8 seconds, using ~420 MiB of memory. For comparison, compiling
the same file to bytecode currently takes ~2.5 seconds, using ~870
MiB. (A newer version can produce working AST nodes, and it produces
the AST for that same file in ~5.9 seconds, using ~1100 MiB; the
stdlib ast module does the same in ~6 seconds, using ~880 MiB.
However, these times are on a faster machine. Likely the majority of
this time is spent converting the internal AST to the public AST.)
PS. It's pronounced "peggen".