Peggy's Parser

peggy is a small and efficient parser generator based on PEG grammars.

It can parse nested grammars and supports repetition operators. Error reporting is tailored to be as intuitive and readable as possible.

There are two crates: peggy, which contains all of the library's code, and peggy_macro, which generates a Rust module from a grammar to produce a native parser.

Examples

You can find several examples in the source directories for peggy and peggy_macro, notably:

peggy/rpn - A Reverse Polish Notation (RPN) evaluator, using the runtime engine
macro/rpn - The same RPN evaluator but using a parser generator

RPN example

A Reverse Polish Notation (RPN) grammar may look like this:

S = °B_WHITESPACE                           # Whitespace
DEC_SEP = °("." | ",")                      # Decimal separator

int = @(B_ASCII_DIGIT+)                     # Integer
float = int DEC_SEP int                     # Floating-point number
number = int | float                        # Number

operator = "+" | "-" | "*" | "/"            # Operator
operand = number | paren_expr               # Operand
operation = operand S+ operand S* operator  # Complete operation

paren_expr = °"(" S* expr S* °")"           # Expression wrapped in parentheses
expr = number | operation | paren_expr      # Complete expression

main = expr                                 # Grammar's entrypoint

This will be able to match complex operations like (3 (9.3 3 /) +) (5 (2 3 /) /) /.
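To make the example concrete, here is a minimal stack-based evaluator for such expressions. It is only a sketch and does not use peggy: the function name `eval_rpn` is hypothetical, parentheses are treated as purely visual grouping, and tokens are assumed to be separated by whitespace (which is stricter than the grammar above).

```rust
/// Evaluate a parenthesized RPN expression with a plain stack.
/// Hypothetical helper for illustration; not part of peggy.
fn eval_rpn(input: &str) -> Result<f64, String> {
    let mut stack: Vec<f64> = Vec::new();

    // In RPN the parentheses carry no extra meaning, so turn them into spaces.
    let cleaned = input.replace(|c: char| c == '(' || c == ')', " ");

    for token in cleaned.split_whitespace() {
        match token {
            "+" | "-" | "*" | "/" => {
                // The right operand was pushed last, so it is popped first.
                let rhs = stack.pop().ok_or("missing right operand")?;
                let lhs = stack.pop().ok_or("missing left operand")?;
                stack.push(match token {
                    "+" => lhs + rhs,
                    "-" => lhs - rhs,
                    "*" => lhs * rhs,
                    _ => lhs / rhs,
                });
            }
            number => {
                // The grammar accepts both "." and "," as decimal separators.
                let value = number
                    .replace(',', ".")
                    .parse::<f64>()
                    .map_err(|e| format!("invalid number {:?}: {}", number, e))?;
                stack.push(value);
            }
        }
    }

    // A well-formed expression leaves exactly one value on the stack.
    if stack.len() == 1 {
        Ok(stack[0])
    } else {
        Err(format!("expected one result, found {}", stack.len()))
    }
}
```

Calling `eval_rpn` on the expression above walks through 9.3 / 3, then 3 + 3.1, then 5 / (2 / 3), and finally divides the two intermediate results.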

Parser generator usage

Here is an example usage of the parser generator:

use peggy_macro::peggy_gen;

#[peggy_gen(filename = "rpn.peggy")]
pub mod rpn_grammar {}

fn main() {
    // Parse the expression with the generated module
    let success = rpn_grammar::exec("(3 (9.3 3 /) +) (5 (2 3 /) /) /").unwrap();

    // Inspect the parsed result (all generated types implement Debug)
    dbg!(&success);
}

The main advantages of the parser generator are better performance (roughly 10 times faster than the optimized runtime engine), ease of use, and safety: you extract information directly from your grammar without having to handle unreachable cases. This also means that updating your grammar will immediately show which parts of your code need to be updated.

The generated types are optimized to be as lightweight and easy to use as possible; the only corner case is the use of the Rc type to store information in recursive patterns.

The success type returned by ::exec is generated from the input grammar; if your IDE doesn't expand procedural macros and gives you no information about the generated types, you can inspect the result's content with the dbg!() macro (or format!("{:#?}") if you need a formatted string).

All of the generated types implement the Debug and Clone traits.

Performance

On my computer (Intel Core i7-9700F), in release mode the grammar is parsed in 16 microseconds (0.016 milliseconds) while the runtime engine takes about 128 microseconds (0.128 milliseconds).

As you can guess, these times increase linearly with the size of the input, which can lead to parse times of several seconds if you parse tens of thousands of kilobytes.

With the parser generator, we go down from 128 microseconds to only 6.5 microseconds (0.0065 milliseconds).
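If you want to take comparable measurements on your own machine, a small timing helper is enough. This is a generic sketch built only on the standard library; `average_time` is a hypothetical name, and the closure you pass in is where a call to either engine would go.

```rust
use std::time::{Duration, Instant};

/// Run `run` `iterations` times and return the average wall-clock time per call.
/// Hypothetical helper for illustration; not part of peggy.
fn average_time<F: FnMut()>(iterations: u32, mut run: F) -> Duration {
    assert!(iterations > 0, "at least one iteration is required");

    let start = Instant::now();
    for _ in 0..iterations {
        run();
    }

    // Duration implements division by u32, which gives the mean per call.
    start.elapsed() / iterations
}
```

Averaging over many iterations matters here because a single run in the microsecond range is dominated by timer resolution and cache effects. Remember to build in release mode.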

Elegant error reporting

A simple grammar like the one shown in peggy/src/lib.rs will give the following error message:

ERROR: At line 1, column 7:

1 | Hello worlf !!
          ^

In rule [world]: Expected string literal "world"
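The layout of this message can be reproduced with a few lines of standard Rust. The sketch below is not peggy's implementation: `render_error` and its parameters are hypothetical, and it assumes the failure position is available as a byte offset that falls on a character boundary.

```rust
/// Build a caret-style report for a failure at byte `offset` of `input`.
/// Hypothetical helper for illustration; not peggy's actual reporting code.
fn render_error(input: &str, offset: usize, rule: &str, message: &str) -> String {
    // Everything before the failure tells us the line number and line start.
    let before = &input[..offset];
    let line_no = before.matches('\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);

    // Columns are 1-based and counted in characters, not bytes.
    let column = input[line_start..offset].chars().count() + 1;
    let line_text = input[line_start..].lines().next().unwrap_or("");

    // The caret is shifted by the gutter width plus the column offset.
    let gutter = format!("{} | ", line_no);
    let padding = " ".repeat(gutter.len() + column - 1);

    format!(
        "ERROR: At line {}, column {}:\n\n{}{}\n{}^\n\nIn rule [{}]: {}",
        line_no, column, gutter, line_text, padding, rule, message
    )
}
```

With the input `Hello worlf !!` and a failure at offset 6, the caret lands under the `w`, matching the report shown above.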

Grammar specifications

General syntax rules:

Comments start with # and run to the end of the current line. They can appear anywhere, but a # inside a string is treated as part of that string.
Lines can start and end with whitespace, which is ignored
Whitespace can appear anywhere unless stated otherwise

If a line consists only of three # symbols (optionally surrounded by whitespace), it opens a multi-line comment. Every following line is ignored until another line made only of three # symbols is found.

The grammar is made of multiple lines, which can either be:

Empty
Made of whitespace only
Empty or made of whitespace, followed by a comment
A rule declaration
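The comment and line rules above can be sketched as a small pre-processing pass that keeps only the rule declarations. This is an illustration of the rules as described, not peggy's actual code; `strip_comments` is a hypothetical name.

```rust
/// Return the rule declarations of a grammar, with comments and blank lines removed.
/// Hypothetical helper for illustration; not part of peggy.
fn strip_comments(grammar: &str) -> Vec<String> {
    let mut declarations = Vec::new();
    let mut in_block_comment = false;

    for line in grammar.lines() {
        // A line made only of "###" opens or closes a multi-line comment.
        if line.trim() == "###" {
            in_block_comment = !in_block_comment;
            continue;
        }
        if in_block_comment {
            continue;
        }

        // Find the first '#' that is not inside a double-quoted string.
        // Strings have no escaping, so every '"' simply toggles the state.
        let mut in_string = false;
        let mut end = line.len();
        for (index, c) in line.char_indices() {
            match c {
                '"' => in_string = !in_string,
                '#' if !in_string => {
                    end = index;
                    break;
                }
                _ => {}
            }
        }

        // Leading and trailing whitespace is ignored; empty lines are dropped.
        let content = line[..end].trim();
        if !content.is_empty() {
            declarations.push(content.to_string());
        }
    }

    declarations
}
```

Tracking the quoted-string state is what keeps a rule such as `hash = "#"` intact while still removing a trailing comment on the same line.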

Rule declarations start with the rule's name, which must respect the following rules:

Only alphanumeric characters and underscores are allowed
The name cannot start with a digit
The name cannot start with B_ as this is reserved for builtin rules
The name cannot start with E_ as this is reserved for external rules
The name must contain at least one character
Two rules cannot have the same name
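These naming rules translate directly into a validator. The sketch below is hypothetical (`validate_rule_name` and `find_duplicate` are not peggy functions), and it assumes "alphanumeric" means ASCII letters and digits, which the rules above do not state explicitly.

```rust
use std::collections::HashSet;

/// Check a declared rule name against the naming rules.
/// Assumption: "alphanumeric" is interpreted as ASCII alphanumeric.
fn validate_rule_name(name: &str) -> Result<(), String> {
    let first = match name.chars().next() {
        Some(c) => c,
        None => return Err("a rule name must contain at least one character".to_string()),
    };
    if first.is_ascii_digit() {
        return Err(format!("{:?} starts with a digit", name));
    }
    if name.starts_with("B_") {
        return Err(format!("{:?} uses the B_ prefix reserved for builtin rules", name));
    }
    if name.starts_with("E_") {
        return Err(format!("{:?} uses the E_ prefix reserved for external rules", name));
    }
    if let Some(bad) = name.chars().find(|c| !(c.is_ascii_alphanumeric() || *c == '_')) {
        return Err(format!("{:?} contains the forbidden character {:?}", name, bad));
    }
    Ok(())
}

/// Return the first rule name that is declared more than once, if any.
fn find_duplicate<'a>(names: &[&'a str]) -> Option<&'a str> {
    let mut seen = HashSet::new();
    // HashSet::insert returns false when the value was already present.
    names.iter().copied().find(|name| !seen.insert(*name))
}
```

Note that the prefix checks apply only to names being declared; referring to a builtin such as B_ASCII_DIGIT inside a pattern is of course allowed.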

They continue with the assignment operator (=) and the rule's content, which is made of a pattern. Patterns can either be:

A fixed string, between double quotes - there is no escaping mechanism; newline symbols and double quotes can be matched using builtin rules
Another rule's name (the provided rule will be used for matching)
A group (a pattern wrapped in parentheses)
A list of patterns separated by whitespace (all patterns must match the input, in order)
A union of patterns separated by vertical bars | (at least one of the patterns must match the input)
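One way to picture these five pattern kinds is as a recursive enum with a small matcher. The sketch below is a simplified model, not peggy's internal representation: the `Pattern` type and `match_at` function are hypothetical, unions are assumed to try alternatives from left to right, and left-recursive rules are not guarded against.

```rust
use std::collections::HashMap;

/// Simplified model of the pattern kinds (hypothetical, not peggy's own types).
#[derive(Debug, Clone)]
enum Pattern {
    Literal(String),        // "fixed string"
    Rule(String),           // reference to another rule
    Group(Box<Pattern>),    // ( pattern )
    Sequence(Vec<Pattern>), // a b c
    Union(Vec<Pattern>),    // a | b | c
}

/// Try to match `pattern` at byte position `pos`; return the position after the match.
fn match_at(
    pattern: &Pattern,
    rules: &HashMap<String, Pattern>,
    input: &str,
    pos: usize,
) -> Option<usize> {
    match pattern {
        Pattern::Literal(text) => {
            let rest = input.get(pos..)?;
            rest.starts_with(text.as_str()).then(|| pos + text.len())
        }
        // An unknown rule name simply fails to match in this sketch.
        Pattern::Rule(name) => match_at(rules.get(name)?, rules, input, pos),
        Pattern::Group(inner) => match_at(inner, rules, input, pos),
        // Every item must match, each one starting where the previous ended.
        Pattern::Sequence(items) => items
            .iter()
            .try_fold(pos, |cursor, item| match_at(item, rules, input, cursor)),
        // Assumption: the first alternative that matches wins.
        Pattern::Union(alternatives) => alternatives
            .iter()
            .find_map(|alt| match_at(alt, rules, input, pos)),
    }
}
```

A sequence threads the cursor through its items, while a union restarts every alternative from the same position; that difference is the whole distinction between the two list forms.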

Patterns can be decorated with a repetition model (no whitespace may appear between the end of the pattern and the model). It can be one of:

+: match this pattern as many times as possible, but at least once
*: match this pattern as many times as possible; zero matches are allowed
?: match this pattern once if possible; zero matches are allowed
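The three repetition models differ only in their minimum and maximum match counts. The following sketch shows this for a single-character pattern; `Repetition` and `match_repeated` are hypothetical names, and greedy matching without backtracking is assumed.

```rust
/// The three repetition models (hypothetical type for illustration).
#[derive(Debug, Clone, Copy)]
enum Repetition {
    OneOrMore,  // +
    ZeroOrMore, // *
    Optional,   // ?
}

/// Greedily repeat a single-character pattern at the start of `input`.
/// Returns the number of bytes consumed, or None if the repetition fails.
fn match_repeated(input: &str, rep: Repetition, pred: impl Fn(char) -> bool) -> Option<usize> {
    // `?` stops after one match; `+` and `*` have no upper bound.
    let limit = match rep {
        Repetition::Optional => 1,
        _ => usize::MAX,
    };

    let consumed: usize = input
        .chars()
        .take_while(|c| pred(*c))
        .take(limit)
        .map(char::len_utf8)
        .sum();

    match rep {
        // Only `+` can fail: it requires at least one match.
        Repetition::OneOrMore if consumed == 0 => None,
        _ => Some(consumed),
    }
}
```

For example, applied to `123abc` with a digit predicate, `+` and `*` both consume three bytes while `?` consumes one; applied to `abc`, only `+` fails.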

A mode can also be set on a pattern by prefixing it with a character (no space allowed):

°: silent pattern - consumes input but does not capture anything
~: peek pattern - does not capture or consume anything
!: negative pattern - matches only if the inner pattern doesn't; does not capture or consume anything
@: atomic pattern - is returned as a single string if it matches
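The difference between the silent, peek, and negative modes comes down to what happens to the consumed length and the captured text. The sketch below models this with a hypothetical `Outcome` type wrapped around a plain literal matcher; none of these names come from peggy.

```rust
/// Result of a successful match: how much input was consumed and what was captured.
/// Hypothetical type for illustration; not part of peggy.
#[derive(Debug, Clone, PartialEq)]
struct Outcome<'a> {
    consumed: usize,
    captured: Option<&'a str>,
}

/// Plain literal match: consumes and captures the literal.
fn literal<'a>(input: &'a str, lit: &str) -> Option<Outcome<'a>> {
    input.starts_with(lit).then(|| Outcome {
        consumed: lit.len(),
        captured: Some(&input[..lit.len()]),
    })
}

/// `°` silent: keeps the consumed length but drops the capture.
fn silent<'a>(inner: Option<Outcome<'a>>) -> Option<Outcome<'a>> {
    inner.map(|outcome| Outcome { consumed: outcome.consumed, captured: None })
}

/// `~` peek: succeeds like the inner pattern but consumes and captures nothing.
fn peek<'a>(inner: Option<Outcome<'a>>) -> Option<Outcome<'a>> {
    inner.map(|_| Outcome { consumed: 0, captured: None })
}

/// `!` negative: succeeds only when the inner pattern fails; consumes nothing.
fn negative<'a>(inner: Option<Outcome<'a>>) -> Option<Outcome<'a>> {
    match inner {
        Some(_) => None,
        None => Some(Outcome { consumed: 0, captured: None }),
    }
}
```

In the RPN grammar, `°"("` corresponds to `silent(literal(input, "("))`: the parenthesis must be present and is skipped, but it never shows up in the result.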

Please note that, unlike any other feature, atomic patterns add a lifetime to the success type so that it can store a slice of the input. This avoids any heap allocation, but introducing an atomic pattern makes a lifetime appear in all parent patterns (and therefore in the global success type). This shouldn't be a problem in most cases, but keep it in mind.
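The effect on types can be illustrated with hand-written structs. The names below (`Int`, `Float`, `match_int`) are hypothetical and only mimic what generated types might look like for the `int` and `float` rules of the RPN grammar; the real generated code may differ.

```rust
/// An atomic `int` borrows its text from the input, so it needs a lifetime.
#[derive(Debug, Clone, PartialEq)]
struct Int<'a> {
    matched: &'a str,
}

/// `float` contains `int`, so the lifetime propagates to the parent type.
#[derive(Debug, Clone, PartialEq)]
struct Float<'a> {
    int_part: Int<'a>,
    frac_part: Int<'a>,
}

/// Match one or more ASCII digits and return the borrowed slice plus the rest.
/// No String is allocated: `matched` points directly into `input`.
fn match_int(input: &str) -> Option<(Int<'_>, &str)> {
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        return None;
    }
    Some((Int { matched: &input[..end] }, &input[end..]))
}
```

The practical consequence is that the parsed result cannot outlive the input string it was parsed from.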

Builtin rules

There are multiple builtin rules, each of which matches at most one character:

Rule's name	Description
`B_ANY`	Any character
`B_NEWLINE_CR`	Match `\r` newline characters
`B_NEWLINE_LF`	Match `\n` newline characters
`B_DOUBLE_QUOTE`	Match a double quote
`B_ASCII`	ASCII characters
`B_ASCII_ALPHABETIC`	ASCII alphabetic characters
`B_ASCII_ALPHANUMERIC`	ASCII alphanumeric characters
`B_ASCII_CONTROL`	ASCII control characters
`B_ASCII_DIGIT`	ASCII digits
`B_ASCII_GRAPHIC`	ASCII graphic characters
`B_ASCII_HEXDIGIT`	ASCII hexadecimal digits
`B_ASCII_LOWERCASE`	ASCII lowercase characters
`B_ASCII_PUNCTUATION`	ASCII punctuation characters
`B_ASCII_UPPERCASE`	ASCII uppercase characters
`B_ASCII_WHITESPACE`	ASCII whitespaces
`B_ALPHABETIC`	Unicode alphabetic characters
`B_ALPHANUMERIC`	Unicode alphanumeric characters
`B_CONTROL`	Unicode control characters
`B_LOWERCASE`	Unicode lowercase characters
`B_NUMERIC`	Unicode numeric characters
`B_UPPERCASE`	Unicode uppercase characters
`B_WHITESPACE`	Unicode whitespaces
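The builtin names closely mirror the classification methods of Rust's `char` type. The sketch below assumes that each builtin behaves like the `char` method of the same name (for example, B_ASCII_DIGIT as `char::is_ascii_digit`); this mapping is an assumption, not something confirmed here, and `builtin_matches` is a hypothetical function.

```rust
/// Test a character against a builtin rule name.
/// Returns None for unknown names. The mapping to `char` methods is an assumption.
fn builtin_matches(rule: &str, c: char) -> Option<bool> {
    Some(match rule {
        "B_ANY" => true,
        "B_NEWLINE_CR" => c == '\r',
        "B_NEWLINE_LF" => c == '\n',
        "B_DOUBLE_QUOTE" => c == '"',
        "B_ASCII" => c.is_ascii(),
        "B_ASCII_ALPHABETIC" => c.is_ascii_alphabetic(),
        "B_ASCII_ALPHANUMERIC" => c.is_ascii_alphanumeric(),
        "B_ASCII_CONTROL" => c.is_ascii_control(),
        "B_ASCII_DIGIT" => c.is_ascii_digit(),
        "B_ASCII_GRAPHIC" => c.is_ascii_graphic(),
        "B_ASCII_HEXDIGIT" => c.is_ascii_hexdigit(),
        "B_ASCII_LOWERCASE" => c.is_ascii_lowercase(),
        "B_ASCII_PUNCTUATION" => c.is_ascii_punctuation(),
        "B_ASCII_UPPERCASE" => c.is_ascii_uppercase(),
        "B_ASCII_WHITESPACE" => c.is_ascii_whitespace(),
        "B_ALPHABETIC" => c.is_alphabetic(),
        "B_ALPHANUMERIC" => c.is_alphanumeric(),
        "B_CONTROL" => c.is_control(),
        "B_LOWERCASE" => c.is_lowercase(),
        "B_NUMERIC" => c.is_numeric(),
        "B_UPPERCASE" => c.is_uppercase(),
        "B_WHITESPACE" => c.is_whitespace(),
        _ => return None,
    })
}
```

Under this assumption, the ASCII and Unicode variants differ on characters such as the non-breaking space (U+00A0), which is Unicode whitespace but not ASCII whitespace.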

External rules

A callback can be provided to the execution engine to handle external rules, which are prefixed with E_. See the documentation for more information.

License

This project is released under the Apache-2.0 license terms.

ClementNerma / Peggy