ClementNerma / Peggy

A PEG parser generator, written in Rust

Peggy's Parser

peggy is a small and efficient parser generator based on PEG grammars.

It can parse nested grammars and supports repetition operators. Error reporting is tailored to be as intuitive and readable as possible.

There are two crates: peggy, which contains all of the library's code, and peggy_macro, which generates a Rust module from a grammar to build a native parser.

Examples

You can find several examples in the source directories for peggy and peggy_macro, notably:

  • peggy/rpn - A Reverse Polish Notation (RPN) evaluator, using the runtime engine
  • macro/rpn - The same RPN evaluator but using a parser generator

RPN example

A Reverse Polish Notation (RPN) grammar may look like this:

S = °B_WHITESPACE                           # Whitespace
DEC_SEP = °("." | ",")                      # Decimal separator

int = @(B_ASCII_DIGIT+)                     # Integer
float = int DEC_SEP int                     # Floating-point number
number = int | float                        # Number

operator = "+" | "-" | "*" | "/"            # Operator
operand = number | paren_expr               # Operand
operation = operand S+ operand S* operator  # Complete operation

paren_expr = °"(" S* expr S* °")"           # Expression wrapped in parentheses
expr = number | operation | paren_expr      # Complete expression

main = expr                                 # Grammar's entrypoint

This will be able to match complex operations like (3 (9.3 3 /) +) (5 (2 3 /) /) /.
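To make the meaning of that sample input concrete, here is a minimal sketch that evaluates it with a plain stack. It uses only the standard library and does not go through peggy at all; the function name `eval_rpn` is hypothetical, and parentheses are treated as visual grouping only, which is an assumption about the intended semantics.

```rust
/// Evaluates an RPN expression with a plain stack (hypothetical helper, not part of peggy).
/// Parentheses are assumed to be grouping only and are skipped.
fn eval_rpn(input: &str) -> Result<f64, String> {
    let mut stack: Vec<f64> = Vec::new();

    // Turn parentheses into spaces, then split on whitespace.
    let spaced = input.replace('(', " ").replace(')', " ");

    for token in spaced.split_whitespace() {
        match token {
            "+" | "-" | "*" | "/" => {
                // The right operand was pushed last, so it is popped first.
                let right = stack.pop().ok_or("missing right operand")?;
                let left = stack.pop().ok_or("missing left operand")?;
                stack.push(match token {
                    "+" => left + right,
                    "-" => left - right,
                    "*" => left * right,
                    _ => left / right,
                });
            }
            _ => {
                // The grammar accepts both "." and "," as decimal separators.
                let value: f64 = token
                    .replace(',', ".")
                    .parse()
                    .map_err(|_| format!("invalid number: {}", token))?;
                stack.push(value);
            }
        }
    }

    // A well-formed expression leaves exactly one value on the stack.
    match stack.as_slice() {
        [result] => Ok(*result),
        _ => Err("expression did not reduce to a single value".to_string()),
    }
}

fn main() {
    println!("{:?}", eval_rpn("(3 (9.3 3 /) +) (5 (2 3 /) /) /"));
}
```

A grammar-driven parser additionally rejects malformed input (for example unbalanced parentheses), which this sketch does not attempt.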

Parser generator usage

Here is an example usage of the parser generator:

use peggy_macro::peggy_gen;

#[peggy_gen(filename = "rpn.peggy")]
pub mod rpn_grammar {}

fn main() {
    // Evaluate the expression
    let success = rpn_grammar::exec("(3 (9.3 3 /) +) (5 (2 3 /) /) /").unwrap();

    // Do your stuff
}

The main advantages of a parser generator are better performance (roughly 10 times faster than the optimized runtime engine), ease of use, and safety: you extract information directly from your grammar without having to handle unreachable cases. It also means that updating your grammar immediately shows which parts of your code need to be updated.

The generated types are optimized to be as lightweight and easy to use as possible; the only corner case is the use of the Rc type to store information in recursive patterns.

The success type returned by ::exec is generated from the input grammar; if your IDE doesn't expand procedural macros and doesn't show you the generated types, you can inspect the result's content with the dbg!() macro (or format!("{:#?}") for formatting purposes).

All of the generated types implement the Debug and Clone traits.
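As an illustration of what this means in practice, the sketch below defines hand-written stand-ins for generated types. The names `Expr` and `Operation` and their fields are hypothetical; the real types depend entirely on your grammar. It only shows why a recursive rule needs indirection such as Rc, and how Debug and Clone let you inspect a result.

```rust
use std::rc::Rc;

// Hypothetical shapes only: the types peggy_macro generates depend on the grammar.
#[derive(Debug, Clone)]
enum Expr {
    Number(f64),
    // A recursive rule needs indirection; with `Rc`, cloning only bumps a counter.
    Operation(Rc<Operation>),
}

#[derive(Debug, Clone)]
struct Operation {
    left: Expr,
    right: Expr,
    operator: char,
}

/// Builds a small tree standing in for the parsed form of `2 3 /`.
fn sample() -> Expr {
    Expr::Operation(Rc::new(Operation {
        left: Expr::Number(2.0),
        right: Expr::Number(3.0),
        operator: '/',
    }))
}

fn main() {
    let tree = sample();

    // `dbg!` prints the pretty-printed Debug form to stderr.
    dbg!(&tree);

    // `{:#?}` produces the same pretty-printed form as a String.
    let pretty = format!("{:#?}", tree.clone());
    println!("{}", pretty);
}
```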

Performance

On my computer (Intel Core i7-9700F), in release mode the grammar is parsed in 16 microseconds (0.016 milliseconds) while the runtime engine takes about 128 microseconds (0.128 milliseconds).

As you can guess, these times increase linearly with the size of the input, which can add up to several seconds if you parse tens of thousands of kilobytes.

With the parser generator, we go down from 128 microseconds to only 6.5 (0.0065 milliseconds).
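If you want to take comparable measurements on your own machine, a minimal timing sketch could look like the following. It uses only std::time; `average_time` and `to_millis` are hypothetical helpers, and the closure body is a stand-in workload that you would replace with your own parsing call. Remember to build in release mode.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `iterations` times and returns the mean duration of one run.
/// `iterations` must be greater than zero.
fn average_time<F: FnMut()>(iterations: u32, mut work: F) -> Duration {
    assert!(iterations > 0, "at least one iteration is required");
    let start = Instant::now();
    for _ in 0..iterations {
        work();
    }
    start.elapsed() / iterations
}

/// Converts a duration to fractional milliseconds.
fn to_millis(duration: Duration) -> f64 {
    duration.as_secs_f64() * 1000.0
}

fn main() {
    let input = "(3 (9.3 3 /) +) (5 (2 3 /) /) /";
    let mut total_tokens = 0usize;

    // Stand-in workload: replace the closure body with the call you want to measure.
    let mean = average_time(10_000, || {
        total_tokens += input.split_whitespace().count();
    });

    println!("tokens counted: {}", total_tokens);
    println!("mean per run: {:?} ({} ms)", mean, to_millis(mean));
}
```

Averaging over many iterations smooths out timer resolution, which matters when a single run takes only a few microseconds.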

Elegant error reporting

A simple grammar like the one shown in peggy/src/lib.rs will give the following error message:

ERROR: At line 1, column 7:

1 | Hello worlf !!
          ^

In rule [world]: Expected string literal "world"
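For readers curious how such a message can be assembled, here is a minimal sketch that reproduces the layout above from a byte offset into the input. The function `render_error` is hypothetical and is not peggy's actual implementation; it assumes the offset falls on a character boundary.

```rust
/// Formats an error in the layout shown above (hypothetical helper, not peggy's code).
/// `offset` is a byte offset into `input` and must lie on a character boundary.
fn render_error(input: &str, offset: usize, rule: &str, message: &str) -> String {
    let before = &input[..offset];

    // Lines and columns are 1-based.
    let line_number = before.matches('\n').count() + 1;
    let line_start = before.rfind('\n').map_or(0, |index| index + 1);
    let column = before[line_start..].chars().count() + 1;

    // The full text of the line that contains the error.
    let line = input[line_start..].lines().next().unwrap_or("");

    // The caret is shifted by the gutter ("1 | ") plus the column.
    let gutter = format!("{} | ", line_number);
    let padding = " ".repeat(gutter.len() + column - 1);

    format!(
        "ERROR: At line {}, column {}:\n\n{}{}\n{}^\n\nIn rule [{}]: {}",
        line_number, column, gutter, line, padding, rule, message
    )
}

fn main() {
    let report = render_error(
        "Hello worlf !!",
        6,
        "world",
        "Expected string literal \"world\"",
    );
    println!("{}", report);
}
```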

Grammar specifications

General syntax rules:

  • Comments start with # and run to the end of the current line. They can be placed anywhere, but a # inside a string is treated as part of the string.
  • Lines can start and end with whitespace, which is ignored
  • Whitespace can be placed anywhere unless stated otherwise

If a line consists only of three # symbols (optionally surrounded by whitespace), it opens a multi-line comment. Every line is ignored until another line made only of three # symbols is found.
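The comment rules above can be sketched as a small pre-processing pass. This is a std-only illustration with a hypothetical function name (`strip_comments`), not the way peggy itself handles comments; it relies on the fact that strings have no escaping, so a double quote always toggles the "inside a string" state.

```rust
/// Removes comments from a grammar source and returns the remaining non-empty lines.
/// Hypothetical helper for illustration only.
fn strip_comments(source: &str) -> Vec<String> {
    let mut in_block_comment = false;
    let mut kept = Vec::new();

    for line in source.lines() {
        // A line made only of "###" opens or closes a multi-line comment.
        if line.trim() == "###" {
            in_block_comment = !in_block_comment;
            continue;
        }
        if in_block_comment {
            continue;
        }

        // Find the first '#' that is not inside a double-quoted string.
        let mut in_string = false;
        let mut end = line.len();
        for (index, ch) in line.char_indices() {
            match ch {
                '"' => in_string = !in_string,
                '#' if !in_string => {
                    end = index;
                    break;
                }
                _ => {}
            }
        }

        // Leading and trailing whitespace is ignored.
        let code = line[..end].trim();
        if !code.is_empty() {
            kept.push(code.to_string());
        }
    }

    kept
}

fn main() {
    let source = "S = °B_WHITESPACE   # Whitespace\n###\nnotes\n###\nhash = \"#\"";
    for line in strip_comments(source) {
        println!("{}", line);
    }
}
```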

The grammar is made of multiple lines, which can either be:

  • Empty
  • Made only of whitespace
  • A comment, optionally preceded by whitespace
  • A rule declaration

Rule declarations start with the rule's name, which must respect the following rules:

  • Only alphanumeric characters and underscores are allowed
  • The name cannot start with a digit
  • The name cannot start with B_ as this is reserved for builtin rules
  • The name cannot start with E_ as this is reserved for external rules
  • The name must contain at least one character
  • Two rules cannot have the same name
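These naming rules translate directly into a short validation routine. The sketch below is std-only and uses hypothetical function names; it also assumes that "alphanumeric" means ASCII alphanumeric, which the text above does not specify.

```rust
use std::collections::HashSet;

/// Checks a single declared rule name against the rules listed above.
/// Assumption: "alphanumeric" is interpreted as ASCII alphanumeric.
fn check_rule_name(name: &str) -> Result<(), String> {
    let first = match name.chars().next() {
        Some(c) => c,
        None => return Err("rule names must contain at least one character".to_string()),
    };
    if first.is_ascii_digit() {
        return Err(format!("'{}' starts with a digit", name));
    }
    if name.starts_with("B_") {
        return Err(format!("'{}' uses the B_ prefix reserved for builtin rules", name));
    }
    if name.starts_with("E_") {
        return Err(format!("'{}' uses the E_ prefix reserved for external rules", name));
    }
    if !name.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') {
        return Err(format!("'{}' contains a forbidden character", name));
    }
    Ok(())
}

/// Checks every declared name and rejects duplicates.
fn check_rule_names(names: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for name in names {
        check_rule_name(name)?;
        if !seen.insert(*name) {
            return Err(format!("rule '{}' is declared twice", name));
        }
    }
    Ok(())
}

fn main() {
    println!("{:?}", check_rule_names(&["S", "DEC_SEP", "int", "float"]));
    println!("{:?}", check_rule_names(&["int", "B_custom"]));
}
```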

They continue with the assignment operator (=) and the rule's content, which is made of a pattern. Patterns can either be:

  • A fixed string, between double quotes - there is no escaping mechanism; newline symbols and double quotes can be matched using builtin rules
  • Another rule's name (the provided rule will be used for matching)
  • A group (a pattern wrapped in parentheses)
  • A list of patterns separated by whitespaces (all patterns will need to match the input)
  • A union of patterns separated by vertical bars | (at least one of the patterns will need to match the input)

Patterns can be decorated with a repetition model (no whitespace may appear between the end of the pattern and the model). It can be one of:

  • +: match this pattern as much as possible, but at least once
  • *: match this pattern as much as possible, zero matching is allowed
  • ?: match this pattern one time if possible, zero matching is allowed
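To show how literals, sequences, unions and the three repetition models fit together, here is a deliberately tiny matcher written with the standard library only. It is a sketch, not peggy's engine: the `Pattern` enum and `match_at` function are hypothetical, unions are tried in order (the usual PEG ordered choice, assumed here), and nothing is captured.

```rust
/// A minimal pattern tree (hypothetical; peggy's internal representation may differ).
enum Pattern {
    Literal(&'static str),
    Sequence(Vec<Pattern>),
    Union(Vec<Pattern>),
    OneOrMore(Box<Pattern>),  // +
    ZeroOrMore(Box<Pattern>), // *
    Optional(Box<Pattern>),   // ?
}

/// Tries to match `pattern` at byte offset `pos`; returns the offset just after the match.
fn match_at(pattern: &Pattern, input: &str, pos: usize) -> Option<usize> {
    match pattern {
        Pattern::Literal(text) => {
            if input[pos..].starts_with(*text) {
                Some(pos + text.len())
            } else {
                None
            }
        }
        // Every part must match, each one starting where the previous one ended.
        Pattern::Sequence(parts) => parts
            .iter()
            .try_fold(pos, |at, part| match_at(part, input, at)),
        // Assumption: alternatives are tried in order and the first match wins.
        Pattern::Union(options) => options
            .iter()
            .find_map(|option| match_at(option, input, pos)),
        Pattern::OneOrMore(inner) => {
            let first = match_at(inner, input, pos)?;
            Some(repeat(inner, input, first))
        }
        Pattern::ZeroOrMore(inner) => Some(repeat(inner, input, pos)),
        Pattern::Optional(inner) => Some(match_at(inner, input, pos).unwrap_or(pos)),
    }
}

/// Matches `inner` as many times as possible and returns the final offset.
fn repeat(inner: &Pattern, input: &str, mut pos: usize) -> usize {
    while let Some(next) = match_at(inner, input, pos) {
        if next == pos {
            break; // an empty match would otherwise loop forever
        }
        pos = next;
    }
    pos
}

/// Equivalent of the grammar pattern: "a"+ ("b" | "c")?
fn sample_pattern() -> Pattern {
    Pattern::Sequence(vec![
        Pattern::OneOrMore(Box::new(Pattern::Literal("a"))),
        Pattern::Optional(Box::new(Pattern::Union(vec![
            Pattern::Literal("b"),
            Pattern::Literal("c"),
        ]))),
    ])
}

fn main() {
    let pattern = sample_pattern();
    for input in ["aaac", "a", "b"] {
        println!("{:?} -> {:?}", input, match_at(&pattern, input, 0));
    }
}
```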

Patterns can also be given a mode by prefixing them with a character (no space allowed):

  • °: silent pattern - will not capture anything
  • ~: peek pattern - does not capture or consume anything
  • !: negative pattern - will match only if the inner pattern doesn't; does not capture or consume anything
  • @: atomic pattern - will be returned as a single string if it matches

Please note that, unlike any other feature, atomic patterns add a lifetime to the success type so that it can store a slice of the input. This avoids any form of heap allocation, but a lifetime will appear in all parent patterns (and therefore in the global success type) as soon as you introduce an atomic pattern. This shouldn't be a problem in most cases, but keep it in mind.
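The effect of that lifetime can be illustrated with hand-written types. In the sketch below, `Int`, `Float` and `split_float` are hypothetical stand-ins for what an atomic int rule and its parent float rule might produce; the generated types will differ. The point is only that a type borrowing from the input forces a lifetime parameter onto every type that contains it.

```rust
// Hypothetical success types, written by hand for illustration only.

/// Stand-in for an atomic pattern: it borrows a slice of the input, so no allocation.
#[derive(Debug, Clone)]
struct Int<'a>(&'a str);

/// A parent pattern that contains atomic patterns inherits their lifetime.
#[derive(Debug, Clone)]
struct Float<'a> {
    whole: Int<'a>,
    fraction: Int<'a>,
}

/// Splits `input` on "." or "," and borrows both halves from it.
fn split_float(input: &str) -> Option<Float<'_>> {
    let (whole, fraction) = input.split_once(|c: char| c == '.' || c == ',')?;
    let is_digits = |s: &str| !s.is_empty() && s.chars().all(|c| c.is_ascii_digit());
    if is_digits(whole) && is_digits(fraction) {
        Some(Float {
            whole: Int(whole),
            fraction: Int(fraction),
        })
    } else {
        None
    }
}

fn main() {
    let input = String::from("9.3");
    // `parsed` cannot outlive `input`: the compiler enforces this through the lifetime.
    let parsed = split_float(&input);
    println!("{:?}", parsed);
}
```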

Builtin rules

There are multiple builtin rules, each of which matches at most one character:

Rule's name             Description
B_ANY                   Any character
B_NEWLINE_CR            A \r newline character
B_NEWLINE_LF            A \n newline character
B_DOUBLE_QUOTE          A double quote
B_ASCII                 ASCII characters
B_ASCII_ALPHABETIC      ASCII alphabetic characters
B_ASCII_ALPHANUMERIC    ASCII alphanumeric characters
B_ASCII_CONTROL         ASCII control characters
B_ASCII_DIGIT           ASCII digits
B_ASCII_GRAPHIC         ASCII graphic characters
B_ASCII_HEXDIGIT        ASCII hexadecimal digits
B_ASCII_LOWERCASE       ASCII lowercase characters
B_ASCII_PUNCTUATION     ASCII punctuation characters
B_ASCII_UPPERCASE       ASCII uppercase characters
B_ASCII_WHITESPACE      ASCII whitespace
B_ALPHABETIC            Unicode alphabetic characters
B_ALPHANUMERIC          Unicode alphanumeric characters
B_CONTROL               Unicode control characters
B_LOWERCASE             Unicode lowercase characters
B_NUMERIC               Unicode numeric characters
B_UPPERCASE             Unicode uppercase characters
B_WHITESPACE            Unicode whitespace
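The builtin names closely mirror the classification methods on Rust's char type, so their meaning can be sketched as a lookup onto those methods. This mapping is an assumption made for illustration (the function `builtin_matches` is hypothetical); check peggy's source for the authoritative definitions.

```rust
/// Returns whether `c` would match the named builtin rule, or `None` for an unknown name.
/// Assumption: each builtin corresponds to the similarly named `char` method.
fn builtin_matches(rule: &str, c: char) -> Option<bool> {
    Some(match rule {
        "B_ANY" => true,
        "B_NEWLINE_CR" => c == '\r',
        "B_NEWLINE_LF" => c == '\n',
        "B_DOUBLE_QUOTE" => c == '"',
        "B_ASCII" => c.is_ascii(),
        "B_ASCII_ALPHABETIC" => c.is_ascii_alphabetic(),
        "B_ASCII_ALPHANUMERIC" => c.is_ascii_alphanumeric(),
        "B_ASCII_CONTROL" => c.is_ascii_control(),
        "B_ASCII_DIGIT" => c.is_ascii_digit(),
        "B_ASCII_GRAPHIC" => c.is_ascii_graphic(),
        "B_ASCII_HEXDIGIT" => c.is_ascii_hexdigit(),
        "B_ASCII_LOWERCASE" => c.is_ascii_lowercase(),
        "B_ASCII_PUNCTUATION" => c.is_ascii_punctuation(),
        "B_ASCII_UPPERCASE" => c.is_ascii_uppercase(),
        "B_ASCII_WHITESPACE" => c.is_ascii_whitespace(),
        "B_ALPHABETIC" => c.is_alphabetic(),
        "B_ALPHANUMERIC" => c.is_alphanumeric(),
        "B_CONTROL" => c.is_control(),
        "B_LOWERCASE" => c.is_lowercase(),
        "B_NUMERIC" => c.is_numeric(),
        "B_UPPERCASE" => c.is_uppercase(),
        "B_WHITESPACE" => c.is_whitespace(),
        _ => return None,
    })
}

fn main() {
    // A no-break space is Unicode whitespace but not ASCII whitespace.
    let nbsp = '\u{00A0}';
    println!("{:?}", builtin_matches("B_WHITESPACE", nbsp));
    println!("{:?}", builtin_matches("B_ASCII_WHITESPACE", nbsp));
}
```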

External rules

A callback can be provided to the execution engine to handle external rules, which are prefixed with E_. See the documentation for more information.

License

This project is released under the Apache-2.0 license terms.
