haxscramper / hparse

Collection of parser utilities for nim - compile/runtime parser generator.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

deprecated, writing parsers by hand wins for all of my use cases, so don’t plan to maintain/generic parser generator library.

Collection of various utilities related to parsing, ranging from statically typed wrapper on top of scanf to full parser generator, automatic tree-sitter wrappers etc.

Pure nim parser generator supports EBNF notation, tree actions, template rules, compile/runtime parsing and automatic parse tree generation.

Tree-sitter wrapper generator currently WIP, but can already generate user-fiendly interfaces for existing grammars.

Installation and setup

Pure library

nimble install hparse

tree-sitter

hts-wrapgen command-line utility comes instealled with hparse, but it need additional dependencies in order to actually generate grammar.

For official installation instructions for tree-sitter’s getting started page.

tree-sitter-cli installation

Arch linux

tree-sitter-cli is available, and can be installed using sudo pacman -S tree-sitter

Other distros

I couldn’t find any conclusive instructions about installing tree-sitter-cli on other distributions, so this is a most general approach that ‘should work’.

Npm configuration

If you are already familliar with npm, have it installed etc. - you can skip this part. For people like me who have no idea on setting it up:

sudo apt-get -qq install -y npm
export NPM_PACKAGES=$HOME/.npm-packages
npm config set prefix $NPM_PACKAGES
export PATH="$PATH:$NPM_PACKAGES/bin"
Installing command-line tool
npm install tree-sitter-cli -g
Building tree-sitter library
cd /tmp
wget https://github.com/tree-sitter/tree-sitter/archive/0.17.3.tar.gz
tar -xvf 0.17.3.tar.gz && cd tree-sitter-0.17.3
sudo make install
export LD_LIBRARY_PATH=/usr/local/lib

NOTE: last line might not be necessary, but when testing installation in debian container I run into issues where tree-sitter library would not be detected without it.

Using wrapper generator

After all necessary dependencies have been installed, you can generate wrapper for arbitrary grammar file using hts-wrapgen grammar <arguments>. Most commonly used switches are

  • --grammarJs=<grammar file.js> to set input grammar file
  • --langPrefix=<language-name>

You can also use --parserUser=nimCode.nim to immediately compiler test nim file. It is not necessary to always use hts-wrapgen when compiling code that uses generated parser, but it has some built-in heuristics to provide better error messages for common errors (mostly related to linking).

Using generated wrappers

hts-wrapgen generates wrapper file for your language, named <language-name>_wrapper.nim. Wrapper consists of several distict wrapper types, enum listing all possile node kinds, and several helper procs. Generated API is made to resemble that of std/macros as close as possible - len, indexation via [], kind() proc and so on.

By default hts-wrapgen compiles generated parser source into static library which fill be then saved as <language-name>_parser.o alongside your grammar file.

To build final application you need to link it with tree-sitter library. To do this from command-line you need to use --passl:-ltree-sitter. Or use {.passl: "-ltree-sitter".} pragma in your code.

Links

Tree-sitter wrapper

prerequisites

  • Input grammar file, usually called grammar.js. Note that it is not mandatory to name it like this, it is perfectly fine to have cpp-grammar.js for example. Grammar file example, tutorial on writing grammars.

setup guide & compilation process

To compile tree-sitter grammar and generate wrappers run

hts-wrapgen grammarFromFile \
  --grammar=your-grammar.js
  --lang=language

example

Create temporary directory for you project - during compilation a lot of auxiliary files are created, most of which are not needed later on.

For purposes of demonstration cpp grammar will be used. You can get all necessary files by either manually downloading grammar.js and scanner.cc from tree-sitter-cpp repository on github or using wget

wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/src/scanner.cc
wget https://raw.githubusercontent.com/tree-sitter/tree-sitter-cpp/master/grammar.js

To compile cpp grammmar tree-sitter needs to have tree-sitter-c js module installed. It can be obtained using npm install tree-sitter-c

possible errors

linker errors

tree-sitter runtime
Parser generated by tree-sitter is not fully standalone and you program is need to be linked against tree-sitter library. If linker fails with ~undefined reference to `ts_parser_new’~ (or reference to similar function) most likely you need to pass linker flags via either {.passl: "-ltree-sitter".} in your code or --passl:-ltree-sitter
external scanners
External scanners allow you to write custom C code which runs during the lexing process in order to handle lexical rules that cannot be described by regular expressions. This functions are written in separate scanner.c file (example for C++ parser) that has to be compiled and then linked with final application. If you have errors like ~undefined reference to `tree_sitter_cpp_external_scanner_destroy’~ (note word ‘external’) this is indication that you must also link compiled scanner code (for example using {.passl: "cppscanner.o".} (note: name of the linked object file will be different depending on your language name))
c++ stdlib
some scanner implementations might be written in C++ and therefore depend on C++ runtime to operate. If you get compilation errors like undefined reference to `std::__cxx11::basic_ then you need to pass {.passl: "-lstdc++".}
parser runtime
all other linking errors are most likely related to missing linking with compiled parser and can be solved by adding {.passl: "cppparser.o".} (note: name of the linked object file will be different depending on your language name)

Usage in applications

Some of the gree-sitter grammars can be tried out online in interactive playground

Wrapper generates heterogeneous AST with defined [] operator, len and kind procs, which means it can be used with pattern matching. Until I finish implementation you can import it form hmisc/macros/matching and use like

case tree[0]:
  of Declaration[@dtype, .._]:
    echo "first is declaration with type ", dtype.strVal()

It requires to enable {.experimental: "caseStmtMacros".}

  • Generate tree-sitter grammar files using EBNF notation from pure nim parser generator. Grammar is already available at runtime as value so it is simple matter of converting it into javascript code.
  • Implement custom scanners in nim. You already compile nim code to C, so why spend 10x effort writing things in C for scanners when you can just do the same in nim. It is already possible to do this, but process could be streamlined even more.

Small utilities

tscanf

Statically typed wrapper on top of scanf, supports all matcher syntax (e.g $w, $i etc) as well as custom matcher procedures. Tuple ml is injected in the scope with following types for each capture:

  • i, o, b, h : int
  • f : float
  • *, + : string
  • ${matcherProc} : if matcher proc has signature (s: string, arg: var T, start: int): int then tuple field will be of type T
import hparse/tscanf

func matcher1(s: string, arg: var seq[string], start: int): int =
  #                               ^^^^^^^^^^^
  #                               type of the captured variable
  arg = @["##", "$$"]
  return s.len - start

if tscanf("12x12---%1,1,1,1", "$ix$i$+%${matcher1}"):
  echo ml[0] / 2, " = ", ml[2]," ", ml[3]

  assert declared(ml)
  assert type(ml[3]) is seq[string]
  #                     ^^^^^^^^^^^
  #                     Resulting field type in tuple
else:
  assert not declared(ml)
  echo "does not match"
6.0 = --- @["##", "$$"]

hparse/tokenize

Simple addition to parseutils library from stdlib - separate string on tokens based on character sets.

import hparse/tokenize
import hpprint

type
  LispPart = enum
    lpPunct
    lpQuoted
    lpIdent
    lpIntLit

pprint "(hello '(world) 22)".tokenize({
  {'(', ')'} : lpPunct,
  {'0'..'9'} : lpIntLit,
  {'\'', 'a'..'z', 'A'..'Z', ')', '('} : lpQuoted,
  {'a'..'z', 'A'..'Z'} : lpIdent
})
- (lpPunct, "(")
- (lpQuoted, "hello")
- (lpQuoted, "'(world)")
- (lpIntLit, "22")
- (lpPunct, ")")

[WIP] rx macro

Lisp notation for regex description. reimplementation of emacs-lisp rx macro.

(print (rx (and "a" "E") (or "()" "{}")))
aE\(?:()\|{}\)

Parser generator

Parser generator focuses on simplicity and ease of use. Concrete implementation details of particular parser algorithms are hidden as much as possible - you write grammar and provide input tokens, and get a tree. Whole API can be described as

let grammar = makeGrammar:
  # grammar definition

let parser = new<Algorithm-name>Parser(grammar)
var stream = # create token stream
let tree = parser.parse(stream)

Grammar description

Grammar described using EBNF notation, with only exception being use of prefix notation - e.g. for zero-or-more you need to write *E.

Example of very simple grammar:

const defaultCategory = catNoCategory
let grammar = initGrammar[NoCategory, string]:
  A ::= *("hello" & "world")

More complex example (with result tree)

exampleGrammarConst(grammar):
  List ::= !"[" & Elements & !"]"
  Elements ::= Element & @*(@(!"," & Element))
  Element ::= "i" | List

let parser = exampleParser(grammar)
var stream = "[i,i,[i,i,i],i]".mapIt($it).makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()
+-> List
    +-> Elements
        +-> Element +-> 'i'
        +-> Element +-> 'i'
        +-> Element
        |   +-> List
        |       +-> Elements
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        |           +-> Element +-> 'i'
        +-> Element +-> 'i'

DSL syntax

EBNF syntax

Note: <string> means a string literal, like “|????”

  • * zero-or-more
  • + one-or-more
  • ? optional
  • & concatenation
  • | alternative
  • Nonterminal ::= ... declare new nontemrinal. Identifier must be uppercased.
  • <string> token literal. Default category is used
  • <string>.prCat or <string>.cat token literal with lexeme <string> and category prCat. Prefix is automatically inferred on grammar construction and can be omitted.
  • [[ expr ]] token with lexeme predicate.
  • [ ... ] option

Tree actions prefix

  • ! drop
  • @ splice-discard
  • ^ promote
  • ^@ splice-promote

Prefix combinations

  • !* drop zero-or-more elements
  • !+ drop one-or-more
  • @+ splice one-or-more
  • @* splice zero-or-more
  • !? drop optional
  • ^? promote optional

Invalid combinations: *!, +!, *@, +@, *^@, +^@, +^, *^

Delimiters

Nonterminals

Tree actions

Result of parser generator is a parse tree - very representation of original source code and contains all helper symbols (punctuation, brackets, precedence levels etc). All of this cruft is necessary to correctly recognize input sequence of tokens, but completely irrelevant afterwards - in nested list grammar only Elements are actually necessary, everything else can be thrown away immediately. Tree actions are intended for this exact purpose - dropping unnecessary parts of the parse tree, flattening out nested parts etc. Right now there is five type of tree actions (four implemented).

Drop

Completely remove subtree element

echo ecompare(@["a", "b", "c"]) do:
  A ::= "a" & "b" & "c"
do:
  A ::= "a" & !"b" & "c"
+-> A        +-> A
    +-> 'a'      +-> 'a'
    +-> 'b'      +-> 'c'
    +-> 'c'

Splice discard

Add subnode elements in parent tree. Subtree head is removed.

echo ecompare(@["-", "+", "+", "+", "-"]) do:
  A ::= "-" & *"+" & "-"
do:
  A ::= "-" & @*"+" & "-"
+-> A                +-> A
    +-> '-'              +-> '-'
    +-> [ [ ... ] ]      +-> '+'
    |   +-> '+'          +-> '+'
    |   +-> '+'          +-> '+'
    |   +-> '+'          +-> '-'
    +-> '-'

Splice promote

Splice all node node elements and replace parent node. NOTE: this replaces only parent node - in expression like E ::= A & B parent node for B is concatenation - not nonterminal head.

echo ecompare(@["-", "+", "+", "+"]) do:
  A ::= "-" & B
  B ::= *"+"
do:
  A ::= "-" & ^@B
  B ::= *"+"
+-> A            +-> A
    +-> '-'          +-> B
    +-> B                +-> '-'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'
        +-> '+'          +-> '+'

Subrule

Move part of the tree into separate list

echo ecompare(@["-", "z", "e"]) do:
  A ::= "-" & "z" & "e"
do:
  A ::= "-" & { "z" & "e" }
+-> A        +-> A
    +-> '-'      +-> '-'
    +-> 'z'      +-> [ [ ... ] ]
    +-> 'e'          +-> 'z'
                     +-> 'e'

Promote

Parse templates

Some patterns often occur in grammar construction - list with delimiters, kv pairs etc. Even though grammar is pretty simple, writing something like Element & @*(@(!"," & Element)) over and over again is not really fun. Parse templates are designed to solve this issue.

Parse template is a function that will be executed to produce part of the pattern. In this example we generate template rule for comma-separated list of strings.

proc csvList(str: string): Patt[NoCategory, string] =
  andP(
    makeExpNoCat(str).tok(),
    zeroP(andP(
      makeExpNoCat(",").tok().addAction(taDrop),
      makeExpNoCat(str).tok()
    ).addAction(taSpliceDiscard)
    ).addAction(taSpliceDiscard))

echo csvList("@").exprRepr()

echo eparse(@["@", ",", "@"], A ::= %csvList("@"))
{'@' & @*(@{!',' & '@'})}
+-> A
    +-> '@'
    +-> '@'

DSL syntax is %functionName(..<list-of-arguments>..). For codegen-based parsers (recursive LL(1) and LL(*)) function MUST be executable at compile-time. In all other cases grammar construction happens at runtime. In example above LL(*) parser was used.

Parse tree and tokens

Token is has three generic parameters, referred to as C, L and I throughout codebase.

  • First one is ‘category’ for token. It is expected (but not mandatory) to be an enum. Category is usuall things like punctuation, identifier, string/int literal, etc. If you don’t need token category use NoCategory enum.A
  • Second parameter - ‘lexeme’. It is can be absolutely anything (void included). This field stores ‘all other’ information about token - integer/string value for literals for example.
  • Last parameter ‘information’. Similar to lexeme - but made for storing additional ‘metainformation’ for token: position in source code, order in original token stream etc. THis information is NOT used in parsing.

For example of custom token category/lexeme see manual.org

Token lexeme predicates

Token is accepted if lexeme predicate evaluates to ‘true’. Predicate is placed in double square braces = [[ expr ]]. Depending on syntax of the expression different actions are performed.

  • if it is Infix, Call or DotExpr (ex: it in ["a", "B"], startsWith(it, "..")) whole expression is wrapped into predicate function proc(it: L): bool {.noSideEffect.} = <your-expression>.
  • otherwise it is passed to makeExpTokenPredUsr(cat: C, val: <your-expression-type>)
import strutils, strformat
const defaultCategory = catNoCategory


func makeExpTokenPredUsr(
  cat: NoCategory, valset: bool): ExpectedToken[NoCategory, string] =

  result = makeExpTokenPred[NoCategory, string](
    catNoCategory, # Expected token category
    &"[{valset}]", # string representation of expected token predicate
                   # (for pretty-printing)
    proc(str: string): bool = valset # Construct predicate yourself
  )

initGrammarConst[NoCategory, string](grammar):
  A ::= *(B | C)
  B ::= [[ it.startsWith("@") ]]
  #          ^^^^^^^^^^^^^^^^^^
  #          |
  #          Copied to predicate directly
  C ::= [[ true ]] # Fallback nonterminal
  #        ^^^^
  #        |
  #        Passed to `makeExpTokenPredUsr`

let parser = newLLStarParser[NoCategory, string, void](grammar)
var stream = @["@ident", "#comment", "@ident"].makeTokens().makeStream()
let tree = parser.parse(stream)
echo tree.treeRepr()

Development

Large part of the design is described in devnotes.org, all functions and types are documented in the source code. If you have any additional questions feel free to join my discord server and ask questions there.

Rationale

I’m not an expert on parsing algorithms and related things, so I tried to design it in a way that would actually abstract things and make it easy to understand the API.

Not supporting syntactic predicates allows use of multiple parsing algorithms for the same grammar, ranging from restrictive but fast LL(1) to something like earley parser.

The parser abstracts notion of token and is not tied to any lexer implementation - if you want to can just split string on spaces and call it a lexer. Or you can do some heuristics in lexer and assign category based on context. Or something else, I don’t know now.

The whole grammar is available as a value, which means it is possible to easily do all sorts of preprocessing, error detection (like using undeclared nonterminal, left recursion detection and so on).

Tree actions and template rules provide small, but hopefully useful subset of syntactic actions. Advantage - it is possible to know how exactly the tree will look like. Generating statically typed case object for a grammar is possible.

Parser generator was originally intended to work in conjunction with term rewriting system. You write grammar in EBNF notation, dropping all cruft immediately (using splice-discard and drop rules) and then declaratively transform tree into something else.

State of development

Parser generator is currently work-in-progress. All advertized features are implemented, but number of supported algorithms is lacking - fully supported is only backtracking LL(*). Codegen and table-driven LL(1) are partially supported (have some weird bugs). Some work has been done on adding SLR and Earley parser.

Parser generator has relatively clean and documented internal API, designed to make implementation of new algorithms as simple as possible (most of details are abstracted).

Contribution

All sorts of contributions are welcome - issues, unit tests, documentation updates etc.

In addition there are several things that I wasn’t able to implement myself. If you are interested to solve one of there problems it will be especially useful.

If you have any question about implementation details, API etc. you can join my discord server.

Unsolved problems

tree-sitter fails at runtime inside docker container

WARNING: another issue I ran into - when actually running compiled & linked library in container, it just dies with, and I have no idea how to fix it.

__memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:440
440	../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.

Fix tree after EBNF -> BNF rewriting

Only recursive descent parsers can accept EBNF notation as-is. Every other one requires conversion from EBNF to BNF (implemented, tested). The problem is - this trasnformation changes shape of the parsed tree. For example A ::= *(E) is converted to A ::= E1 and E1 ::= Ɛ | E E1 - recursion is replaced with iteration.

const defaultCategory = catNoCategory
initGrammarConst[NoCategory, string](grammar):
  A ::= "hello" & *(B) & "world"
  B ::= "!!"

var toks = @[
  "hello", "!!", "!!", "!!", "world"].makeTokens().makeStream()

let grammarVal =
  block:
    let tmp = grammar
    tmp.toGrammar()

echo "Original grammar"
echo grammarVal.exprRepr()
echo "---\n"

echo "Grammar converter to BNF"
echo grammarVal.toBNF().exprRepr()
echo "---\n"

echo "Recursive descent tree"
let parser1 = newLLStarParser[NoCategory, string, void](grammar)
let tree1 = parser1.parse(toks)
echo tree1.treeRepr()
echo "---\n"

toks.revertTo(0)

echo "Table-driven parser tree without structure fixup"
let parser2 = newLL1TableParser(
  grammarVal,
  dofixup = false,
  retainGenerated = true
)
let tree2 = parser2.parse(toks)
echo tree2.treeRepr()
echo "---\n"


toks.revertTo(0)

echo "Table-driven parser tree with fixup"
let parser3 = newLL1TableParser(grammarVal, dofixup = true)
let tree3 = parser3.parse(toks)
echo tree3.treeRepr()
echo "---\n"
Original grammar
A            ::= {'hello' & *(<B>) & 'world'}
B            ::= '!!'
---

Grammar converter to BNF
A  ::=
.0 | 'hello' & <A0_1> & 'world'

B  ::=
.0 | '!!'

A0_1  ::=
.0 | Ɛ
.1 | <B> & <@A0_1>

---

Recursive descent tree
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree without structure fixup
+-> A
    +-> 'hello'
    +-> A0_1
    |   +-> B +-> '!!'
    |   +-> A0_1
    |       +-> B +-> '!!'
    |       +-> A0_1
    |           +-> B +-> '!!'
    +-> 'world'
---

Table-driven parser tree with fixup
+-> A
    +-> 'hello'
    +-> [ [ ... ] ]
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    |   +-> B +-> '!!'
    +-> 'world'
---

Instead of *(B) new rule A0_1 is introduced, with two possible alternatives: either empty production (Ɛ) or B, followed by A0_1 again. How this conversion affects parse tree can be seen in the output: instead of simple list of elements you get deeply nested tree of A0_1. This is fixed automatically when converting EBNF grammar to BNF by adding ‘splice’ rule on every use of newly generated pattern.

It kind of works (not really tested though), but I’m yet to figure how to preserve original tree actions. For example, when converting something like @*(@{!',' & <Element>})} to BNF it gets flattened out, and it is not clear how to first splice things in !',' & <Element>, and then splice it again.

Future development

  • [ ] support `<token-literal>` in grammar
  • [ ] generate errors on unknown nonterminals used in production
  • [ ] Unit test for nimscript and js
  • [ ] Error reporting. Right now it is basically non-existent

Generate statically typed parse tree

Right now parse tree is ‘stringly typed’ - nonterminal heads are described using string and all subnodes are placed in the same subnodes: seq[ParseTree[...]].

Grammar DSL contains all necessary information to construct case object with selector enum as well as order all fields (LL(*) parser uses constant grammar to generate set of mutally recursive functions). Tree actions could provide almost all necessary information for field types and ordering.

Possible mapping from grammar to constructed object

  • Nterm ::= ... -> of ptrNterm: <fields>
  • E1 & E2 & E3 -> tuple[e1: <type-of-E1>, ... ]
  • *E1 and +E1 -> seq[<type-of-E1>]
  • ?E1 -> Optional[<type-of-E1>]
  • E1 | E2 -> case idx: [<number-of-alternatives>] and each alternative gets it’s own field. Case objects can be nested so this is not a problem.
  • <token> -> tok: Token[...]

There are several questions related to possible use cases, ease of use etc.

  • [ ] Determenistic and intuitive names for fields.
  • [ ] How fields should be named? It is not possible to have same-named fields in nim case objects.

Different type of tree

Right now ParseTree[C, L, I] is hardcoded into all parsers - I don’t think it will be enough for all use cases.

  • It is required to make separate type of parse tree defined for each grammar is
  • Inegration with nimtrs - construct term instead of parse tree and maybe run rewriting actions immediately.

L and S-attributed grammars

Parser based on definitive clause grammars

I’m like, 40% sure that I’m not sure about what it is, but it looked nice when I saw it last time. It is related to prolog and nimtrs already implements large portions (no clauses and backtracking but full support of unification and all auxiliary functions for working with terms and environments).

DSL error reporting

DSL for this library uses hmisc/hexceptions to generate much better compilation errors in case of malformed DSL.

let tree = "h".exampleParse:
  A ::= !@*("h")

echo tree.treeRepr()
Unexpected prefix: '!@*'

 2   let tree = "h".exampleParse:
 5:8   A ::= !@*("h")
             ^^^
             |
             Incorrect prefix combination



Raised in grammar_dsl.nim:112


 [CodeError:ObjectType]

NOTE: output is not colored in readme (because github fails to support this basic feature since 2014), but it is colored by default terminal (controlled by using -d:plainStdout compilation flag).

About

Collection of parser utilities for nim - compile/runtime parser generator.


Languages

Language:Nim 99.4%Language:JavaScript 0.5%Language:Shell 0.1%