dlang-community / Pegged

A Parsing Expression Grammar (PEG) module, using the D programming language.

Bug: Rule-Level Semantic Actions Not Being Applied

vnayar opened this issue

First let me reference the expected behavior from the documentation:
https://github.com/PhilippeSigaud/Pegged/wiki/Semantic-Actions#expression-level-or-rule-level-actions

There is no real difference between

Rule1 <- Expr1 {Action}

and

Rule1 <{Action} Expr1

For my project, I created a semantic action called "subsume" which basically subsumes a node with a single child into its parent. The intent was to help simplify the parse tree into something closer to an abstract syntax tree for later processing.

While working on this, I found that the rule-level semantic actions were simply not being applied no matter what I tried. Eventually I converted the logic to use expression-level semantic actions, and the behavior changed quite a bit (it works perfectly).

Based on the documentation, I believe this is a bug, and I wanted to share clear reproduction steps to demonstrate it. The following is a minimal program depending on Pegged 0.4.4 using the Arithmetic grammar from the Pegged documentation to demonstrate the difference, using the "Primary" rule.

import std.stdio;
import pegged.grammar;

// Arithmetic1 applies "subsume" to Primary using Expression-Level Semantic Actions.
mixin(grammar(`
Arithmetic1:
    Expr     < Factor AddExpr*
    AddExpr  < ('+'/'-') Factor
    Factor   < Primary {subsume} MulExpr*
    MulExpr  < ('*'/'/') Primary {subsume}
    Primary  < '(' Expr ')' / Number / Variable / '-' Primary {subsume}

    Number   < [0-9]+
    Variable < identifier
`));

// Arithmetic2 applies "subsume" to Primary using Rule-Level Semantic Actions.
mixin(grammar(`
Arithmetic2:
    Expr     < Factor AddExpr*
    AddExpr  < ('+'/'-') Factor
    Factor   < Primary MulExpr*
    MulExpr  < ('*'/'/') Primary
    Primary  <{subsume} '(' Expr ')' / Number / Variable / '-' Primary

    Number   < [0-9]+
    Variable < identifier
`));

// Replaces a single-child parse tree node with its only child.
PT subsume(PT)(PT tree) {
  return tree.children[0];
}

void main()
{
  writeln("Arithmetic1 uses Semantic Action 'subsume' every time Primary is used.");
  auto parseTree1 = Arithmetic1(`3 * 5`);
  writeln(parseTree1);
  // Output:
  // Arithmetic1 [0, 5]["3", "*", "5"]
  //  +-Arithmetic1.Expr [0, 5]["3", "*", "5"]
  //     +-Arithmetic1.Factor [0, 5]["3", "*", "5"]
  //        +-Arithmetic1.Number [0, 2]["3"]
  //        +-Arithmetic1.MulExpr [2, 5]["*", "5"]
  //           +-Arithmetic1.Number [4, 5]["5"]

  writeln("Arithmetic2 uses Semantic Action 'subsume' on the Primary rule itself.");
  auto parseTree2 = Arithmetic2(`3 * 5`);
  writeln(parseTree2);
  // Output:
  // Arithmetic2 [0, 5]["3", "*", "5"]
  //  +-Arithmetic2.Expr [0, 5]["3", "*", "5"]
  //     +-Arithmetic2.Factor [0, 5]["3", "*", "5"]
  //        +-Arithmetic2.Primary [0, 2]["3"]
  //        |  +-Arithmetic2.Number [0, 2]["3"]
  //        +-Arithmetic2.MulExpr [2, 5]["*", "5"]
  //           +-Arithmetic2.Primary [4, 5]["5"]
  //              +-Arithmetic2.Number [4, 5]["5"]
}

The expected behavior was that Arithmetic2, using rule-level semantic actions, would produce the same output as Arithmetic1.

I think you misunderstood the documentation. For both

Rule1 <- Expr1 {Action}

and

Rule1 <{Action} Expr1

Action acts on the entire Expr1. You used the second form expecting the action to act on Rule1, which is not what the documentation shows. It may become clear when you print the input of the action: https://run.dlang.io/is/bbpdrx

import std.stdio;
import pegged.grammar;

mixin(grammar(`
Arithmetic4:
    Expr     < Factor AddExpr*
    AddExpr  < ('+'/'-') Factor
    Factor   < Primary MulExpr*
    MulExpr  < ('*'/'/') Primary
    Primary  <{ruleAction} '(' Expr ')' / Number / Variable / '-' Primary

    Number   < [0-9]+
    Variable < identifier
`));

PT ruleAction(PT)(PT tree) {
  writeln("input:\n", tree);
  tree.children[0].name = tree.children[0].name ~ "(rule_action applied)";
  return tree;
}

void main()
{
  auto parseTree4 = Arithmetic4(`3 * 5`);
  writeln(parseTree4);
  // Output:
  // input:
  // or!(and!(literal, Expr, literal), Arithmetic4.Number, Arithmetic4.Variable, and!(literal, Primary)) [0, 2]["3"]
  //  +-Arithmetic4.Number [0, 2]["3"]
  //     +-oneOrMore!(wrapAround!(spacing, charRange!('0','9'), spacing)) [0, 2]["3"]
  //        +-charRange!('0','9') [0, 2]["3"]
  // 
  // input:
  // or!(and!(literal, Expr, literal), Arithmetic4.Number, Arithmetic4.Variable, and!(literal, Primary)) [4, 5]["5"]
  //  +-Arithmetic4.Number [4, 5]["5"]
  //     +-oneOrMore!(wrapAround!(spacing, charRange!('0','9'), spacing)) [4, 5]["5"]
  //        +-charRange!('0','9') [4, 5]["5"]
  // 
  // Arithmetic4 [0, 5]["3", "*", "5"]
  //  +-Arithmetic4.Expr [0, 5]["3", "*", "5"]
  //     +-Arithmetic4.Factor [0, 5]["3", "*", "5"]
  //        +-Arithmetic4.Primary [0, 2]["3"]
  //        |  +-Arithmetic4.Number(rule_action applied) [0, 2]["3"]
  //        +-Arithmetic4.MulExpr [2, 5]["*", "5"]
  //           +-Arithmetic4.Primary [4, 5]["5"]
  //              +-Arithmetic4.Number(rule_action applied) [4, 5]["5"]
}

As you can see from the added (rule_action applied) in the final parse tree, the action is actually applied. But once tree decimation has done its job, it makes no difference whether the action returns the original tree (the or!(and!(...), ...) node) or its first child (Number).

Note that you can get a parse tree of the same size as you seek (but with different names) by dropping the children of Primary: https://run.dlang.io/is/lBMbYF

/+dub.sdl:
dependency "pegged" version="~>0.4.4"
+/
import std.stdio;
import pegged.grammar;

// Arithmetic3 drops the children of Primary.
mixin(grammar(`
Arithmetic3:
    Expr     < Factor AddExpr*
    AddExpr  < ('+'/'-') Factor
    Factor   < Primary MulExpr*
    MulExpr  < ('*'/'/') Primary
    Primary  <; '(' Expr ')' / Number / Variable / '-' Primary

    Number   < [0-9]+
    Variable < identifier
`));

void main()
{
  writeln("Arithmetic3 drops the children of Primary.");
  auto parseTree3 = Arithmetic3(`3 * 5`);
  writeln(parseTree3);
  // Output:
  // Arithmetic3 [0, 5]["3", "*", "5"]
  //  +-Arithmetic3.Expr [0, 5]["3", "*", "5"]
  //     +-Arithmetic3.Factor [0, 5]["3", "*", "5"]
  //        +-Arithmetic3.Primary [0, 2]["3"]
  //        +-Arithmetic3.MulExpr [2, 5]["*", "5"]
  //           +-Arithmetic3.Primary [4, 5]["5"]
}

Bastiaan.

If I understand this correctly, the RuleAction actually is being applied to the entire rule, but the rule is not what I originally thought it was, it wasn't just the matching part of Primary, e.g. Number or Variable. Rather, the rule itself is a tree with a root 'or' element and a child 'and' element.

This means that the Semantic Action is run BEFORE tree decimation. That means that in order to get the result I want, I should basically do something similar to what tree decimation does (which internally strips the 'and' and 'or' elements while preserving their children).

Correct? Also, thanks for your insight, it really saves a LOT of time on my part, this would not have occurred to me so quickly otherwise!

If I understand this correctly, the RuleAction actually is being applied to the entire rule, but the rule is not what I originally thought it was, it wasn't just the matching part of Primary, e.g. Number or Variable.

Correct.

Rather, the rule itself is a tree with a root 'or' element and a child 'and' element.

Not quite. In

or!(and!(literal, Expr, literal), Arithmetic4.Number, Arithmetic4.Variable, and!(literal, Primary))

and is not a child node of or; rather, the and!(...) parser is nested inside the or!(...) parser. The whole expression is simply the rule '(' Expr ')' / Number / Variable / '-' Primary expressed in D (semi) code. For example, and!(literal, Expr, literal) corresponds to the sequence '(' Expr ')', and or!(...) corresponds to the choice between the alternatives separated by /. If you output the generated parser before mixing it in, you will see a shipload of generated parser functions that call the built-in parser functions and, or, literal, etc.

The first child of the whole expression is indeed Arithmetic4.Number as is shown when you print the input of the action.
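To look at the generated parser mentioned above, one option is to print the string returned by grammar() instead of mixing it in. The following is a minimal sketch, assuming Pegged 0.4.4 run as a single-file dub program; the names arithmeticSource and generatedParser are made up for this illustration.

```d
/+dub.sdl:
dependency "pegged" version="~>0.4.4"
+/
import std.stdio;
import pegged.grammar;

// The Arithmetic grammar from this thread, without any semantic action.
// (arithmeticSource is a hypothetical name used only in this sketch.)
enum arithmeticSource = `
Arithmetic:
    Expr     < Factor AddExpr*
    AddExpr  < ('+'/'-') Factor
    Factor   < Primary MulExpr*
    MulExpr  < ('*'/'/') Primary
    Primary  < '(' Expr ')' / Number / Variable / '-' Primary

    Number   < [0-9]+
    Variable < identifier
`;

// grammar() returns the generated parser as a string of D code.
// (generatedParser is a hypothetical helper used only in this sketch.)
string generatedParser()
{
  return grammar(arithmeticSource);
}

void main()
{
  // Printing the code instead of mixing it in shows how each rule is
  // assembled from the built-in parsers such as and, or and literal.
  writeln(generatedParser());
}
```

The output is long, but searching it for Primary should lead to the nested expression that appears as the node name in the action input shown earlier.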

This means that the Semantic Action is run BEFORE tree decimation. That means that in order to get the result I want, I should basically do something similar to what tree decimation does (which internally strips the 'and' and 'or' elements while preserving their children). Correct?

Yes. I think decimation happens quite late, because you don't want to do work on nodes (sub trees) that fail.

Also, thanks for your insight, it really saves a LOT of time on my part, this would not have occurred to me so quickly otherwise!

You're welcome. It took me a while to see that it was not a bug though, feel free to add more clarity to the documentation (it's a wiki).

I'll work through this more after work, come up with a few good examples, and then update the wiki to cover this relation between decimation and semantic actions in a clear and concise way.

Sounds good.

Maybe the confusion stems from a false similarity between < and assignment =. In assignment, the left-hand side and right-hand side are basically the same, in that you can manipulate the left-hand side by manipulating the right-hand side. But in PEG A <{action} B C D, B C D become the children of A, and you cannot manipulate A with action to the extent you want (remove all its children) because action only acts on B C D, at a point where A has not come into existence yet.

Oh, I didn't realize that. This means my idea of doing something similar to decimation won't work at all. So it seems that the only way to replace a rule with the output of a semantic action is to list that semantic action with every reference to that rule, it cannot be done with a semantic action on the rule itself.

You can do something similar to decimation, such as manipulating the final parse tree before it is used:

PT collapsed(PT)(PT tree) {
 // ...
}

auto parseTree = Arithmetic(`3 * 5`).collapsed;
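One possible way to fill in such a function is sketched below. It is only an illustration under stated assumptions, not part of Pegged: the body of collapsed, the choice to collapse only Primary nodes, and the grammar name Arithmetic are all assumptions made for this example. It relies on the name, matches and children fields of Pegged's ParseTree.

```d
/+dub.sdl:
dependency "pegged" version="~>0.4.4"
+/
import std.algorithm : endsWith, map;
import std.array : array;
import std.stdio;
import pegged.grammar;

mixin(grammar(`
Arithmetic:
    Expr     < Factor AddExpr*
    AddExpr  < ('+'/'-') Factor
    Factor   < Primary MulExpr*
    MulExpr  < ('*'/'/') Primary
    Primary  < '(' Expr ')' / Number / Variable / '-' Primary

    Number   < [0-9]+
    Variable < identifier
`));

// Hypothetical helper: returns a copy of the tree in which every Primary
// node that merely wraps a single child is replaced by that child.
ParseTree collapsed(ParseTree tree)
{
  // Collapse the subtrees first. Building a new array leaves the
  // children of the original tree untouched.
  tree.children = tree.children.map!(c => collapsed(c)).array;

  // Only collapse when the child carries the same matches, so that a
  // Primary such as "-x" or "(x)" keeps its sign or parentheses.
  if (tree.name.endsWith(".Primary")
      && tree.children.length == 1
      && tree.children[0].matches == tree.matches)
    return tree.children[0];
  return tree;
}

void main()
{
  auto parseTree = Arithmetic(`3 * 5`).collapsed;
  writeln(parseTree);
}
```

Because this runs on the final, already decimated tree, the Primary nodes exist by the time they are inspected, which is exactly what a rule-level action cannot rely on. For the input 3 * 5 the result should have the same shape as the Arithmetic1 tree, with Number nodes directly under Factor and MulExpr.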

But subsuming with a rule-level action is not possible, no. Either use expression-level actions like you did or drop children using ; like I did. The advantage of dropping children over subsuming or collapsing is that a parent can only have children of the kinds that occur in its rule. If you subsume, you need to be prepared to handle all kinds of node types while you traverse children, because what can and cannot occur as a child may not always be easy to see. Whether the latter is a real disadvantage depends, of course, on how you use the parse tree and on its complexity.

Maybe Pegged could be extended with a subsuming production (like << instead of <) but because of the above, I'm not sure it would be worth it.