otac0n / Pegasus

A PEG parser generator for .NET that integrates with MSBuild and Visual Studio.

Home Page: http://otac0n.com/Pegasus/

I was hoping V4.0 would improve the memory/performance problems I encountered, but to my disappointment they are much worse now :-(

jbovensa opened this issue

I am using a very big PEG file, which I can send to you.
As far as I understand, it has a lot of left recursion (I did not write it).

With version 3.1 I could get as far as parsing a 10-word sentence,
but beyond that I would either get an OutOfMemoryException (if I chose to memoize all the rules) or the parser would run forever.

With version 4.0:
a. I had to remove some rules from the PEG because the compiler complained about ambiguous left recursion.
b. After I managed to get it to compile, I ran a 3-word sentence and got a memory overflow,
and without the memoization it just ran forever.

Please, please, please, help me.

I have been trying to work properly with this PEG file for ages.
I know it's linear-time parseable because a Java parser for the same file exists.

Can you send the grammar file? You can send it to my email, john@gietzen.us, if you prefer. I'll be happy to take a look.

I'm still looking into this.

I see that you are trying to create a Lojban parser. I created unit tests based on the original Lojban grammar at https://github.com/mhagiwara/camxes.js/blob/master/camxes.js.peg, and found that the grammar doesn't actually compile in this form.

I get these errors:

  • error PEG0023: The rule 'erasable_clause' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
  • error PEG0023: The rule 'bu_clause_no_pre' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
  • error PEG0023: The rule 'pre_zei_bu' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
  • error PEG0023: The rule 'si_clause' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
  • error PEG0023: The rule 'si_word' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
  • error PEG0023: The rule 'zei_clause_no_pre' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)

Can you help me resolve these issues? Would you be willing to look at the grammar? (You can find it here)

Hi @jbovensa, I believe I have found and fixed the issues.

First of all, this rule was causing Pegasus to fail to compile as we discussed above:

cmene <object> = expr:(!h &consonant_final coda? (any_syllable / digit)* &pause) { new { label = "cmene", arg = _join(expr) } }

This is because the rule can succeed with a zero-width match (every element in it is either a lookahead, optional, or repeated zero or more times), and other rules were not expecting this. After familiarizing myself a bit with the grammar, I believe the intent was to require at least one iteration of the (any_syllable / digit) expression. The fix is to replace the * with a +, as here:

cmene <object> = expr:(!h &consonant_final coda? (any_syllable / digit)+ &pause) { new { label = "cmene", arg = _join(expr) } }
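To illustrate why the `*` version can succeed without consuming any input while the `+` version cannot, here is a minimal sketch using toy PEG combinators. It is written in Python rather than Pegasus grammar syntax, all helper names (`lit`, `seq`, `opt`, `star`, `plus`, `and_pred`) are made up for this example, and the single-character stand-ins only loosely mirror the shape of the cmene rule:

```python
# Toy PEG combinators (hypothetical helpers, not Pegasus APIs).
# A parser takes (text, pos) and returns the new position, or None on failure.

def lit(s):
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def seq(*parsers):
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse

def opt(p):
    # p? : always succeeds, consuming input only if p matches
    def parse(text, pos):
        nxt = p(text, pos)
        return pos if nxt is None else nxt
    return parse

def star(p):
    # p* : zero or more repetitions, so it can succeed with no progress
    def parse(text, pos):
        while True:
            nxt = p(text, pos)
            if nxt is None or nxt == pos:  # stop on failure or no progress
                return pos
            pos = nxt
    return parse

def plus(p):
    # p+ : one mandatory match followed by p*
    return seq(p, star(p))

def and_pred(p):
    # &p : succeeds if p matches, but never consumes input
    def parse(text, pos):
        return pos if p(text, pos) is not None else None
    return parse

# Stand-ins (assumptions): "c" plays the optional coda, "a" a syllable,
# and "." the pause that is only looked at, not consumed.
with_star = seq(opt(lit("c")), star(lit("a")), and_pred(lit(".")))
with_plus = seq(opt(lit("c")), plus(lit("a")), and_pred(lit(".")))

print(with_star(".", 0))     # 0: succeeds without consuming anything
print(with_plus(".", 0))     # None: at least one "a" is now required
print(with_plus("caa.", 0))  # 3
```

The first call shows the zero-width success: the parser reports a match but ends at the position where it started. With `+`, that same input is rejected, so a successful match always consumes at least one character.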

In addition, the cmavo rule was being invoked very frequently and needed to be memoized. Without memoization, the test below took more than 9 minutes; with memoization, the parse takes about a quarter of a second.
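As a rough illustration of why memoizing one frequently invoked rule can matter so much, the sketch below counts how often a rule body runs when several alternatives all start by trying it at the same position. This is a concept sketch in Python, not the code Pegasus generates (in a Pegasus grammar, memoization is requested per rule with the `-memoize` flag), and the `word` rule and the three alternatives are invented for the example:

```python
def counting_word_rule():
    """Return a toy 'word' rule plus a counter of how often its body runs."""
    counter = {"calls": 0}

    def word(text, pos):
        counter["calls"] += 1
        end = pos
        while end < len(text) and text[end].isalpha():
            end += 1
        return end if end > pos else None  # fail on an empty match

    return word, counter

def memoized(rule):
    """Cache results by (input, position), as a packrat parser would."""
    cache = {}

    def wrapper(text, pos):
        key = (text, pos)
        if key not in cache:
            cache[key] = rule(text, pos)
        return cache[key]

    return wrapper

def three_alternatives(word, text, pos):
    # Each alternative re-tries the same rule at the same position
    # before checking its own suffix, like an ordered choice whose
    # branches share a common prefix.
    for suffix in ("!", "?", "."):
        end = word(text, pos)
        if end is not None and text.startswith(suffix, end):
            return end + len(suffix)
    return None

plain, plain_counter = counting_word_rule()
inner, memo_counter = counting_word_rule()
cached = memoized(inner)

plain_result = three_alternatives(plain, "tatpi.", 0)
cached_result = three_alternatives(cached, "tatpi.", 0)

print(plain_result, cached_result)                    # 6 6
print(plain_counter["calls"], memo_counter["calls"])  # 3 1
```

Both versions produce the same parse result, but the unmemoized rule body runs once per alternative. When such repeated attempts are nested inside other backtracking rules, the repeated work multiplies, which is consistent with the large difference in run time reported above. The trade-off is memory: every cached entry is kept for the duration of the parse.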

Please see my updated LojbanGrammar.peg

Parse_WhenUsingLojbhan_DoesntTimeOut("la .alis. co'a tatpi lo nu zutse lo rirxe korbi re'o lo mensi gi'e zukte fi no da")
initialTime: 260ms:
baseTime: 254ms:
warmupSamples: 1
warmupMean: 257ms:
testSamples: 30
testMean: 257ms±12ms:

I've started updating the LojbanGrammar.peg file to return the sentence structure as an example.

Here's the updated grammar along with the output of a sample program:
https://gist.github.com/otac0n/63d8fae45c551c4e8d41c83c53afc17e#file-output

You can see from the output that the parser can derive the sentence structure, but there are still many rules that need to be updated to return something more useful than a string. My approach would be to create strongly typed classes for each of the language features, to replace the anonymous types I have added. This will provide the best experience when working with the parsed data. If you would really prefer to keep the anonymous types, you can use dynamic instead.
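The difference between the two approaches can be sketched in a language-neutral way. The example below uses Python rather than C#, with a dictionary standing in for the anonymous type and a dataclass standing in for a strongly typed class. The class name `Cmene`, its `text` field, and the `describe` helper are hypothetical and are not part of the updated grammar:

```python
from dataclasses import dataclass

# Untyped shape, loosely mirroring new { label = "cmene", arg = ... }.
untyped = {"label": "cmene", "arg": "alis"}

# A misspelled key is not caught until (or unless) the value is used.
print(untyped.get("args"))  # None

@dataclass(frozen=True)
class Cmene:
    """Hypothetical typed node for the result of the cmene rule."""
    text: str

def describe(node):
    # Consumers can dispatch on the node type instead of comparing labels.
    if isinstance(node, Cmene):
        return f"cmene({node.text})"
    raise TypeError(f"unsupported node: {node!r}")

print(describe(Cmene(text="alis")))  # cmene(alis)
```

With typed nodes, a misspelled member name is reported by tooling (or, in C#, by the compiler), and code that consumes the parse tree can switch on the node type rather than on a label string.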

This is beyond my understanding of the Lojban language, so I don't think I can continue updating the parser, but I hope this helps you get the information you were looking for.