I was hoping V4.0 would improve the memory/performance problems I encountered, but to my disappointment they are much worse now :-(

Question

jbovensa opened this issue 8 years ago · comments

I am using a very large PEG file, which I can send to you.
As far as I understand, it contains a lot of left recursion (I did not write it).

With version 3.1 I could get as far as parsing a 10-word sentence,
but beyond that I would either get an OutOfMemoryException (if I chose to memoize all the rules) or the parser would run forever.

With version 4.0:
a. I had to remove some rules from the PEG because the compiler complained about ambiguous left recursion.
b. After I managed to get it to compile, I ran a 3 word sentence and got a memory overflow,
and without the memoization it just ran forever.

Please, please, please, help me.

I have been trying to work properly with this PEG file for ages.
I know it's linear-time parseable because a Java parser built from the file exists.

John Gietzen · Answer 1 · Thu Feb 09 2017 14:09:11 GMT+0800 (China Standard Time)

Can you send the grammar file? You can send it to my email, john@gietzen.us, if you prefer. I'll be happy to take a look.

John Gietzen · Answer 2 · Sun Feb 26 2017 04:33:25 GMT+0800 (China Standard Time)

I'm still looking into this.

jbovensa · Answer 3 · Sun Feb 26 2017 14:21:12 GMT+0800 (China Standard Time)

Thank you ever so much. I'm really depending on your help. Vensa


jbovensa · Answer 4 · Mon Mar 13 2017 17:19:47 GMT+0800 (China Standard Time)

Hello, Any luck understanding the problem? Thanks, Vensa


jbovensa · Answer 5 · Sun Apr 23 2017 22:03:30 GMT+0800 (China Standard Time)

Thanks. Could you please just let me know if you are still looking into this? Vensa


John Gietzen · Answer 6 · Mon Apr 16 2018 13:12:55 GMT+0800 (China Standard Time)

I see that you are trying to create a Lojban parser. I created unit tests (https://github.com/otac0n/Pegasus/blob/lojban-tests/Pegasus.Tests/RegressionTests.cs#L191), based on the original Lojban grammar at https://github.com/mhagiwara/camxes.js/blob/master/camxes.js.peg, and found that the grammar doesn't actually compile in this form.

I get these errors:

error PEG0023: The rule 'erasable_clause' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
error PEG0023: The rule 'bu_clause_no_pre' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
error PEG0023: The rule 'pre_zei_bu' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
error PEG0023: The rule 'si_clause' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
error PEG0023: The rule 'si_word' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)
error PEG0023: The rule 'zei_clause_no_pre' is ambiguously left-recursive. (erasable_clause, bu_clause_no_pre, pre_zei_bu, si_clause, si_word, zei_clause_no_pre)

Can you help me resolve these issues? Would you be willing to look at the grammar? (You can find it here)

jbovensa · Answer 7 · Mon Apr 16 2018 18:23:02 GMT+0800 (China Standard Time)

Hi John, After not hearing from anybody about this issue for more than a year, I was so happy to hear today (which happens to be my birthday) that someone is working on it. It would make me so happy to see this issue resolved so that I can keep working on my project. As far as I remember, I got the same errors and simply omitted the offending clauses from the PEG file, just for the sake of debugging. However, the previous version of Pegasus did not give these errors, so maybe you can check what changed in the behavior. Thank you ever so much for your help, vensa


John Gietzen · Answer 8 · Mon Apr 23 2018 12:42:58 GMT+0800 (China Standard Time)

Hi @jbovensa, I believe I have found and fixed the issues.

First of all, this rule was causing Pegasus to fail to compile as we discussed above:

cmene <object> = expr:(!h &consonant_final coda? (any_syllable / digit)* &pause) { new { label = "cmene", arg = _join(expr) } }

This is because this rule might be zero width, and other rules were not expecting this. After familiarizing myself a bit with the grammar, I believe the intent was to require at least a single iteration of the (any_syllable / digit) expression. The fix is to switch the * for + as here:

cmene <object> = expr:(!h &consonant_final coda? (any_syllable / digit)+ &pause) { new { label = "cmene", arg = _join(expr) } }
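The zero-width reasoning above can be sketched in a few lines. This is only an illustration in Python, not Pegasus's actual analysis: the tuple encoding, the `consumes` helper and the `cmene_body` function are hypothetical, and the sketch assumes that `h`, `consonant_final`, `coda`, `any_syllable`, `digit` and `pause` each consume at least one character when they match.

```python
def consumes(name):
    """Reference to a rule that is ASSUMED to always consume input (hypothetical)."""
    return ("consume", name)


def nullable(expr):
    """Return True if the expression can succeed without consuming any input."""
    kind = expr[0]
    if kind in ("and", "not"):      # &e and !e are lookaheads: always zero width
        return True
    if kind in ("opt", "star"):     # e? and e* may match nothing
        return True
    if kind == "plus":              # e+ is zero width only if e itself can be
        return nullable(expr[1])
    if kind == "seq":               # a sequence is zero width if every part can be
        return all(nullable(e) for e in expr[1:])
    if kind == "choice":            # a choice is zero width if any branch can be
        return any(nullable(e) for e in expr[1:])
    if kind == "consume":
        return False
    raise ValueError(f"unknown expression kind: {kind}")


def cmene_body(repeat):
    """Body of the cmene rule, with 'star' or 'plus' on the repeated group."""
    return (
        "seq",
        ("not", consumes("h")),
        ("and", consumes("consonant_final")),
        ("opt", consumes("coda")),
        (repeat, ("choice", consumes("any_syllable"), consumes("digit"))),
        ("and", consumes("pause")),
    )


print(nullable(cmene_body("star")))  # True: the original rule can match nothing
print(nullable(cmene_body("plus")))  # False: the fixed rule must consume input
```

If a rule that can match nothing sits at the start of another rule, whatever follows it is effectively in leftmost position, which is one way a grammar can end up left-recursive without looking like it.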

In addition, it seems that the cmavo rule was invoked very, very frequently and needed to be memoized. Without this memoization, the test below would take > 9 minutes, but with memoization the parse takes about a quarter of a second.
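To give a feel for why memoizing one hot rule can change the runtime so drastically, here is a minimal sketch of a backtracking PEG-style parser in Python. It is not the Lojban grammar and not how Pegasus is implemented; the toy grammar, the `Parser` class and `count_term_calls` are all made up for illustration. The only point is that caching results per (rule, position) stops the same rule from being re-run at the same input position.

```python
class Parser:
    """Toy backtracking parser for:
         expr = term '+' expr / term '-' expr / term
         term = '(' expr ')' / [0-9]
    Each rule takes a position and returns the end position, or None on failure.
    """

    def __init__(self, text, memoize):
        self.text = text
        self.memoize = memoize
        self.cache = {}   # (rule name, position) -> result
        self.calls = 0    # how many times the body of 'term' actually ran

    def _apply(self, name, rule, pos):
        if not self.memoize:
            return rule(pos)
        key = (name, pos)
        if key not in self.cache:
            self.cache[key] = rule(pos)
        return self.cache[key]

    def expr(self, pos):
        return self._apply("expr", self._expr, pos)

    def term(self, pos):
        return self._apply("term", self._term, pos)

    def _expr(self, pos):
        # Ordered choice: each alternative re-parses 'term' from the same position.
        for op in "+-":
            end = self.term(pos)
            if end is not None and self.text[end:end + 1] == op:
                rest = self.expr(end + 1)
                if rest is not None:
                    return rest
        return self.term(pos)

    def _term(self, pos):
        self.calls += 1
        if self.text[pos:pos + 1] == "(":
            end = self.expr(pos + 1)
            if end is not None and self.text[end:end + 1] == ")":
                return end + 1
            return None
        if self.text[pos:pos + 1].isdigit():
            return pos + 1
        return None


def count_term_calls(depth, memoize):
    """Parse '1' wrapped in `depth` pairs of parentheses; return (end, term calls)."""
    parser = Parser("(" * depth + "1" + ")" * depth, memoize)
    return parser.expr(0), parser.calls


print(count_term_calls(10, memoize=True))   # (21, 11)
print(count_term_calls(10, memoize=False))  # same end position, vastly more calls
```

Without the cache, every nesting level triples the work, so the number of `term` invocations grows exponentially with depth. With the cache, `term` runs once per position. The trade-off is memory for the cache, which is why memoizing every rule can run out of memory while memoizing only the frequently repeated ones may not.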

Please see my updated LojbanGrammar.peg (https://github.com/otac0n/Pegasus/blob/develop/Pegasus.Tests/TestCases/LojbanGrammar.peg).

Parse_WhenUsingLojbhan_DoesntTimeOut("la .alis. co'a tatpi lo nu zutse lo rirxe korbi re'o lo mensi gi'e zukte fi no da")
initialTime: 260ms:
baseTime: 254ms:
warmupSamples: 1
warmupMean: 257ms:
testSamples: 30
testMean: 257ms±12ms:

jbovensa · Answer 9 · Mon Apr 23 2018 16:39:55 GMT+0800 (China Standard Time)

Hi John, Thank you ever so much for the time you invested in this matter. I successfully ran your PEG with the demo sentence. The problem is that I need to have access to the different terms in the sentence, so I needed to mark all the terms as -lexical. I might not need ALL of them in the end, but in the meantime I don't know which ones I can drop. Anyway, this addition made the parse take longer and eventually caused an OutOfMemoryException. Is there any way to get the sentence tree AND not cause this problem? Thanks again, vensa


John Gietzen · Answer 10 · Tue Apr 24 2018 23:57:30 GMT+0800 (China Standard Time)

I've started updating the LojbanGrammar.peg file to return the sentence structure as an example.

Here's the updated grammar along with the output of a sample program:
https://gist.github.com/otac0n/63d8fae45c551c4e8d41c83c53afc17e#file-output

You can see that the parser can derive the sentence structure from the output, but there are still many rules that need to be updated to return something more useful than a string. My approach would be to create strongly typed classes for each of the language features, to replace the anonymous types I have added. This will provide the best experience when working with the parsed data. You can instead use dynamic if you really would prefer to use anonymous types.
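As a rough analogy for the typed-classes suggestion above, here is a small Python sketch. The grammar itself emits C# anonymous types, so this is not the code you would write for Pegasus; the `Cmene` class and the `describe` function are hypothetical names used only to contrast a loosely shaped value with a typed node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cmene:
    """Hypothetical typed node for the result of the 'cmene' rule."""
    text: str


def describe(node):
    """Consumers can dispatch on the node type instead of inspecting a label string."""
    if isinstance(node, Cmene):
        return f"name: {node.text}"
    raise TypeError(f"unsupported node type: {type(node).__name__}")


# Loosely shaped result, similar in spirit to new { label = "cmene", arg = ... }
loose = {"label": "cmene", "arg": ".alis."}

# The same information as a typed node
typed = Cmene(text=loose["arg"])
print(describe(typed))  # name: .alis.
```

With typed nodes, a misspelled field or an unexpected node kind fails loudly at the point of use, whereas a label-plus-argument shape only fails when some consumer happens to look at the wrong key.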

I'm beyond my understanding of the Lojban language, so I don't think I can continue updating the parser, but I hope this helps you get the information you were looking for.

jbovensa · Answer 11 · Wed Apr 25 2018 21:55:16 GMT+0800 (China Standard Time)

Thanks John, I can't say that I understand what you did there. But in the meantime I have found a way around my problems. I added the -lexical tag only to some of the terms. And it works. So, for now I'm satisfied. Thanks again for all your help. Vensa
