Cannot work with newlines with latest parsing update

Question

LogicalChaos opened this issue 10 years ago

I forked the project to implement a Perforce parser. I had it completed, but after merging your latest changes, which split the parsing into chunks, I'm unable to get it functioning again. I've found that if I remove all newlines except those between change sets, I can get it parsing again. But that involves quite a bit of data massaging to clean up the Perforce log. Do you have any suggestions? If you look at my perforce branch, you can see my grammar.

Adam Tornhill · Answer 1 · Fri Jan 23 2015 05:16:43 GMT+0800 (China Standard Time)

Sounds cool with a Perforce parser - would definitely be a good addition.
First, some background on my latest change: Instaparse is quite memory hungry. When I parsed the complete grammar in one pass, Code Maat ran out of memory on larger logfiles. That's why I chose to split the log into smaller parts and feed those to Instaparse one by one (you'd run into the same problem with your current Perforce parser).

I see that you re-used the hiccup-based-parser. As you probably noticed, that's the one that does the chunking. In the current version, I split the log on each blank line (see the function extend-when-complete). That works fine for both Git and Mercurial, which don't have any blank lines within their entries. But it won't work for Perforce, which includes several blank lines in each entry.
I'd suggest that you identify a different criterion that's capable of identifying the end of a Perforce entry. Then you have to parameterize the hiccup-based-parser with that criterion (end-of-log-entry? perhaps).
Does that sound reasonable?
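
The parameterization suggested above could look roughly like the sketch below. It is an illustration in Python, not Code Maat's Clojure code: `chunk_entries`, `blank_line_ends_entry`, and `next_line_starts_change` are hypothetical names, and the Perforce criterion assumes that every entry begins with a line starting with `Change ` (description lines in `p4 describe` output are indented, so they should not match).

```python
from typing import Callable, Iterable, Iterator, List

# An end-of-entry criterion sees the current line and the line that follows it.
EndOfEntry = Callable[[str, str], bool]


def chunk_entries(lines: Iterable[str], end_of_entry: EndOfEntry) -> Iterator[List[str]]:
    """Yield one list of lines per log entry so each entry can be parsed on its own."""
    entry: List[str] = []
    it = iter(lines)
    current = next(it, None)
    while current is not None:
        upcoming = next(it, None)
        entry.append(current)
        # The last line of the input always closes the entry in progress.
        if upcoming is None or end_of_entry(current, upcoming):
            yield entry
            entry = []
        current = upcoming


def blank_line_ends_entry(current: str, upcoming: str) -> bool:
    """Git/Mercurial-style criterion: a blank line closes the entry."""
    return current.strip() == ""


def next_line_starts_change(current: str, upcoming: str) -> bool:
    """Assumed Perforce criterion: the entry ends just before the next 'Change ' header."""
    return upcoming.startswith("Change ")
```

With a criterion like `next_line_starts_change`, blank lines inside a Perforce entry stay inside that entry, and each chunk can still be handed to the parser separately to keep memory use bounded.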

Robert Creager · Answer 2 · Sat Jan 24 2015 03:54:21 GMT+0800 (China Standard Time)

What you said makes perfect sense, but is beyond me :-) I changed the log generation to make it consistent with the others regarding blank lines: ... | xargs -I commitid -n1 sh -c 'p4 describe -s commitid | grep -v "^\s*$" && echo "'. If you're up for it (I'd need major help), I'd like to add churn capabilities. Perforce adds the following to the output for each change set described.

Differences ...
==== //depot/project/Command.cpp#9 (text) ====
add 1 chunks 10 lines
deleted 0 chunks 0 lines
changed 0 chunks 0 / 0 lines

Thoughts?
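
As a starting point for the churn idea, the per-file summary shown above could be read with something like the following sketch. It is Python for illustration only, not part of Code Maat; `parse_churn` and the field names are hypothetical, and treating the two numbers on the `changed` line as line counts before and after the change is an assumption.

```python
import re
from typing import Dict, Iterable

# Patterns modeled on the sample output quoted in this thread.
FILE_HEADER = re.compile(r"^==== (?P<path>.+?)#(?P<rev>\d+) \((?P<type>[^)]*)\) ====$")
ADDED = re.compile(r"^add (\d+) chunks (\d+) lines$")
DELETED = re.compile(r"^deleted (\d+) chunks (\d+) lines$")
CHANGED = re.compile(r"^changed (\d+) chunks (\d+) / (\d+) lines$")


def parse_churn(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Collect line counts per depot path from a 'Differences ...' section."""
    churn: Dict[str, Dict[str, int]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        header = FILE_HEADER.match(line)
        if header:
            current = churn.setdefault(
                header.group("path"),
                {"added": 0, "deleted": 0, "changed_before": 0, "changed_after": 0},
            )
            continue
        if current is None:
            continue  # skip anything before the first file header
        match = ADDED.match(line)
        if match:
            current["added"] += int(match.group(2))
            continue
        match = DELETED.match(line)
        if match:
            current["deleted"] += int(match.group(2))
            continue
        match = CHANGED.match(line)
        if match:
            # Assumption: "X / Y" means X lines before and Y lines after the change.
            current["changed_before"] += int(match.group(2))
            current["changed_after"] += int(match.group(3))
    return churn
```

For the sample above, this would attribute 10 added lines and no deleted or changed lines to //depot/project/Command.cpp.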

Robert Creager · Answer 3 · Sat Jan 24 2015 04:09:33 GMT+0800 (China Standard Time)

Hmmm... With the new parser, I'm getting Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, which is obviously different from the previous OOM problems. This is with change logs that failed previously. I'm looking to see if I can pinpoint what in the data is causing this. I suspect it's a change set with ~26000 files associated with it from a copy/merge operation.

Adam Tornhill · Answer 4 · Sun Jan 25 2015 00:31:12 GMT+0800 (China Standard Time)

I've seen the GC overhead limit exception as well on the earlier version of Code Maat before the memory optimization. Did you manage to get the chunking working now? That should solve this issue as well.
I'll have a look at your pull request next week - thanks for the contribution!

Robert Creager · Answer 5 · Sun Jan 25 2015 02:20:04 GMT+0800 (China Standard Time)

Yes, I got the chunking working. The problem occurs with a log of ~1400 change lists (35k lines) when any individual change list goes over ~50 files. I can privately send you a problem file if you want.
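
One way to narrow down which change lists cross that size is to count affected files per change list before parsing. The sketch below is a hypothetical Python helper, not part of Code Maat. It assumes the `p4 describe -s` layout, where each entry starts with a `Change <number> ` line and each affected file is listed on a line beginning with `... //`; the default threshold of 50 comes from the figure mentioned above.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed line shapes in the `p4 describe -s` output.
CHANGE_HEADER = re.compile(r"^Change (\d+) ")
AFFECTED_FILE = re.compile(r"^\.\.\. //")


def files_per_change(lines: Iterable[str]) -> Dict[int, int]:
    """Count affected-file lines under each 'Change <n>' header."""
    counts: Dict[int, int] = {}
    current = None
    for line in lines:
        header = CHANGE_HEADER.match(line)
        if header:
            current = int(header.group(1))
            counts[current] = 0
        elif current is not None and AFFECTED_FILE.match(line):
            counts[current] += 1
    return counts


def large_changes(lines: Iterable[str], threshold: int = 50) -> List[Tuple[int, int]]:
    """Return (change number, file count) pairs above the threshold, largest first."""
    counts = files_per_change(lines)
    over = [(change, n) for change, n in counts.items() if n > threshold]
    return sorted(over, key=lambda pair: pair[1], reverse=True)
```

Running `large_changes` over the log lines gives a short list of candidate change lists to test in isolation.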

Adam Tornhill · Answer 6 · Mon Jan 26 2015 00:32:24 GMT+0800 (China Standard Time)

Yes, please do that and I'll have a look. You can contact me at adam at adamtornhill dot com

Robert Creager · Answer 7 · Mon Jan 26 2015 07:41:35 GMT+0800 (China Standard Time)

I've sent two files, ~1MB compressed total.