adamtornhill / code-maat

A command line tool to mine and analyze data from version-control systems

Home Page: http://www.adamtornhill.com/code/codemaat.htm


Cannot work with newlines with latest parsing update

LogicalChaos opened this issue · comments

I forked the repo to implement a Perforce parser. The parser is complete, but after merging your latest changes, which split the parsing into chunks, I can't get it working again. I've found that if I remove all newlines except those between change sets, it parses again. But that takes quite a bit of data massaging to clean up the Perforce log. Do you have any suggestions? My grammar is on my perforce branch if you want to look.

Sounds cool with a Perforce parser - would definitely be a good addition.
First some background on my latest change: Instaparse is quite memory hungry. When I parsed the complete log in one pass, Code Maat ran out of memory on larger logfiles. That's why I chose to split the log into smaller parts and feed those to Instaparse one by one (you'd run into the same problem with your current Perforce parser).

I see that you re-used the hiccup-based-parser. As you probably noticed, that's the one that does the chunking. In the current version, I split the log on each blank line (see the function extend-when-complete). That works fine for both Git and Mercurial, which don't have any blank lines within their entries. But it won't work for Perforce, which includes several blank lines in each entry.
I'd suggest that you identify a different criterion that's capable of identifying the end of a Perforce entry. Then you have to parameterize the hiccup-based-parser with that criterion (end-of-log-entry? perhaps).
Does that sound reasonable?
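
To make the idea concrete, here is a rough sketch of chunking driven by a pluggable criterion instead of a hard-coded blank-line split. It is written in Python purely for illustration and is not Code Maat's Clojure code; the names chunk_entries and is_perforce_header are made up for this example. It also phrases the criterion as "this line starts the next entry" rather than "this line ends the current one", on the assumption that each Perforce entry opens with an unindented "Change <number> ..." header line.

```python
from typing import Callable, Iterable, Iterator, List


def chunk_entries(
    lines: Iterable[str],
    starts_new_entry: Callable[[str], bool],
) -> Iterator[List[str]]:
    """Yield one list of lines per log entry.

    A new chunk begins whenever starts_new_entry(line) is true, so blank
    lines inside an entry no longer terminate it.
    """
    current: List[str] = []
    for line in lines:
        if starts_new_entry(line) and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def is_perforce_header(line: str) -> bool:
    # Assumption: an entry header is unindented and looks like
    # "Change 1234 by user@client on <date>". Description lines are
    # indented, so they do not match even if they begin with "Change".
    parts = line.split()
    return (
        line.startswith("Change ")
        and len(parts) >= 2
        and parts[1].isdigit()
    )
```

Each chunk could then be handed to the parser on its own, which keeps the memory benefit of parsing entry by entry while tolerating blank lines inside an entry.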

What you said makes perfect sense, but is beyond me :-) I changed the log generation to make it consistent with the others regarding blank lines: ... | xargs -I commitid -n1 sh -c 'p4 describe -s commitid | grep -v "^\s*$" && echo ""'. If you're up for it (I'd need major help), I'd like to add churn capabilities. Perforce adds the following to each change set it describes.

Differences ...
==== //depot/project/Command.cpp#9 (text) ====
add 1 chunks 10 lines
deleted 0 chunks 0 lines
changed 0 chunks 0 / 0 lines

Thoughts?
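
As a starting point, here is a rough Python sketch (for illustration only, not Code Maat code) of how the summary lines above could be turned into per-file added/deleted counts. The function name parse_churn and the result layout are made up. One assumption to flag: a "changed N chunks X / Y lines" line is counted as X lines deleted and Y lines added, which is my reading of the output rather than something confirmed here.

```python
import re
from typing import Dict, Iterable, List, Optional

# "==== //depot/project/Command.cpp#9 (text) ===="
FILE_RE = re.compile(r"^==== (?P<path>.+?)#(?P<rev>\d+) \((?P<type>[^)]*)\) ====$")
# "add 1 chunks 10 lines" / "deleted 0 chunks 0 lines"
ADD_DEL_RE = re.compile(r"^(?P<kind>add|deleted) \d+ chunks (?P<lines>\d+) lines$")
# "changed 0 chunks 0 / 0 lines"
CHANGED_RE = re.compile(r"^changed \d+ chunks (?P<old>\d+) / (?P<new>\d+) lines$")


def parse_churn(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Collect added/deleted line counts for each file in a Differences section."""
    results: List[Dict[str, object]] = []
    current: Optional[Dict[str, object]] = None
    for raw in lines:
        line = raw.strip()
        m = FILE_RE.match(line)
        if m:
            current = {"file": m.group("path"), "added": 0, "deleted": 0}
            results.append(current)
            continue
        if current is None:
            continue  # Ignore anything before the first file header.
        m = ADD_DEL_RE.match(line)
        if m:
            key = "added" if m.group("kind") == "add" else "deleted"
            current[key] += int(m.group("lines"))
            continue
        m = CHANGED_RE.match(line)
        if m:
            # Assumption: old lines count as deleted, new lines as added.
            current["deleted"] += int(m.group("old"))
            current["added"] += int(m.group("new"))
    return results
```

For the sample above this gives a single record for //depot/project/Command.cpp with 10 lines added and 0 deleted.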

Hmmm... With the new parser, I'm getting Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, which is obviously different from the previous OOM problems. This is with change logs that failed previously. I'm looking to see if I can pinpoint what in the data is causing this. I suspect it's a change set with ~26000 files associated with it from a copy/merge operation.
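
One way to check that suspicion is to count the files listed under each change set and look at the largest ones. The Python sketch below is only an illustration; the function names are made up. It assumes the log comes from p4 describe -s, that each entry starts with an unindented "Change <number> ..." line, and that each affected file appears on a line starting with "... //".

```python
from typing import Dict, Iterable, List, Tuple


def files_per_change(lines: Iterable[str]) -> Dict[str, int]:
    """Map each change number to the number of file lines listed under it."""
    counts: Dict[str, int] = {}
    current = None
    for line in lines:
        if line.startswith("Change "):
            parts = line.split()
            if len(parts) >= 2 and parts[1].isdigit():
                current = parts[1]
                counts[current] = 0
                continue
        # Assumption: affected files look like "... //depot/path#rev action".
        if current is not None and line.startswith("... //"):
            counts[current] += 1
    return counts


def largest_changes(lines: Iterable[str], top: int = 5) -> List[Tuple[str, int]]:
    """Return the `top` change sets with the most files, largest first."""
    counts = files_per_change(lines)
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

Running largest_changes over the lines of the log file should show quickly whether one or two very large change sets stand out.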

I've seen the GC overhead limit exception as well on the earlier version of Code Maat before the memory optimization. Did you manage to get the chunking working now? That should solve this issue as well.
I'll have a look at your pull request during next week - thanks for the contribution!

Yes, I got the chunking working. The problem occurs with a log of ~1400 change lists (35k lines) when any individual change list goes over ~50 files. I can privately send you a problem file if you want.

Yes, please do that and I'll have a look. You can contact me at adam at adamtornhill dot com

I've sent two files, ~1MB compressed total.