adamtornhill / code-maat

A command line tool to mine and analyze data from version-control systems

Home Page: http://www.adamtornhill.com/code/codemaat.htm

OOM when processing really large git logs

rjayasinghe opened this issue

Hi!

I tried to process a pretty large git log from a private git repo and ran out of memory. I increased the max heap to 4 GB, but that did not help. I can't go much higher because my laptop's memory is limited.

Best Regards,
Robin

Hi @rjayasinghe

I've analyzed fairly rich Git repositories (e.g. Rails with 10 years of history, Mono with 10+ years) and Code Maat's memory usage stays around 1.3 GB on those. I think your issue has to do with some pattern in your input data combined with some inefficiency in the analysis algorithms.

What analysis did you run?

Would it be possible for you to send me the git log? That would allow me to debug it. In the meantime I'd recommend that you use a shorter analysis time span until I've addressed the real problem.
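
The shorter time span suggested above can be applied either by regenerating the log with `git log --after=<date>` or by trimming an existing log file. The sketch below takes the second route. It assumes each entry starts with a header of the form `--<hash>--<YYYY-MM-DD>--<author>` (that is, the log was produced with `--date=short` and a matching `--pretty=format:` string); if your log uses a different header, the regular expression needs adjusting. `filter_log_since` and `write_filtered` are hypothetical helper names, not part of Code Maat.

```python
import re
from datetime import date
from typing import Iterable, Iterator

# Assumed header shape: --<abbreviated hash>--<YYYY-MM-DD>--<author>
HEADER = re.compile(r"^--([0-9a-f]+)--(\d{4}-\d{2}-\d{2})--(.*)$")


def filter_log_since(lines: Iterable[str], cutoff: date) -> Iterator[str]:
    """Yield only the lines of entries dated on or after `cutoff`.

    Lines are streamed one at a time, so the full log never has to
    fit in memory.
    """
    keep = False  # lines before the first header are dropped
    for line in lines:
        match = HEADER.match(line)
        if match:
            # A header starts a new entry; decide once whether to keep it.
            keep = date.fromisoformat(match.group(2)) >= cutoff
        if keep:
            yield line


def write_filtered(src_path: str, dst_path: str, cutoff: date) -> None:
    """Copy the entries dated on or after `cutoff` from one log file to another."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(filter_log_since(src, cutoff))
```

Calling `write_filtered("full.log", "recent.log", date(2015, 1, 1))` (both file names are placeholders) would produce a smaller log that can be passed to Code Maat with `-l` in place of the full one.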

rjayasinghe commented

Hi!

Sorry, I cannot share the git log. It's built from a 10+ GB repository with ~15 years of history.

This is how I called code-maat:

java -Xmx4g -jar code-maat-0.9.2-SNAPSHOT-standalone.jar -l 

I know it's not very helpful if I cannot share the git log, but I at least wanted to report that your analysis algorithms run into problems when analyzing really large data sets.

Best Regards,
Robin

Alright, no problem. I will see if I can find some even larger open-source project where I can reproduce the problem.

Did any of the analyses work? For example, try -a identity. That would help me to isolate the potential problem.
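
When the log itself cannot be shared, a summary of its shape may still help narrow things down, for example how many entries it contains and how large the biggest single entry is. The following is a minimal sketch under the same assumption about the header format (`--<hash>--<YYYY-MM-DD>--<author>`, with one non-blank line per changed file after each header); `summarize_log` is a hypothetical helper, not a Code Maat feature.

```python
import re
from typing import Iterable, Optional

# Assumed header shape: --<abbreviated hash>--<YYYY-MM-DD>--<author>
HEADER = re.compile(r"^--[0-9a-f]+--\d{4}-\d{2}-\d{2}--")


def summarize_log(lines: Iterable[str]) -> dict:
    """Count the entries and find the one that lists the most changed files."""
    entries = 0
    current_header: Optional[str] = None
    current_size = 0
    largest_header: Optional[str] = None
    largest_size = 0
    for line in lines:
        if HEADER.match(line):
            # New entry: reset the per-entry counter.
            entries += 1
            current_header = line.strip()
            current_size = 0
        elif line.strip() and current_header is not None:
            # Any non-blank line after a header counts as one changed file.
            current_size += 1
            if current_size > largest_size:
                largest_size = current_size
                largest_header = current_header
    return {
        "entries": entries,
        "largest_entry": largest_header,
        "largest_entry_files": largest_size,
    }
```

The function reads the log line by line, so it can be fed an open file object directly. If the header line is sensitive, reporting only the two counts is enough to show whether one unusually large commit dominates the log.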

rjayasinghe commented

-a identity resulted in OOM as well:

WARNING: update already refers to: #'clojure.core/update in namespace: incanter.core, being replaced by: #'incanter.core/update
Exception in thread "main" java.lang.OutOfMemoryError
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:536)
        at java.util.concurrent.ForkJoinTask.reportResult(ForkJoinTask.java:596)
        at java.util.concurrent.ForkJoinTask.join(ForkJoinTask.java:640)
        at java.util.concurrent.ForkJoinPool.invoke(ForkJoinPool.java:1521)
        at clojure.core.reducers$fjinvoke.invoke(reducers.clj:49)
        at clojure.core.reducers$foldvec.invoke(reducers.clj:341)
        at clojure.core.reducers$fn__1915.invoke(reducers.clj:362)
        at clojure.core.reducers$fn__1798$G__1793__1809.invoke(reducers.clj:81)
        at clojure.core.reducers$fold.invoke(reducers.clj:98)
        at code_maat.parsers.hiccup_based_parser$parse_from.invoke(hiccup_based_parser.clj:139)
        at code_maat.parsers.hiccup_based_parser$parse_log.invoke(hiccup_based_parser.clj:158)
        at code_maat.parsers.git2$parse_log.invoke(git2.clj:74)
        at code_maat.app.app$git2__GT_modifications$fn__9421.invoke(app.clj:133)
        at code_maat.app.app$run_parser_in_error_handling_context.invoke(app.clj:97)
        at code_maat.app.app$git2__GT_modifications.invoke(app.clj:132)
        at code_maat.app.app$parse_commits_to_dataset.invoke(app.clj:202)
        at code_maat.app.app$run.invoke(app.clj:215)
        at code_maat.cmd_line$_main.doInvoke(cmd_line.clj:66)
        at clojure.lang.RestFn.applyTo(RestFn.java:137)
        at code_maat.cmd_line.main(Unknown Source)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at clojure.lang.PersistentHashMap.cloneAndSet(PersistentHashMap.java:1169)
        at clojure.lang.PersistentHashMap.access$000(PersistentHashMap.java:28)
        at clojure.lang.PersistentHashMap$ArrayNode.assoc(PersistentHashMap.java:418)
        at clojure.lang.PersistentHashMap.assoc(PersistentHashMap.java:142)
        at clojure.lang.PersistentHashMap.assoc(PersistentHashMap.java:28)
        at clojure.lang.RT.assoc(RT.java:778)
        at clojure.core$assoc__4142.invoke(core.clj:191)
        at clojure.lang.Atom.swap(Atom.java:65)
        at clojure.core$swap_BANG_.invoke(core.clj:2240)
        at instaparse.gll$node_get.invoke(gll.clj:286)
        at instaparse.gll$push_listener.invoke(gll.clj:339)
        at instaparse.gll$non_terminal_parse.invoke(gll.clj:818)
        at instaparse.gll$_parse.invoke(gll.clj:119)
        at instaparse.gll$push_listener$fn__1307.invoke(gll.clj:348)
        at instaparse.gll$step.invoke(gll.clj:409)
        at instaparse.gll$run.invoke(gll.clj:427)
        at instaparse.gll$run.invoke(gll.clj:413)
        at instaparse.gll$parse.invoke(gll.clj:894)
        at instaparse.core$parse.doInvoke(core.clj:91)
        at clojure.lang.RestFn.invoke(RestFn.java:425)
        at code_maat.parsers.hiccup_based_parser$parse_with.invoke(hiccup_based_parser.clj:27)
        at clojure.core$partial$fn__4527.invoke(core.clj:2493)
        at code_maat.parsers.hiccup_based_parser$parse_entry.invoke(hiccup_based_parser.clj:40)
        at code_maat.parsers.hiccup_based_parser$parse_entry_from.invoke(hiccup_based_parser.clj:47)
        at code_maat.parsers.hiccup_based_parser$parse_from$fn__1950.invoke(hiccup_based_parser.clj:144)
        at clojure.core.protocols$iter_reduce.invoke(protocols.clj:49)
        at clojure.core.protocols$fn__6510.invoke(protocols.clj:112)
        at clojure.core.protocols$fn__6452$G__6447__6465.invoke(protocols.clj:13)
        at clojure.core.reducers$reduce.invoke(reducers.clj:79)
        at clojure.core.reducers$foldvec.invoke(reducers.clj:335)
        at clojure.core.reducers$foldvec$fc__1904$fn__1905.invoke(reducers.clj:340)
        at clojure.core.reducers$foldvec$fn__1908.invoke(reducers.clj:345)

Thanks for the info, @rjayasinghe !
I've tested the last released version of Code Maat, 0.9.1, on a large repository and it seems to be able to handle it. If you have the possibility, please try version 0.9.1 (available here) and let me know if that solves your problem. We did some parallelization in the parsing stage of 0.9.2 and it might have introduced the problem (but I'm not sure yet).

rjayasinghe commented

OK. I downloaded and built 0.9.1 from GitHub. This time it ran longer. However, after ~1.5 hours the process died with:

WARNING: update already refers to: #'clojure.core/update in namespace: incanter.core, being replaced by: #'incanter.core/update
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at clojure.lang.PersistentHashMap.cloneAndSet(PersistentHashMap.java:1169)
        at clojure.lang.PersistentHashMap.access$000(PersistentHashMap.java:28)
        at clojure.lang.PersistentHashMap$ArrayNode.assoc(PersistentHashMap.java:414)
        at clojure.lang.PersistentHashMap$ArrayNode.assoc(PersistentHashMap.java:415)
        at clojure.lang.PersistentHashMap$ArrayNode.assoc(PersistentHashMap.java:415)
        at clojure.lang.PersistentHashMap.assoc(PersistentHashMap.java:142)
        at clojure.lang.PersistentHashMap.assoc(PersistentHashMap.java:28)
        at clojure.lang.RT.assoc(RT.java:778)
        at clojure.core$assoc__4142.invoke(core.clj:191)
        at clojure.lang.Atom.swap(Atom.java:65)
        at clojure.core$swap_BANG_.invoke(core.clj:2240)
        at instaparse.gll$node_get.invoke(gll.clj:286)
        at instaparse.gll$push_listener.invoke(gll.clj:339)
        at instaparse.gll$CatListener$fn__1340.invoke(gll.clj:487)
        at instaparse.gll$push_message$f__1269.invoke(gll.clj:238)
        at instaparse.gll$step.invoke(gll.clj:409)
        at instaparse.gll$run.invoke(gll.clj:427)
        at instaparse.gll$run.invoke(gll.clj:413)
        at instaparse.gll$parse.invoke(gll.clj:894)
        at instaparse.core$parse.doInvoke(core.clj:91)
        at clojure.lang.RestFn.invoke(RestFn.java:425)
        at code_maat.parsers.hiccup_based_parser$parse_with.invoke(hiccup_based_parser.clj:26)
        at clojure.core$partial$fn__4527.invoke(core.clj:2493)
        at code_maat.parsers.hiccup_based_parser$parse_entry.invoke(hiccup_based_parser.clj:47)
        at code_maat.parsers.hiccup_based_parser$parse_entry_from.invoke(hiccup_based_parser.clj:55)
        at code_maat.parsers.hiccup_based_parser$extend_when_complete.invoke(hiccup_based_parser.clj:62)
        at code_maat.parsers.hiccup_based_parser$as_entry_tokens.invoke(hiccup_based_parser.clj:82)
        at code_maat.parsers.hiccup_based_parser$parse_from.invoke(hiccup_based_parser.clj:158)
        at code_maat.parsers.hiccup_based_parser$parse_log.invoke(hiccup_based_parser.clj:172)
        at code_maat.parsers.git2$parse_log.invoke(git2.clj:74)
        at code_maat.app.app$git2__GT_modifications$fn__9279.invoke(app.clj:133)
        at code_maat.app.app$run_parser_in_error_handling_context.invoke(app.clj:97)

Best Regards,
Robin

I run out of heap space when analyzing wikimedia/mediawiki.

The evo-log file, produced as described in the book, is 23 MB.

Setting the JVM heap size in the .bat file does not fix the problem:

java -Xmx512M -Xms64M -jar t\winmaat0.8.5\code-maat-0.8.5-standalone.jar -l ../mediawiki/maat_evo.log -c git -a summary
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at clojure.lang.PersistentVector.rangedIterator(PersistentVector.java:238)
        at clojure.lang.PersistentVector.iterator(PersistentVector.java:261)
        at clojure.lang.Murmur3.hashOrdered(Murmur3.java:105)
        at clojure.lang.APersistentVector.hasheq(APersistentVector.java:166)
        at clojure.lang.Util.dohasheq(Util.java:177)
        at clojure.lang.Util.hasheq(Util.java:168)
        at clojure.lang.PersistentHashMap.hash(PersistentHashMap.java:120)
        at clojure.lang.PersistentHashMap.valAt(PersistentHashMap.java:152)
        at clojure.lang.RT.get(RT.java:672)
        at instaparse.gll$push_message.invoke(gll.clj:172)
        at instaparse.gll$push_result.invoke(gll.clj:255)
        at instaparse.gll$NodeListener$fn__588.invoke(gll.clj:374)
        at instaparse.gll$push_message$f__524.invoke(gll.clj:173)
        at instaparse.gll$step.invoke(gll.clj:328)
        at instaparse.gll$run.invoke(gll.clj:344)
        at instaparse.gll$run.invoke(gll.clj:332)
        at instaparse.gll$parse.invoke(gll.clj:758)
        at instaparse.core$parse.doInvoke(core.clj:83)
        at clojure.lang.RestFn.invoke(RestFn.java:425)
        at code_maat.parsers.hiccup_based_parser$parse_with.invoke(hiccup_based_parser.clj:26)
        at clojure.lang.AFn.applyToHelper(AFn.java:156)
        at clojure.lang.AFn.applyTo(AFn.java:144)
        at clojure.core$apply.invoke(core.clj:626)
        at clojure.core$partial$fn__4228.doInvoke(core.clj:2468)
        at clojure.lang.RestFn.invoke(RestFn.java:408)
        at code_maat.parsers.hiccup_based_parser$parse_entry.invoke(hiccup_based_parser.clj:48)
        at code_maat.parsers.hiccup_based_parser$parse_entry_from.invoke(hiccup_based_parser.clj:54)
        at code_maat.parsers.hiccup_based_parser$extend_when_complete.invoke(hiccup_based_parser.clj:62)
        at code_maat.parsers.hiccup_based_parser$as_entry_tokens.invoke(hiccup_based_parser.clj:82)
        at code_maat.parsers.hiccup_based_parser$parse_from.invoke(hiccup_based_parser.clj:157)
        at code_maat.parsers.hiccup_based_parser$parse_log.invoke(hiccup_based_parser.clj:172)
        at code_maat.parsers.git$parse_log.invoke(git.clj:62)

This is running the version downloaded from https://www.adamtornhill.com/code/crimescenetools.htm.
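
The traces in this thread are long, and most frames belong to the Clojure runtime. To compare them, it can help to keep only the frames from the libraries of interest. Filtered that way, each of the three traces above shows instaparse frames running underneath `hiccup_based_parser$parse_entry`, which suggests (but does not prove) that the memory is consumed while parsing log entries rather than in a later analysis step. The sketch below is one way to do that filtering; `frames_from` and its default prefixes are illustrative choices, not part of any tool.

```python
from typing import Iterable, List, Sequence


def frames_from(
    trace_lines: Iterable[str],
    prefixes: Sequence[str] = ("code_maat.", "instaparse."),
) -> List[str]:
    """Return the stack frames whose class name starts with one of `prefixes`.

    Only lines of the form "at <frame>" are considered; exception
    headers such as "Caused by: ..." are skipped.
    """
    wanted = tuple(prefixes)
    frames = []
    for line in trace_lines:
        text = line.strip()
        if text.startswith("at "):
            frame = text[3:]
            if frame.startswith(wanted):
                frames.append(frame)
    return frames
```

Passing the lines of a saved trace to `frames_from` returns the matching frames in their original order, innermost first.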