JerryLead / SparkProfiler

Profiling Spark Applications for Performance Comparison and Diagnosis

How can I use this code to profile my GC?

guimaluf opened this issue

Hi all,

I read the article 'An Experimental Evaluation of GC on Big Data Applications' and I would like to reproduce part of it in my setup.
It isn't clear to me how to use the SparkProfiler.jar package: how it collects GC stats, where it prints its output, etc.

Thank you for the research; I'd appreciate any help.

@guimaluf

Hi guimaluf, thanks for your interest in our work.

I'm sorry that this profiler is a little complex to use: I developed a number of parsers and analyzers to obtain statistics from task logs, GC logs, CPU logs, etc. Some of them produce the statistical results presented in our paper, while others are obsolete. The usage of the profiler is as follows.

After running a Spark application, e.g., app-20170623113634-0010, we first run SparkAppJsonSaver.java to save the application's performance metrics (e.g., application execution time, stage metrics, task metrics in each stage, executor metrics, etc.) via the REST APIs (see http://spark.apache.org/docs/latest/monitoring.html) into a directory (e.g., APPdir). SparkAppJsonSaver.java also fetches the GC log of each executor into a file (e.g., APPdir/executors/executor-id/stdout), provided we have enabled GC logging on the executors via JVM options such as spark.executor.extraJavaOptions="-XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime". Note that each executor is a JVM, so its GC activities are logged in the executor's stdout.
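For reference, here is a minimal sketch of the idea behind SparkAppJsonSaver.java (not the actual class): the endpoint names come from the monitoring docs linked above, while the history server address and the file layout under APPdir are only illustrative assumptions.

```python
import json
import os
from urllib.request import urlopen

APP_ID = "app-20170623113634-0010"
BASE = "http://localhost:18080/api/v1/applications/" + APP_ID  # assumed history server
APP_DIR = "APPdir"  # illustrative output directory

def save(endpoint, filename):
    # Fetch one REST endpoint and save the JSON response under APP_DIR.
    with urlopen(BASE + endpoint) as resp:
        data = json.load(resp)
    os.makedirs(APP_DIR, exist_ok=True)
    with open(os.path.join(APP_DIR, filename), "w") as f:
        json.dump(data, f, indent=2)

save("/jobs", "jobs.json")            # job-level metrics
save("/stages", "stages.json")        # per-stage (and per-task) metrics
save("/executors", "executors.json")  # executor metrics, incl. total GC time
```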

After that, we use SparkAppProfiler.java to analyze the metrics and output interesting statistics. In particular, for GC analysis, we use the GC log parsers in src/main/generalGC to parse the GC log of each executor into formatted statistics, such as

[Young](YGC) time = 2.083, beforeGC = 126.4658203125, afterGC = 14.2392578125, allocated = 151.25, gcPause = 0.0663858s, gcCause = GC (Allocation Failure) 
[Young](YGC) time = 2.877, beforeGC = 141.3876953125, afterGC = 9.5703125, allocated = 151.25, gcPause = 0.1134074s, gcCause = GC (Allocation Failure) 
[Young](FGC) time = 3.01, beforeGC = 26.33984375, afterGC = 26.33984375, allocated = 151.25, gcPause = 0.0014977s, gcCause = GC (CMS Initial Mark) 
[Young](YGC) time = 4.527, beforeGC = 144.0703125, afterGC = 10.8642578125, allocated = 151.25, gcPause = 0.1209985s, gcCause = GC (Allocation Failure)

These formatted statistics record the GC pause time and the related memory usage before and after each young/full GC pause.
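If you want to consume these lines programmatically, a hypothetical parser (not the one in src/main/generalGC) could look like the following; the field names simply mirror the sample output above.

```python
import re

# Matches lines like:
# [Young](YGC) time = 2.083, beforeGC = 126.46..., afterGC = 14.23..., ...
LINE_RE = re.compile(
    r"\[(?P<gen>\w+)\]\((?P<kind>\w+)\) time = (?P<time>[\d.]+), "
    r"beforeGC = (?P<beforeGC>[\d.]+), afterGC = (?P<afterGC>[\d.]+), "
    r"allocated = (?P<allocated>[\d.]+), gcPause = (?P<gcPause>[\d.]+)s, "
    r"gcCause = (?P<gcCause>.+)"
)

def parse_gc_stats(path):
    # Parse a formatted statistics file into a list of dicts.
    records = []
    with open(path) as f:
        for line in f:
            m = LINE_RE.match(line.strip())
            if m:
                rec = m.groupdict()
                for key in ("time", "beforeGC", "afterGC", "allocated", "gcPause"):
                    rec[key] = float(rec[key])
                records.append(rec)
    return records
```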

Finally, we can use the Python code in src/python to plot the GC curves like those in Figure 7 of our paper.
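As a rough stand-in for the scripts in src/python (not reproduced here), a minimal matplotlib sketch over the parsed records might look like this; I am assuming the memory values are in MB, and the input path below is hypothetical.

```python
import matplotlib.pyplot as plt

def plot_gc_curve(records, out_png="gc_curve.png"):
    # Plot heap usage before/after each GC pause against elapsed time.
    times = [r["time"] for r in records]
    plt.plot(times, [r["beforeGC"] for r in records], "r.-", label="before GC")
    plt.plot(times, [r["afterGC"] for r in records], "b.-", label="after GC")
    plt.xlabel("elapsed time (s)")
    plt.ylabel("young generation usage (MB, assumed)")
    plt.legend()
    plt.savefig(out_png)

# Hypothetical path to a parsed statistics file produced by parse_gc_stats's input step.
plot_gc_curve(parse_gc_stats("APPdir/executors/1/parsed-gc.txt"))
```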

In general, this profiler covers almost all the fine-grained metrics of a Spark application, including application, stage, task, and executor metrics. If you want to analyze the GC logs of executors, please refer to the parsers in src/main/generalGC. If you only want to observe the GC metrics of some executors, you can also try https://gceasy.io/, a GUI-based general-purpose GC analyzer.