qubole / sparklens

Qubole Sparklens tool for performance tuning Apache Spark

Home Page: http://sparklens.qubole.com

MaxTaskMem 0.0 KB always

kretes opened this issue

Hello there! Thanks for the tool, it looks promising. Nice that it works on historical event data!

I tried that and got 'Max memory which an executor could have taken = 0.0 KB', and 0.0 KB everywhere in the 'MaxTaskMem' column, which is obviously wrong since the job ran and used some memory.
The other metrics look at least sane.

I wonder what could be the cause?

My ultimate goal is to find out how much I can lower the memory footprint of a job without getting an OOM; this 'MaxTaskMem' field looks like a good fit for that, right?

Thanks @kretes for trying this out. We have seen this issue but never debugged it. I took a look. It seems that JsonProtocol.scala, the main class responsible for converting events to JSON (which gets logged into the event log files), does not handle the peak execution memory metric (part of TaskMetrics). We can try to get this patched in open source Spark, but that will take its own sweet time.

For now, the only useful way to get this value is to run sparklens as part of your application. It is possible to configure sparklens not to run simulations. In this mode, it will save a sparklens output file (typically a few MBs in size). You can use this output file just like the event log history files and get the same output (with simulation) at any later point in time.
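For reference, here is a minimal sketch of attaching sparklens to a running application. The listener class and configuration keys below are assumptions based on the sparklens README and may differ between versions, so please verify them for your release.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Attach the sparklens listener to the job under diagnosis (assumed class name
// and config keys; check the sparklens README for your version).
val conf = new SparkConf()
  .set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
  .set("spark.sparklens.reporting.disabled", "true") // only dump the sparklens data file, skip the in-app report
  .set("spark.sparklens.data.dir", "/tmp/sparklens") // where the data file is written (assumed default)

val spark = SparkSession.builder()
  .config(conf)
  .appName("job-under-diagnosis")
  .getOrCreate()

The saved data file can then be fed back to sparklens later to produce the full report, including MaxTaskMem.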

Thanks again for sharing your findings.

Thanks @iamrohit - it works when run together with the diagnosed job.

Still, I wonder how I should interpret MaxTaskMem.
In the code, sparklens does the following:

val maxTaskMemory = sts.taskPeakMemoryUsage.take(executorCores.toInt).sum // this could be at different times?

And if it is a sum of as many peakMemoryUsage values as there are executorCores, then it is not a 'task' metric, since one task runs on one core, right?

I got a result of 16 GB in one stage, and even though the executor has that much memory, it is not all available to a single task if several of them are running concurrently on the same executor.

MaxTaskMem is a very wrong name! Let me first take that as an action item.

What is MaxTaskMem?
Executor memory is divided into user memory and Spark memory. The Spark memory is further divided into native and on-heap, and also into storage and execution. For every task, Spark reports the maximum (peak) execution memory used by that task. The goal of MaxTaskMem is to figure out the worst-case execution memory requirement per stage for any executor. We compute this number by sorting all tasks of a stage by peakExecutionMemory and picking the top N values, where N is the number of cores per executor. Essentially: if all the largest tasks of a stage end up running on a single executor, how much execution memory would they need?
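A minimal sketch of that calculation (illustrative names only, not the actual sparklens internals):

// Worst-case execution memory an executor could need for one stage: the sum of
// the N largest per-task peak execution memory values, N = cores per executor.
def worstCaseExecutorExecMem(taskPeakExecMemBytes: Seq[Long],
                             coresPerExecutor: Int): Long =
  taskPeakExecMemBytes.sorted(Ordering[Long].reverse)
    .take(coresPerExecutor)
    .sum

// Example with 4 cores per executor (values in bytes, purely illustrative):
val peaks = Seq(512L, 2048L, 1024L, 4096L, 256L, 8192L)
val worstCase = worstCaseExecutorExecMem(peaks, coresPerExecutor = 4)
// 8192 + 4096 + 2048 + 1024 = 15360 bytes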

Usually, by default, Spark allocates around 60-70% of executor memory for execution + storage. This means that if we pick the max value of MaxTaskMem across all stages and, say, double it, we are certain that in no stage can any possible combination of N tasks (where N = number of cores per executor) cause the executor to run out of execution memory. Note that we are not accounting for user memory, so this doesn't really mean no OOM; that still remains a possibility. It probably does mean no spill. But then again, spill is mostly harmless: it may slow things down, but it will not kill the executor.
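As a back-of-the-envelope illustration of the rule of thumb above (a sketch assuming Spark's default unified memory settings, roughly spark.memory.fraction = 0.6 with about 300 MB reserved; user memory is ignored, as noted):

// Rough executor heap sizing from the worst-case execution memory need.
// The defaults below are assumptions; tune them to your Spark configuration.
def suggestedExecutorHeapBytes(maxTaskMemAcrossStages: Long,
                               memoryFraction: Double = 0.6,
                               reservedBytes: Long = 300L * 1024 * 1024): Long =
  (maxTaskMemAcrossStages / memoryFraction).toLong + reservedBytes

// Example: a 6 GB worst case suggests roughly a 10.3 GB executor heap.
val heap = suggestedExecutorHeapBytes(6L * 1024 * 1024 * 1024)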

Please take a look here for some more information: https://www.qubole.com/blog/an-introduction-to-apache-spark-optimization-in-qubole/

I will probably be raising a pull request for a different feature which might be useful for memory management in Spark. We call it GC-aware task scheduling. Essentially, instead of right-sizing the memory per executor, this feature changes the number of active cores per executor dynamically, thus making memory available for huge tasks at the expense of some loss of concurrency. Hopefully this will settle the question of memory tuning in Spark once and for all.

Hello!

Thanks for the thorough explanation.

I would then call this metric 'WorstCaseExecutorMemUsage'. And this is actually what I am looking for, although in my case the worst case didn't happen at all, since it was way more than the executor memory.
I will be looking at the result of this metric in a few more jobs.

The feature you are describing sounds like a good thing to do! Looking forward to it!

@iamrohit - Would the calculation of maxMem in StageSkewAnalyzer.scala still be valid in case there are different (independent) stages executing in parallel? Currently sparklens looks at stage-level maxMemory and then finds the global maximum. What if there are 2 or more tasks belonging to different stages which get scheduled on the same executor? In that case, should we take the sum over the (vcore number of) tasks with the highest peakExecutionMemory across those stages?

You are right @anupamaggarwal. We should probably change the logic to account for parallel stages. What we have seen in practice is that this worst case is typically very unlikely, and even if it occurs and an executor goes out of memory, the execution typically proceeds because some of the tasks will go to other executors on retry. Feel free to raise a PR.
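A hedged sketch of the adjustment being discussed (illustrative names and types, not the actual sparklens code): pool the per-task peak execution memory of stages whose execution windows overlap, then take the top N over the pooled set.

// Illustrative only: pool tasks from stages that overlap in time, then take
// the N largest peaks (N = cores per executor) as the worst case.
// Assumes at least one stage is passed in.
case class StageWindow(startMs: Long, endMs: Long, taskPeakExecMemBytes: Seq[Long])

def overlaps(a: StageWindow, b: StageWindow): Boolean =
  a.startMs < b.endMs && b.startMs < a.endMs

def worstCaseWithParallelStages(stages: Seq[StageWindow],
                                coresPerExecutor: Int): Long =
  stages.map { s =>
    val pooled = stages
      .filter(o => (o eq s) || overlaps(s, o))
      .flatMap(_.taskPeakExecMemBytes)
    pooled.sorted(Ordering[Long].reverse).take(coresPerExecutor).sum
  }.max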

I know this is a very old issue, but are there any plans to fix this? This is an extremely useful metric to have.