dibbhatt / kafka-spark-consumer

High Performance Kafka Connector for Spark Streaming. Supports Multi Topic Fetch, Kafka Security. Reliable offset management in Zookeeper. No data loss. No dependency on HDFS and WAL. In-built PID rate controller. Supports Message Handler. Offset Lag checker.


Getting GC OutOfMemory exception

sorabh89 opened this issue · comments

Hi,

I am using this consumer with the following settings:

.set("spark.cleaner.ttl", "800")
.set("spark.executor.memory", "8g")
.set("spark.driver.memory", "8g")
.set("spark.driver.maxResultSize", "10g")

I need to do my aggregations on 10 minutes of data, so I am creating the JavaStreamingContext with a 10-minute batch duration (600000 milliseconds).
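For reference, here is a minimal sketch of the setup described above, assuming a plain Java driver; the class name, app name, and the elided DStream/aggregation code are illustrative only:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class TenMinuteAggregationJob {          // hypothetical class name
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("kafka-spark-consumer-aggregation")   // illustrative app name
                .set("spark.cleaner.ttl", "800")
                .set("spark.executor.memory", "8g")
                .set("spark.driver.memory", "8g")        // note: driver memory is normally only
                                                         // effective when set via spark-submit
                .set("spark.driver.maxResultSize", "10g");

        // 10-minute batch duration (600000 ms), as described above.
        JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(600000));

        // ... build the Kafka DStream (ReceiverLauncher, shown later) and the aggregations here ...

        jsc.start();
        jsc.awaitTermination();
    }
}
```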

The data is around 5-7 lakh (500,000-700,000) records every 10 minutes.

The program is throwing a GC OutOfMemory exception, so I increased the JVM heap size to 16 GB.
But I am still getting the same error after 20-30 minutes.

And when I checked the memory consumption, this process was consuming around 57-58 percent of my memory. My machine's total memory is 30 GB.

Please let me know what could be the reason behind this.

Thanks,

How many partitions do you have in your topic?

The issue here is that you are consuming at a much faster rate than you can process each batch, so back pressure builds up in your memory. There is an existing JIRA which tries to tackle the back-pressure issue, but until it gets done, you need to lower your data ingestion rate using the following two properties. You can find more details in the Readme file.

consumer.fetchsizebytes : default is 512 KB
consumer.fillfreqms : default is 250 ms

You can set these in the Properties object which you send to ReceiverLauncher.
What this means is that every Receiver fetches 512 KB of data every 250 ms during each fill, so your block size is 512 KB.
Let's assume your Kafka topic has 5 partitions and your Spark batch duration is, say, 10 seconds; this consumer will then pull

512 KB x (10 seconds / 250 ms) x 5 = 100 MB of data for every batch.

You can control your block size and the number of blocks per batch using these two parameters.

As your ingestion rate is much higher than your processing rate, you can do the following tuning:

  1. Increase consumer.fillfreqms to, say, 500 ms. This will create fewer blocks per batch duration.
  2. Decrease consumer.fetchsizebytes to, say, 256 KB (262144 bytes). This will fetch less data for every block.

You can try either one or use both, based on your data processing rate. It takes a little trial and error to arrive at an ideal number; a sketch of how to set these is shown below.
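A minimal sketch of how these two properties can be passed to ReceiverLauncher follows. Only consumer.fetchsizebytes and consumer.fillfreqms come from this thread; the package name and the ZooKeeper/topic keys are assumptions based on the project's Readme, and the host, topic, and consumer-id values are placeholders to adjust to your environment:

```java
import java.util.Properties;

import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import consumer.kafka.ReceiverLauncher;   // package name assumed from the project's Readme

public class KafkaStreamBuilder {
    @SuppressWarnings({"rawtypes", "unchecked"})
    public static JavaDStream buildStream(JavaStreamingContext jsc, int numberOfReceivers) {
        Properties props = new Properties();
        // Connection settings -- placeholders, adjust to your cluster.
        props.put("zookeeper.hosts", "zkhost1,zkhost2");
        props.put("zookeeper.port", "2181");
        props.put("kafka.topic", "mytopic");
        props.put("kafka.consumer.id", "my-consumer-id");

        // The two knobs discussed above: smaller blocks, filled less often.
        props.put("consumer.fetchsizebytes", "262144");  // 256 KB per fill
        props.put("consumer.fillfreqms", "500");         // fill every 500 ms

        // Same launch call as shown later in this thread.
        return ReceiverLauncher.launch(jsc, props, numberOfReceivers,
                StorageLevel.MEMORY_AND_DISK_SER());
    }
}
```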

Hope this will help.

How do I find the data processing rate of my application? Even after trying a lot of combinations I'm still not able to keep my memory consumption within limits. The rate has changed for sure: earlier it was throwing the error in 20-30 minutes, now it's throwing the same error in 1-2 hours.

Since I need to do my aggregations on 10 minutes of data, it will definitely be around 1 GB of data; how do I deal with that?

I am using 10 receivers.

Have you tried the Spark UI? There you can see nice stats about your processing delay, scheduling delay and other metrics.

After reducing the batch size I'm not getting any error, but because of this change I'm also not able to achieve real-time processing.
And since it's an ever-running process, the lag will keep on increasing.

Please suggest a way to process huge amounts of real-time data.

Is using a Spark cluster going to help me achieve this?

The main issue is in your computation logic, which seems to not be optimized. If even a reduced batch size does not help the total delay, you need to revisit how you are doing the DStream processing, or there may be an issue with wrong cluster sizing. By the way, how many cores do you have in your cluster, and how many are used by Receivers?

Here are some good tuning options which may help you: http://www.virdata.com/tuning-spark/

Thanks Dibbhatt,

This application is working fine standalone, but as soon as I deploy it in a clustered environment I started getting the following error:

java.lang.Exception: Could not compute split, block input-0-1436371770680 not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

And another error in the worker log:

ERROR FileAppender: Error writing stream to file spark-1.3.1-bin-hadoop2.6/work/app-20150708144108-0000/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/07/08 14:42:03 INFO Worker: Asked to launch executor app-20150708144200-0001/0 for KafkaReceiver

After searching a lot, one thing I understood is that it's an issue with the data being spilled out to disk.

So I reduced the batch size to an extreme low, but I'm still getting this error.

Spark drops older blocks from memory on an LRU basis. You can use the MEMORY_AND_DISK storage level to solve this.

I am using the MEMORY_AND_DISK_SER storage level; isn't that better?

SER takes less storage space but is more CPU intensive, so reading and writing are much slower, whereas a non-serialized block takes more memory but is faster to read/write...

Are you saying that even after using MEMORY_AND_DISK_SER you still get the BlockNotFound exception?

Are you setting the StorageLevel with ReceiverLauncher like this?

JavaDStream unionStreams = ReceiverLauncher.launch(jsc, props, numberOfReceivers, StorageLevel.MEMORY_AND_DISK_SER());

YES, I'm setting it like this.

I am getting this error when I use StorageLevel.MEMORY_ONLY().
I'm not able to understand the reason behind this, because everything I've understood about this error relates to disk storage, but I'm getting it even with MEMORY_ONLY.
When zero records are coming from the stream, it works fine.
I tried to debug it and found that this error is raised by rdd.count().

You need to understand how Spark does memory management. It's a little complicated and has some issues.

There are two possible cases in which blocks may get dropped or not stored in memory.

Case 1: While writing a block with MEMORY_ONLY_* settings, if the node's BlockManager does not have enough memory, the block won't be stored in memory and the Receiver will throw an error while writing the block. If the storage level uses disk (as with MEMORY_AND_DISK_*), the block is written to disk only if memory is full.

Case 2: Now let's say that, for either MEMORY_ONLY_* or MEMORY_AND_DISK_* settings, blocks were successfully stored in memory as in Case 1. If memory usage goes beyond a certain threshold, the BlockManager starts dropping LRU blocks from memory that were successfully stored while receiving.

The primary issue here is that, while dropping a block in Case 2, Spark does not check whether the storage level is MEMORY_AND_DISK_*; even with DISK* storage levels, blocks are dropped from memory without being written to disk. Or, I believe, the issue in the first place is that blocks are NOT written to disk simultaneously while being written in Case 1.

What you are seeing is Case 2: blocks are chosen to be evicted from memory as memory pressure increases, those evicted blocks are unprocessed, and when the job tries to find them it does not find them in memory, and the same blocks do not have a replica anywhere else.

So there are two possible ways to fix this:

  1. Replicate the block: use MEMORY_ONLY_2 (or MEMORY_AND_DISK_2), so that the same block is replicated to a remote BlockManager and you have a lower chance of the BlockNotFound error.
  2. Use the WriteAheadLog: set spark.streaming.receiver.writeAheadLog.enable to true and use HDFS as the checkpoint directory. As the WAL will have a copy of every in-memory block, you can just use the MEMORY_ONLY setting (no need for the DISK settings).

I hope either of these will solve your problem.
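Under the assumption that an HDFS checkpoint directory is available, a minimal sketch of workaround 2 could look like this; the checkpoint path, app name, and receiver count are placeholders, and for workaround 1 you would instead drop the WAL setting and pass StorageLevel.MEMORY_ONLY_2():

```java
import java.util.Properties;

import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import consumer.kafka.ReceiverLauncher;   // package name assumed from the project's Readme

public class WalEnabledConsumer {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("kafka-spark-consumer-wal")
                // Keep a write-ahead copy of every received block.
                .set("spark.streaming.receiver.writeAheadLog.enable", "true");

        JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(600000));
        // The WAL is written under the checkpoint directory, so point it at HDFS.
        jsc.checkpoint("hdfs://namenode:8020/spark/checkpoint");   // placeholder path

        Properties props = new Properties();
        // ... populate the ZooKeeper/topic settings as in the earlier sketch ...
        int numberOfReceivers = 5;                                 // placeholder

        // With the WAL enabled, MEMORY_ONLY is sufficient; no DISK storage level needed.
        JavaDStream unionStreams = ReceiverLauncher.launch(
                jsc, props, numberOfReceivers, StorageLevel.MEMORY_ONLY());

        unionStreams.print();   // placeholder action so the stream is materialized

        jsc.start();
        jsc.awaitTermination();
    }
}
```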

Hi,

Just to reconfirm, do you see this BlockNotFound exception even in the MEMORY_AND_DISK_SER StorageLevel case? What I explained above I clarified with the Spark user group, and their experts say that in the case of MEMORY_AND_DISK* settings, even if blocks are dropped from memory as I said for Case 2, they are written back to disk. But if you are seeing the issue, I guess there is a bug.

Nevertheless, you can try the workarounds I mentioned to see if they help.

I'm able to process it, but I'm facing problems with the processing rate. I believe it is because of GC; the UI shows that the GC time and the processing time are almost equal.
As per your comments this is because of the code that I've written, but all of that is part of my requirement.

I am doing a lot of aggregations on that stream, so the processing will definitely be slow, and I am getting a huge amount of data as well, but this is what my requirement is.
So what should I do to increase the processing rate so that I can do all the required aggregations on a huge amount of data without any delay?

Should I increase the number of worker nodes in my cluster, or something else?

I am also a bit confused about the cluster configuration:
I am using 2 worker nodes with 4 worker instances on each, 8 GB memory on each, and 16 cores.

I am also a bit confused about the use of the appropriate methods,
like mapToPair() and transformToPair(): both will give me the same result, but which one is better?

hi @jedisct1 @akhld @sorabh89

I have just released version 1.0.4 of Kafka Spark Consumer on spark-packages:

http://spark-packages.org/package/dibbhatt/kafka-spark-consumer

This consumer can now control the Spark memory back-pressure problem with an in-built PID (Proportional, Integral, Derivative) controller that rate-limits the block size at run time. Please refer to the Readme for how this differs from the existing out-of-the-box Spark Kafka consumer, and how it differs from the upcoming Spark 1.5 back-pressure implementation.
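For context only, the textbook PID control law computes a correction u(t) from the error e(t) between the desired and the observed processing rate; the exact error terms, gains and sampling used by this consumer are described in its Readme and source, so the formula below is the general form, not this consumer's specific implementation:

```latex
u(t) = K_p \, e(t) \;+\; K_i \int_0^{t} e(\tau)\, d\tau \;+\; K_d \, \frac{d e(t)}{d t}
```

Here K_p, K_i and K_d are the proportional, integral and derivative gains, and u(t) is the adjustment applied to the block size for the next interval.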

This is awesome, thanks! Gonna give it a spin today.

Cool. Let me know how this goes. The latest version is 1.0.4.

The MVN and SBT dependencies are in the Spark-Packages link mentioned above.

Running it in production. So far so good!

Nice .. Thanks for trying it out .


Cool man.