apache / datasketches-java

A software library of stochastic streaming algorithms, a.k.a. sketches.

Home Page: https://datasketches.apache.org

Problems with using DataSketches in Spark applications

priyamtejaswin opened this issue

Hi,

I'm using Theta sketches in my Spark application. I started by following the outline described in the documentation's example of using ThetaSketch in Spark.

The sketches are made serializable through Java's ObjectInputStream and ObjectOutputStream. Because Spark also uses these for its own serialization and deserialization (during shuffling, etc.), I am hitting the size limit for the stream. The limit is 2 GB and is set by the JDK.
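For readers unfamiliar with the pattern, here is a minimal sketch of it using only the JDK. The class name `SketchBytesWrapper` and its helper methods are hypothetical, and a plain byte array stands in for a sketch's serialized image; none of this is taken from the DataSketches library or from the Spark example referred to above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

/** Hypothetical wrapper that carries a sketch as an opaque byte image. */
public class SketchBytesWrapper implements Serializable {
    private static final long serialVersionUID = 1L;

    // Assumption: stands in for the bytes a real sketch would produce.
    // Marked transient so the custom methods below control the wire format.
    private transient byte[] image;

    public SketchBytesWrapper(byte[] image) {
        this.image = image.clone();
    }

    public byte[] getImage() {
        return image.clone();
    }

    // Called by ObjectOutputStream: write a length prefix, then the raw bytes.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeInt(image.length);
        out.write(image);
    }

    // Called by ObjectInputStream: read the length prefix, then exactly that many bytes.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        int length = in.readInt();
        image = new byte[length];
        in.readFully(image);
    }

    /** Serializes the wrapper into an in-memory buffer using Java serialization. */
    public static byte[] serialize(SketchBytesWrapper wrapper) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(wrapper);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds a wrapper from bytes produced by serialize(). */
    public static SketchBytesWrapper deserialize(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (SketchBytesWrapper) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] payload = {10, 20, 30, 40};
        SketchBytesWrapper copy = deserialize(serialize(new SketchBytesWrapper(payload)));
        System.out.println("round trip preserved bytes: " + Arrays.equals(payload, copy.getImage()));
    }
}
```

Note that the payload here is a single `byte[]`, and Java arrays are indexed by `int`, so any one array (including the buffer inside `ByteArrayOutputStream`) cannot exceed roughly 2 GB regardless of how the wrapper is written.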

I was wondering what other options exist for massively parallelizing sketches inside Spark apps.

Any thoughts or ideas are welcome. Thanks!

Sorry for the late response, but we are not familiar enough with Spark’s internals to give much guidance here. I would try some of the Spark community blogs or mailing lists.