sryza / aas

Code to accompany Advanced Analytics with Spark from O'Reilly Media

Example from “Adv. Analytics with Spark”, Chapter 9 fails

mosicr opened this issue

(Running the trials part)

val trials = seedRdd.flatMap(trialReturns(_, numTrials / parallelism, bFactorWeights.value, factorMeans, factorCov))

org.apache.spark.SparkException: Task not serializable

More detail:
Caused by: java.io.NotSerializableException: org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression
Serialization stack:
- object not serializable (class: org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression, value: org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression@5ed389d6)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 1)
- field (class: scala.collection.mutable.ArrayBuffer, name: array, type: class [Ljava.lang.Object;)
- object (class scala.collection.mutable.ArrayBuffer, ArrayBuffer(org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression@5ed389d6))

Thanks for the detail. I suspect I can find a way to edit the code to avoid whatever causes this in all cases.

How are you running this -- in the shell? I don't see here how it would become part of something that's serialized.

Yes, I am running it in the shell.
spark-shell --jars ./jars/nscala-time_2.10-0.2.0.jar ./jars/jfreechart-1.0.14.jar ./jars/breeze_2.11-0.11.2.jar

then
:load /home/zkvsijk/source/MonteCarloSource

MonteCarloSource contains code from Chapter 9.

Can you share this source file so I see exactly what you're running?

MonteCarloSource.scala // source file
MonteCarlo.out // output
MonteCarlo.zip

@sryza can I ask you to look at this? I can't find the crude oil TSV data, for example, which I need in order to reproduce this from the source code above.

It shouldn't be a problem, but in the shell, something's causing the model objects to get into a closure even though they're not used with Spark. As a guess at a workaround, you might avoid creating the models reference by changing:

val models = stocksReturns.map(linearModel(_, factorFeatures))
val factorWeights = models.map(_.estimateRegressionParameters()).
  toArray

to

val factorWeights = stocksReturns.map(linearModel(_, factorFeatures)).map(_.estimateRegressionParameters()).
  toArray

Shouldn't be necessary but we're trying to avoid issues with the closure that are specific to the shell.
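
One way to picture why a lingering reference could matter: a closure that reads a field of an object captures the whole object, so every other field gets serialized along with it. The sketch below reproduces that with plain Java serialization and no Spark. Model, KeepsModels, DropsModels and ClosureDemo are hypothetical names standing in for OLSMultipleLinearRegression and for whatever wrapper the shell generates around entered lines; how the shell actually wraps them is an assumption here, not something confirmed in this thread.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for OLSMultipleLinearRegression: not Serializable.
class Model { def params: Array[Double] = Array(0.5, 1.5) }

// Hypothetical stand-in for a shell wrapper that keeps the models around.
class KeepsModels extends Serializable {
  val models: Seq[Model] = Seq(new Model)        // lingering non-serializable field
  val weights: Array[Double] = models.head.params
  // Reading the field `weights` makes the closure capture `this`,
  // so `models` is pulled into serialization as well.
  def closure: Double => Double = x => x * weights(0)
}

// Same computation, but the Model is only a temporary, never a field.
class DropsModels extends Serializable {
  val weights: Array[Double] = Seq(new Model).map(_.params).head
  def closure: Double => Double = x => x * weights(0)
}

object ClosureDemo {
  // Returns true if Java serialization of `obj` succeeds.
  def serializes(obj: AnyRef): Boolean =
    try { new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj); true }
    catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    println(s"closure next to a models field: ${serializes(new KeepsModels().closure)}")
    println(s"closure without a models field: ${serializes(new DropsModels().closure)}")
  }
}
```

If the shell behaves like KeepsModels, chaining the two map calls as suggested above plays the role of DropsModels: the regression objects never become a named value that a later closure can drag along.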

Tried above, fails with:
linearModel: (instrument: Array[Double], factorMatrix: Array[Array[Double]])org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression
<console>:44: error: not found: value toArray
toArray
^

Oh, I think you copied and pasted it literally with the line break. It needs to be one command. It's just trying to avoid the models reference.

It is all in one line:
val factorWeights = stocksReturns.map(linearModel(_, factorFeatures)).map(_.estimateRegressionParameters()).toArray
still getting:
linearModel: (instrument: Array[Double], factorMatrix: Array[Array[Double]])org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression
<console>:72: error: not found: value toArray
toArray
^

Hm, that does mean "toArray" has been entered as a statement by itself. Are you certain? The statement here should not be able to generate that error.

Yes, I am quite certain:
[image attachment]

That code should be valid and equivalent to the existing code. Right? Something else funny must be going on in how it is being entered in the shell.

I don't know how to proceed on this one except to make the change I suggested above. It can't hurt, at least. Will do that. At least it would give you an unambiguously working piece of code to compare against.

Hello Everyone,

Does this book have a Python version of the code? If so, please help me get a copy.

Thanks & Best Regards
Mukesh Ranjan

Hi @mukesh-ranjan,

The book only has Scala versions of the code.

-Sandy

Hi @srowen,
I got this error:
"object not serializable (class: org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression, value: org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression"

You are right, it's the models reference.
I used your code "val factorWeights = stocksReturns.map(linearModel(_, factorFeatures)).map(_.estimateRegressionParameters()).toArray"
and then
"val trials = seedRdd.flatMap..." ran successfully.

Thank you