dlwh / epic

**Archived.** Epic is a high-performance statistical parser written in Scala, along with a framework for building complex structured prediction models.

Home Page: http://scalanlp.org/

ClassCastException loading model in Apache Spark

timcroydon opened this issue · comments

Hi there,

I'm trying to use epic in an Apache Spark Streaming environment but I'm experiencing some difficulty loading the models. I'm not really sure whether this is an Epic issue, a Breeze issue, a Spark issue or where/how to solve this now! I get the following exception (for English NER):


Exception in thread "main" java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.HashMap$SerializationProxy to field epic.features.BrownClusterFeaturizer.epic$features$BrownClusterFeaturizer$$clusterFeatures of type scala.collection.immutable.Map in instance of epic.features.BrownClusterFeaturizer
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
    ... trimmed ...
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at breeze.util.package$.readObject(package.scala:21)
    at epic.models.package$.deserialize(package.scala:54)
        ... trimmed calls from my code ...

I've tried running my code (compiled into uberjar using 'sbt assembly') in a raw scala console and I can load the model and run it fine. However, using Spark, I get the exception described. The ONLY difference as far as I can tell is the way the model file is referenced. For the raw scala environment, I can point directly at the model file on disk (e.g. new File("mymodels/model.ser.gz")) and it loads. In Spark, I have to load the file doing something similar to:

sc.addFile("model.ser.gz")
new File(SparkFiles.get("model.ser.gz"))

I've tried narrowing the code down, and whether I point at the model extracted from the jar or at the jar itself, I get the same result. It's definitely finding the file (I think), as it fails in other ways if the file doesn't exist. I even tried bypassing the Breeze nonStupidObjectInputStream, to no avail.

Any idea what's going on or how to test? For reference, my JVM is 1.7.0_51 and same in both scala and Spark environments.

Thanks.

I've seen this kind of problem a few times, and it's incredibly hard to
debug. It's usually a classloader problem, I think, and I'm unfortunately
not great at debugging those (you can guess my frustration level the last
time I debugged this, which is when I created nonstupidObjectInputStream...)

This is going to sound very hacky, but... could you try creating a new
class in epic's package explicitly before loading the model? Something as
simple as val x = new epic.features.BrownClusterFeature("foo")
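If forcing epic's classes to load that way doesn't help, another workaround worth trying is to deserialize through a stream that resolves classes against an explicitly chosen classloader. This is a hedged sketch in plain JVM Scala (no Spark or epic specifics; the object and method names here are illustrative, not epic's API):

```scala
import java.io._

// Sketch: deserialize with an explicitly chosen classloader. This is the
// usual remedy when the default ObjectInputStream resolves classes against
// the wrong loader, which can happen under Spark's remoting.
object LoaderAwareDeserializer {
  def deserialize[T](bytes: Array[Byte], loader: ClassLoader): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes)) {
      // Resolve each class against the supplied loader first, falling back
      // to the stream's default resolution if it isn't found there.
      override def resolveClass(desc: ObjectStreamClass): Class[_] =
        try Class.forName(desc.getName, false, loader)
        catch { case _: ClassNotFoundException => super.resolveClass(desc) }
    }
    try in.readObject().asInstanceOf[T]
    finally in.close()
  }

  // Convenience for testing: serialize a value, then read it back
  // through the loader-aware stream.
  def roundTrip[T](value: T, loader: ClassLoader): T = {
    val bos = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bos)
    out.writeObject(value)
    out.close()
    deserialize[T](bos.toByteArray, loader)
  }
}
```

In a Spark job, the loader to pass would typically be `Thread.currentThread.getContextClassLoader` or the classloader of one of your own classes.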

You might also appeal to the spark user list. I'm happy to help with it as
best I can, but it isn't Epic-specific (I think!) and they have a lot more
expertise dealing with serialization problems caused by remoting and
classloaders.

-- David


I tried your suggestion and was able to create a BrownClusterFeature object with no trouble, so it doesn't look like a classloader issue (as far as I can tell). It feels more like the kind of problem you get when serialising with one version and deserialising with another, although given that the file can be deserialised in raw Scala, it's almost as if something is happening to the file stream.
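For what it's worth, writer/reader version skew in Java serialization normally surfaces as an InvalidClassException (a serialVersionUID mismatch) rather than a ClassCastException, which is one hint this isn't simple version skew. A quick plain-JVM check of the descriptor machinery involved (no epic required; the object name is illustrative):

```scala
import java.io.ObjectStreamClass

// Sketch: look up the serialization descriptor of a concrete immutable Map
// class. Version skew between writer and reader would trip the
// serialVersionUID check in descriptors like this and raise
// InvalidClassException, not the ClassCastException seen above.
object SerialVersionCheck {
  def uidOf(cls: Class[_]): Long =
    ObjectStreamClass.lookup(cls).getSerialVersionUID

  def main(args: Array[String]): Unit =
    println(uidOf(Map("a" -> 1).getClass))
}
```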

I'll have a closer look at the Spark side to see if I can find similar issues there.

Thanks for the prompt response and for the library!

Is there maybe something going on with different scala versions? (Or, less
likely, Breeze versions?)


I'm compiling to 2.10.4 and my installed scala version matches that. However, there is a Breeze dependency at a different version - looks like nak pulls in an older version of breeze_natives:

'What depends on' Breeze 0.8:


[info] org.scalanlp:breeze_2.10:0.8 (evicted by: 0.9)
[info]   +-org.scalanlp:breeze-natives_2.10:0.8 [S]
[info]     +-org.scalanlp:nak_2.10:1.3 [S]
[info]       +-org.scalanlp:epic_2.10:0.2 [S]
[info]         +-my stuff
[info]         +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]         | +-my stuff
[info]         | 
[info]         +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]           +-my stuff

And same for Breeze 0.9:


[info] org.scalanlp:breeze_2.10:0.9 [S]
[info]   +-org.scalanlp:breeze-natives_2.10:0.8 [S]
[info]   | +-org.scalanlp:nak_2.10:1.3 [S]
[info]   |   +-org.scalanlp:epic_2.10:0.2 [S]
[info]   |     +-my stuff
[info]   |     +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]   |     | +-my stuff
[info]   |     | 
[info]   |     +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]   |       +-my stuff
[info]   |       
[info]   +-org.scalanlp:epic_2.10:0.2 [S]
[info]   | +-my stuff
[info]   | +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]   | | +-my stuff
[info]   | | 
[info]   | +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]   |   +-my stuff
[info]   |   
[info]   +-org.scalanlp:nak_2.10:1.3 [S]
[info]     +-org.scalanlp:epic_2.10:0.2 [S]
[info]       +-my stuff
[info]       +-org.scalanlp:epic-ner-en-conll_2.10:2014.10.26 [S]
[info]       | +-kafkareader:kafkareader_2.10:0.1 [S]
[info]       | 
[info]       +-org.scalanlp:epic-parser-en-span_2.10:2014.9.15 [S]
[info]         +-my stuff

No idea if that might cause problems?

nak is declared intransitive() so that shouldn't be a problem. (Seems like a bug in the dependency graph plugin...)

Hi there,

I was googling for a solution to a similar problem in a project I'm working on, and we found and fixed the cause (I'm not sure whether it fixes your particular problem).

We solved it by adding the missing classpath dependencies when creating the SparkContext (not only the direct ones):

  val sparkConf = new SparkConf().setJars(Seq(...)) // Add all transitive dependencies that Spark workers might need.
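If enumerating every transitive jar by hand is painful, one way to build that list is to scan the jars already on the driver's classpath. This is a sketch under the assumption that the driver JVM was launched with those jars on `java.class.path`; only the commented-out `setJars` line assumes Spark's API:

```scala
// Sketch: gather every jar on the current JVM's classpath, so the workers
// receive transitive dependencies too, not just the application jar.
object ClasspathJars {
  def jars(): Seq[String] =
    sys.props("java.class.path")
      .split(java.io.File.pathSeparator)
      .filter(_.endsWith(".jar"))
      .toSeq

  // In Spark: val conf = new SparkConf().setJars(ClasspathJars.jars())
}
```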

Hope this helps.

Regards!

@timcroydon Any chance you found a solution to this problem? Running into the same issue.

@acvogel the solution @JSantosP provided doesn't work?

I don't recall now, I'm afraid. For various unrelated reasons, we ended up using a different library for similar functionality so I don't think I ever got round to investigating this fully - sorry!

@reactormonk I haven't gotten it to work by that route, but perhaps I'm missing something. I assemble the project into a single jar, and also add dependent jars:

new SparkConf().setJars(Seq("/root/myBigJar.jar", "/root/epic-ner-en-conll_2.10-2015.1.25.jar", "/root/epic_2.10-0.3.jar"))

Perhaps I'm not following @JSantosP's suggestion correctly, as those jars should be included in myBigJar.jar anyway.

@timcroydon Thanks for your reply!

There's a jar from February that works, I believe. Can't fix atm.


I've been using the 2015.2.19 data files combined with the sources from https://github.com/dlwh/epic/tree/e0238ceb16fc9adb9511240638357e8c44200a2f. The files from February work, but I believe this tree is the last one that works. I covered some of it in #24 IIRC.

I don't know if this will solve your specific issue, but it is the latest version I believe will work. From there, maybe you could fix whatever CCE (ClassCastException) is holding back usage under Spark.

https://gist.github.com/briantopping/369fb337735c1b726337 is the complete dependency closure from the subproject I am using.

I had the same problem and the JSantosP solution worked for me. Thank you.


What is the final solution? I have the same problem: I build a single jar file and it works locally, but when I submit to Spark it throws java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.HashMap$SerializationProxy to field epic.features.BrownClusterFeaturizer.epic$features$BrownClusterFeaturizer$$clusterFeatures of type scala.collection.immutable.Map in instance of epic.features.BrownClusterFeaturizer

Can anyone help me? Thanks a lot.

@ltao80 I never got it to work and gave up. I'd be curious to hear from anyone else with a detailed solution.


@acvogel Thank you for your reply. I gave up too and switched to Stanford NLP.

I'm facing the same problem (see here [1]). I've tried @JSantosP suggestion and added several dependencies to the SparkConf.

val path = "/home/.../.../spark-fun/jars/"
val conf = new SparkConf().setAppName("wordCount").setJars(Seq(
  path + "epic_2.10-0.3.jar",
  path + "epic-ner-en-conll_2.10-2015.1.25.jar",
  path + "nak_2.10-1.3.jar",
  path + "scala-logging-api_2.10-2.1.2.jar",
  path + "scala-logging-slf4j_2.10-2.1.2.jar",
  path + "breeze_2.10-0.11-M0.jar",
  path + "spark-assembly-1.5.2-hadoop2.6.0.jar",
  path + "spark-fun-assembly-1.0.jar"
))

Do I need the path here? I also wonder why I should add these jars to the SparkConf at all. Using a fat jar that was generated with sbt assembly should be enough, right? The project dependency tree looks like [2]. Do I really need to add all of these dependencies to the SparkConf?

[1] https://github.com/Tooa/spark-fun
[2] https://gist.github.com/Tooa/a2d364d7d457c64dd68f