CoreNLP 3.8 fails in Apache Spark
maziyarpanahi opened this issue
Hi,
I can use CoreNLP 3.6 and 3.7 simply by passing these jars to my Spark app (Spark 1.6 and 2.2):
spark-shell --master yarn --deploy-mode client --queue multivac --driver-cores 5 --driver-memory 8g --executor-cores 5 --executor-memory 4g --num-executors 30 --jars /home/jars/stanford-corenlp-3.7.0/ejml-0.23.jar,/home/jars/stanford-corenlp-3.7.0/stanford-corenlp-3.7.0.jar,/home/jars/stanford-corenlp-3.7.0/stanford-corenlp-3.7.0-models.jar,/home/jars/stanford-corenlp-3.7.0/protobuf.jar,/home/jars/stanford-corenlp-3.7.0/jollyday.jar
But if I try the same set of jars from CoreNLP 3.8, it always fails with this error:
scala> import edu.stanford.nlp.simple._
scala> new Sentence(document).words()
java.lang.VerifyError: Bad type on operand stack
Exception Details:
Location:
com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
Reason:
Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
Current Frame:
bci: @3
flags: { }
locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
Bytecode:
0x0000000: 2a2b 1cb6 0024 b0
at edu.stanford.nlp.simple.Document.<init>(Document.java:433)
at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:118)
at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:126)
... 56 elided
Any help is appreciated,
Cheers,
Maziyar
I have a theory which may be incorrect.
Look at this page on Maven Central:
https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-parent%7C1.2.2%7Cpom
You'll notice that this project relies on protobuf 2.4.1. Stanford CoreNLP uses protobuf 3.2.0.
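I think the mismatch is causing this problem. That would fit the VerifyError above: GeneratedMessageV3 only exists in protobuf 3.x, while the GeneratedMessage$GeneratedExtension it is checked against appears to come from an older protobuf jar. The bad news is I'm not sure how to resolve this. This page also claims Spark doesn't directly use protobuf, so you could look at your Spark installation and the jars it uses and see if you can manually upgrade to the protobuf 3.2.0 jar.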
Even more evidence of the protobuf conflict:
My advice is to find where Spark keeps its jar dependencies, manually swap the protobuf jar for 3.2.0, and see if that fixes things.
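As a starting point for locating those jars, here is a minimal sketch. It assumes that SPARK_HOME points at the Spark installation and that protobuf ships there as a standalone protobuf-java-*.jar file. The helper name find_protobuf_jars is made up for illustration. On builds that bundle all dependencies into a single assembly jar, this search may find nothing, and on a YARN cluster the Hadoop directories may be worth searching too.

```shell
#!/usr/bin/env bash

# Hypothetical helper: list every protobuf-java jar under a directory tree.
find_protobuf_jars() {
  local root="$1"
  # Match on the file name only; the version is part of the name,
  # e.g. protobuf-java-2.5.0.jar.
  find "$root" -type f -name 'protobuf-java*.jar' 2>/dev/null | sort
}

# Search the Spark installation if SPARK_HOME is set (an assumption
# about the environment; adjust the path for your cluster).
if [ -n "${SPARK_HOME:-}" ]; then
  find_protobuf_jars "$SPARK_HOME"
fi
```

If more than one version shows up, or only a 2.x jar does, that would be consistent with the conflict described above.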
Hi @J38
Good catch! You are right: Spark pulls in the older protobuf-java 2.5.0. When I remove the existing protobuf-java-2.5.0.jar and replace it with the latest version (link below), it works without any problem.
http://central.maven.org/maven2/com/google/protobuf/protobuf-java/3.4.0/protobuf-java-3.4.0.jar
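For anyone repeating the swap, a rough sketch of the manual replacement follows. The helper name replace_jar and the example paths are hypothetical, and the sketch assumes the old jar sits in a directory that Spark loads jars from. It renames the old jar instead of deleting it so the change can be rolled back, and it would need to be repeated on every node that has its own copy.

```shell
#!/usr/bin/env bash

# Hypothetical helper: back up an old jar and place a replacement next to it.
replace_jar() {
  local old_jar="$1" new_jar="$2"
  [ -f "$old_jar" ] || { echo "not found: $old_jar" >&2; return 1; }
  [ -f "$new_jar" ] || { echo "not found: $new_jar" >&2; return 1; }
  # Rename rather than delete, so the original can be restored later.
  mv "$old_jar" "$old_jar.bak"
  # Copy the replacement into the directory the old jar lived in.
  cp "$new_jar" "$(dirname "$old_jar")/"
}

# Example call (paths are placeholders, not taken from a real cluster):
# replace_jar "$SPARK_HOME/jars/protobuf-java-2.5.0.jar" \
#             "/home/jars/protobuf-java-3.4.0.jar"
```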
I opened an issue asking whether protobuf can be bumped to a newer version:
https://issues.apache.org/jira/browse/SPARK-22380
Many thanks, @J38, for the catch. For now I am going to use 3.4 manually instead of 2.5, and I hope the next Spark release ships a newer version of this dependency.
Cheers,
Maziyar
I believe Hadoop itself has the dependency on protobuf. This is going to be fixed in Hadoop 3.0. See https://issues.apache.org/jira/browse/HADOOP-11804