stanfordnlp / CoreNLP

CoreNLP: A Java suite of core NLP tools for tokenization, sentence segmentation, NER, parsing, coreference, sentiment analysis, etc.

Home Page: http://stanfordnlp.github.io/CoreNLP/


CoreNLP 3.8 fails in Apache Spark

maziyarpanahi opened this issue

Hi,

I can use CoreNLP 3.6 and 3.7 simply by passing these jars to my Spark app (Spark 1.6 and 2.2):

spark-shell --master yarn --deploy-mode client --queue multivac --driver-cores 5 --driver-memory 8g --executor-cores 5 --executor-memory 4g --num-executors 30 --jars /home/jars/stanford-corenlp-3.7.0/ejml-0.23.jar,/home/jars/stanford-corenlp-3.7.0/stanford-corenlp-3.7.0.jar,/home/jars/stanford-corenlp-3.7.0/stanford-corenlp-3.7.0-models.jar,/home/jars/stanford-corenlp-3.7.0/protobuf.jar,/home/jars/stanford-corenlp-3.7.0/jollyday.jar

But if I try the same set of jars from CoreNLP 3.8, it always fails with this error:

scala> import edu.stanford.nlp.simple._
scala> val document = "Stanford parses sentences." // any input string
scala> new Sentence(document).words()

java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    com/google/protobuf/GeneratedMessageV3$ExtendableMessage.getExtension(Lcom/google/protobuf/GeneratedMessage$GeneratedExtension;I)Ljava/lang/Object; @3: invokevirtual
  Reason:
    Type 'com/google/protobuf/GeneratedMessage$GeneratedExtension' (current frame, stack[1]) is not assignable to 'com/google/protobuf/ExtensionLite'
  Current Frame:
    bci: @3
    flags: { }
    locals: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
    stack: { 'com/google/protobuf/GeneratedMessageV3$ExtendableMessage', 'com/google/protobuf/GeneratedMessage$GeneratedExtension', integer }
  Bytecode:
    0x0000000: 2a2b 1cb6 0024 b0

  at edu.stanford.nlp.simple.Document.<init>(Document.java:433)
  at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:118)
  at edu.stanford.nlp.simple.Sentence.<init>(Sentence.java:126)
  ... 56 elided

Any help is appreciated,

Cheers,
Maziyar

commented

I have a theory which may be incorrect.

Look at this page on Maven Central:

https://search.maven.org/#artifactdetails%7Corg.apache.spark%7Cspark-parent%7C1.2.2%7Cpom

You'll notice that this project relies on protobuf 2.4.1. Stanford CoreNLP uses protobuf 3.2.0.

I think the mismatch is causing this problem. The bad news is I'm not sure how to resolve it. That page also suggests that Spark doesn't use protobuf directly, so you could look at your Spark installation and the jars it uses, and see if you can manually upgrade to the protobuf 3.2.0 jar.
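For example, you can ask the running shell which jar it actually loads the protobuf classes from (a quick diagnostic sketch; getCodeSource can return null for JDK bootstrap classes):

scala> classOf[com.google.protobuf.Message].getProtectionDomain.getCodeSource.getLocation

If that prints a path ending in protobuf-java-2.5.0.jar, the old version is shadowing the 3.2.0 classes that CoreNLP 3.8 was compiled against.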

commented

Even more evidence of the protobuf conflict:

https://github.com/apache/spark/blob/master/pom.xml

commented

My advice is to figure out where Spark's jar dependencies live, manually change the protobuf dependency to 3.2.0, and see if that fixes things.
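Something like this should locate the candidate jars (a sketch; the layout is an assumption: Spark 2.x keeps its jars under $SPARK_HOME/jars, Spark 1.6 bundles them into an assembly jar under lib/, and on YARN the Hadoop classpath matters too):

find "$SPARK_HOME" "$HADOOP_HOME" -name 'protobuf-java-*.jar' 2>/dev/null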

Hi @J38,
Good catch! You are right: Spark pulls in the older protobuf-java 2.5.0, and when I remove the current protobuf-java-2.5.0.jar and replace it with the latest version (below), it works without any problem.
http://central.maven.org/maven2/com/google/protobuf/protobuf-java/3.4.0/protobuf-java-3.4.0.jar
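For reference, the manual swap came down to something like this (a sketch; the $SPARK_HOME/jars location assumes the Spark 2.x layout, and I kept a backup of the old jar instead of deleting it):

cd "$SPARK_HOME/jars"
mv protobuf-java-2.5.0.jar protobuf-java-2.5.0.jar.bak
wget http://central.maven.org/maven2/com/google/protobuf/protobuf-java/3.4.0/protobuf-java-3.4.0.jar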

I opened an issue to see if it is possible to bump protobuf to a newer version:
https://issues.apache.org/jira/browse/SPARK-22380

Many thanks @J38 for the catch. I am going to use 3.4 instead of 2.5 manually, and I hope the next Spark release ships the newer version of this dependency.

Cheers,
Maziyar

I believe that Hadoop itself has the dependency on protobuf. This is going to be fixed in Hadoop 3.0. See https://issues.apache.org/jira/browse/HADOOP-11804