DEK11 / MoreNLP

Capabilities of StanfordNLP and OpenNLP on Spark

Some(NLP)

This project is for me to learn some basic NLP concepts and to integrate them with Spark. There are two types of pipelines in the project: the NLP pipeline and the ML pipeline.

In this project, though, I mainly wanted to work on the NLP pipeline and on how to customize it. As I was new to both Stanford NLP and OpenNLP, most of my time went into understanding these libraries. I have built two NLP pipelines: one in which the two libraries can be called interchangeably, and one that is the standard Stanford CoreNLP pipeline. The advantage of sticking to a single library is that I needn't convert the results into primitive data types after every step.
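To make this concrete, here is a minimal sketch of the second kind of pipeline, the standard Stanford CoreNLP pipeline, using the classic `StanfordCoreNLP`/`Annotation` API. The annotator list and the sample sentence are illustrative choices, not taken verbatim from this repository.

```scala
import java.util.Properties
import scala.collection.JavaConverters._
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.ling.CoreAnnotations

// Standard CoreNLP pipeline: tokenize -> sentence split -> POS -> lemma -> NER.
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner")
val pipeline = new StanfordCoreNLP(props)

// Annotate one review and read lemmas and entity tags off the same Annotation object,
// without converting to primitive types between steps.
val doc = new Annotation("The movie was surprisingly good, and it was filmed in New Zealand.")
pipeline.annotate(doc)

for (sentence <- doc.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala;
     token    <- sentence.get(classOf[CoreAnnotations.TokensAnnotation]).asScala) {
  val word  = token.get(classOf[CoreAnnotations.TextAnnotation])
  val lemma = token.get(classOf[CoreAnnotations.LemmaAnnotation])
  val ner   = token.get(classOf[CoreAnnotations.NamedEntityTagAnnotation])
  println(s"$word\t$lemma\t$ner")
}
```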

Things to note:

I am only using pre-trained models. The data is a Kaggle IMDB dataset, 34 MB in size. Feeding this dataset to a random forest with 10 trees, I get an accuracy of 74 percent.
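For orientation, here is a sketch of what the ML-pipeline side can look like in Spark ML with a random forest of 10 trees. The file name, column names, and feature hashing step are assumptions for illustration, not the exact code used in this repository.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ImdbSentiment").getOrCreate()

// Assumed schema: a "review" text column and a 0/1 "sentiment" column.
val data = spark.read.option("header", "true").csv("imdb_reviews.csv")
  .selectExpr("review", "cast(sentiment as double) as label")
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

// Tokenize -> hash into term-frequency vectors -> random forest with 10 trees.
val tokenizer = new Tokenizer().setInputCol("review").setOutputCol("words")
val tf        = new HashingTF().setInputCol("words").setOutputCol("features")
val rf        = new RandomForestClassifier()
  .setNumTrees(10)
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(tokenizer, tf, rf)).fit(train)

val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(model.transform(test))
println(s"Test accuracy: $accuracy")
```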

Features implemented with StanfordNLP and OpenNLP (an OpenNLP usage sketch follows this list):

- Tokenizers
- Stop-word removers: using a custom-built list
- Stemmers: Snowball from OpenNLP and Porter from Stanford
- Sentence detection
- Parts-of-speech tagging
- Parsers
- Lemmatizers: OpenNLP does not ship with a dictionary, so the Elasticsearch dictionary is used
- Named entity recognizer: for OpenNLP, only the location model is implemented
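Below is a minimal sketch of the OpenNLP side of the interchangeable pipeline, assuming OpenNLP 1.8+ (which bundles the Snowball stemmer) and the standard pre-trained model files (`en-sent.bin`, `en-token.bin`, `en-ner-location.bin`) available locally; the paths and sample text are placeholders. It chains sentence detection, tokenization, Snowball stemming, and the location-only name finder mentioned above.

```scala
import java.io.FileInputStream
import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
import opennlp.tools.sentdetect.{SentenceDetectorME, SentenceModel}
import opennlp.tools.stemmer.snowball.SnowballStemmer
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}

// Load the pre-trained OpenNLP models (file paths are placeholders).
val sentenceDetector = new SentenceDetectorME(new SentenceModel(new FileInputStream("en-sent.bin")))
val tokenizer        = new TokenizerME(new TokenizerModel(new FileInputStream("en-token.bin")))
val locationFinder   = new NameFinderME(new TokenNameFinderModel(new FileInputStream("en-ner-location.bin")))
val stemmer          = new SnowballStemmer(SnowballStemmer.ALGORITHM.ENGLISH)

val review = "I watched this film in Paris. The acting was wonderful."

for (sentence <- sentenceDetector.sentDetect(review)) {
  val tokens = tokenizer.tokenize(sentence)

  // Snowball stemming of each token.
  val stems = tokens.map(t => stemmer.stem(t.toLowerCase).toString)
  println(stems.mkString(" "))

  // Location-only NER: each Span indexes into the token array.
  locationFinder.find(tokens).foreach { span =>
    println("LOCATION: " + tokens.slice(span.getStart, span.getEnd).mkString(" "))
  }
}
locationFinder.clearAdaptiveData()
```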


License: Apache License 2.0


Languages

Scala 59.7%, Java 40.3%