Spark-nlp-Pyspark

This project is focused on building a Spark ML Pipeline with PySpark to perform natural language processing on a dataset. The pipeline uses the following annotators:

- DocumentAssembler
- Tokenizer
- WordEmbeddingsModel
- PerceptronModel
- NerCrfModel

Getting Started

To get started with the project, you will need Spark and PySpark installed on your machine. You will also need to import the necessary libraries, including the pretrained models for English.

Prerequisites

- Apache Spark
- PySpark
- Spark NLP (with its pretrained English models)

Installing

To install Spark and PySpark, please follow the instructions provided on their respective websites. To install the Spark NLP library, you can use the following command in your PySpark project:

!pip install spark-nlp

Running the Application

The application is run by executing the script file containing the pipeline. The pipeline reads the input dataset and prints the transformed DataFrame, showing only the POS and NER columns; as a bonus, it shows only the result attribute of these annotations. The result attributes of the NER and POS annotations are then collected, and the relationship between the entities found and their part-of-speech attributes is analyzed and explained.
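The final analysis step can be sketched in plain Python. Assuming the collected `result` arrays of the `pos` and `ner` columns line up token by token (the sample token, tag, and IOB values below are hypothetical, not taken from the project's dataset), pairing them shows which POS tags the recognized entities carry:

```python
from collections import Counter

# Hypothetical collected output: one POS tag and one IOB NER tag per token,
# mimicking the pipeline's pos.result and ner.result arrays.
tokens = ["John",  "works", "at", "Google", "in", "London", "."]
pos    = ["NNP",   "VBZ",   "IN", "NNP",    "IN", "NNP",    "."]
ner    = ["B-PER", "O",     "O",  "B-ORG",  "O",  "B-LOC",  "O"]

# Count which POS tags occur on tokens that are part of an entity.
entity_pos = Counter(p for p, n in zip(pos, ner) if n != "O")
print(entity_pos)  # -> Counter({'NNP': 3})
```

In this toy sample every detected entity is tagged NNP (proper noun), which is the kind of entity/POS relationship the analysis step looks for.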

About

This project is a Spark ML pipeline built with PySpark for NLP, using the annotators DocumentAssembler, Tokenizer, WordEmbeddingsModel, PerceptronModel, and NerCrfModel. It prints a transformed DataFrame showing the POS and NER columns and analyzes the relationship between the entities found and their POS attributes. The project provides hands-on experience with Spark, PySpark, and Spark NLP.


Languages

Language: Jupyter Notebook 100.0%