BardisRenos / PySpark-ML

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

PySpark-ML

In this repository will demostrate the ability of Spark with Python programming language of Multi-Class Text Classification with PySpark.

Data structure

df = spark.read.csv('/home/renos/Downloads/sf-crime/train.csv', header=True, inferSchema=True)
print(df.columns)
print(len(df.columns))
['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
9

Train data set

The model needs an Input (Descript) and the Output (Category)

data = df[['Category', 'Descript']]
data.printSchema()
data = df[['Category', 'Descript']]
data.printSchema()
data.show(5)
root
 |-- Category: string (nullable = true)
 |-- Descript: string (nullable = true)

Showing the structure of the data

data.groupBy("Category") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

data.groupBy("Descript") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()
+--------------------+------+
|            Category| count|
+--------------------+------+
|       LARCENY/THEFT|174900|
|      OTHER OFFENSES|126182|
|        NON-CRIMINAL| 92304|
|             ASSAULT| 76876|
|       DRUG/NARCOTIC| 53971|
|       VEHICLE THEFT| 53781|
|           VANDALISM| 44725|
|            WARRANTS| 42214|
|            BURGLARY| 36755|
|      SUSPICIOUS OCC| 31414|
|      MISSING PERSON| 25989|
|             ROBBERY| 23000|
|               FRAUD| 16679|
|FORGERY/COUNTERFE...| 10609|
|     SECONDARY CODES|  9985|
|         WEAPON LAWS|  8555|
|        PROSTITUTION|  7484|
|            TRESPASS|  7326|
|     STOLEN PROPERTY|  4540|
|SEX OFFENSES FORC...|  4388|
+--------------------+------+

+--------------------+-----+
|            Descript|count|
+--------------------+-----+
|GRAND THEFT FROM ...|60022|
|       LOST PROPERTY|31729|
|             BATTERY|27441|
|   STOLEN AUTOMOBILE|26897|
|DRIVERS LICENSE, ...|26839|
|      WARRANT ARREST|23754|
|SUSPICIOUS OCCURR...|21891|
|AIDED CASE, MENTA...|21497|
|PETTY THEFT FROM ...|19771|
|MALICIOUS MISCHIE...|17789|
|   TRAFFIC VIOLATION|16471|
|PETTY THEFT OF PR...|16196|
|MALICIOUS MISCHIE...|15957|
|THREATS AGAINST LIFE|14716|
|      FOUND PROPERTY|12146|
|ENROUTE TO OUTSID...|11470|
|GRAND THEFT OF PR...|11010|
|POSSESSION OF NAR...|10050|
|PETTY THEFT FROM ...|10029|
|PETTY THEFT SHOPL...| 9571|
+--------------------+-----+

Creating the Model Pipeline

Our pipeline will follow the followed structure.

  • Tokenization
  • Remove Stop Words
  • Count vectors
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer

regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words", pattern="\\W")
add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the", "is", "a", "an", "and"]
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)

countVectors = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=10000, minDF=5)
label_string_Index = StringIndexer(inputCol="Category", outputCol="label")
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_string_Index])

# Fit the pipeline to training data.
pipeline_fit = pipeline.fit(data)
dataset = pipeline_fit.transform(data)
dataset.show()

Splitting the data into train & test sets

The data has to be splitted into 80/20 percent. That means 80% of the whole data are into train data part and the rest, namely, 20% for the testing part.

(trainingData, testData) = dataset.randomSplit([0.8, 0.2], seed=100)
print(f"Training Dataset length: {trainingData.count()} texts")
print(f"Test Dataset length: {testData.count()} texts")
Training Dataset length: 702085 texts
Test Dataset length: 175964 texts

Model of classification

from pyspark.ml.classification import LogisticRegression

trainingData, testData = self.split_data_for_training()

lRegression = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lRegression.fit(trainingData)

predictionsData = lrModel.transform(testData)
predictionsData.filter(predictionsData['prediction'] == 0) \
    .select("Descript", "Category", "probability", "label", "prediction") \
    .orderBy("probability", ascending=False) \
    .show(n=20, truncate=30)

Evaluation

From the array can been shown the label and the prediction class. Can see the label of the text and the prediction of the model of the given text.

+------------------------------+-------------+------------------------------+-----+----------+
|                      Descript|     Category|                   probability|label|prediction|
+------------------------------+-------------+------------------------------+-----+----------+
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...|  0.0|       0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...|  0.0|       0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...|  0.0|       0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...|  0.0|       0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...|  0.0|       0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...|  0.0|       0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...|  0.0|       0.0|
|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.8746702602899782,0.01980...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
|  PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...|  0.0|       0.0|
+------------------------------+-------------+------------------------------+-----+----------+

Accuracy

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = self.model()
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
print(f"The Accuracy is:  {format(evaluator.evaluate(predictions)*100, '.2f')} %")
The Accuracy is:  97.22 %

About


Languages

Language:Python 100.0%