In this repository will demostrate the ability of Spark with Python programming language of Multi-Class Text Classification with PySpark.
df = spark.read.csv('/home/renos/Downloads/sf-crime/train.csv', header=True, inferSchema=True)
print(df.columns)
print(len(df.columns))
['Dates', 'Category', 'Descript', 'DayOfWeek', 'PdDistrict', 'Resolution', 'Address', 'X', 'Y']
9
The model needs an Input (Descript) and the Output (Category)
data = df[['Category', 'Descript']]
data.printSchema()
data = df[['Category', 'Descript']]
data.printSchema()
data.show(5)
root
|-- Category: string (nullable = true)
|-- Descript: string (nullable = true)
data.groupBy("Category") \
.count() \
.orderBy(col("count").desc()) \
.show()
data.groupBy("Descript") \
.count() \
.orderBy(col("count").desc()) \
.show()
+--------------------+------+
| Category| count|
+--------------------+------+
| LARCENY/THEFT|174900|
| OTHER OFFENSES|126182|
| NON-CRIMINAL| 92304|
| ASSAULT| 76876|
| DRUG/NARCOTIC| 53971|
| VEHICLE THEFT| 53781|
| VANDALISM| 44725|
| WARRANTS| 42214|
| BURGLARY| 36755|
| SUSPICIOUS OCC| 31414|
| MISSING PERSON| 25989|
| ROBBERY| 23000|
| FRAUD| 16679|
|FORGERY/COUNTERFE...| 10609|
| SECONDARY CODES| 9985|
| WEAPON LAWS| 8555|
| PROSTITUTION| 7484|
| TRESPASS| 7326|
| STOLEN PROPERTY| 4540|
|SEX OFFENSES FORC...| 4388|
+--------------------+------+
+--------------------+-----+
| Descript|count|
+--------------------+-----+
|GRAND THEFT FROM ...|60022|
| LOST PROPERTY|31729|
| BATTERY|27441|
| STOLEN AUTOMOBILE|26897|
|DRIVERS LICENSE, ...|26839|
| WARRANT ARREST|23754|
|SUSPICIOUS OCCURR...|21891|
|AIDED CASE, MENTA...|21497|
|PETTY THEFT FROM ...|19771|
|MALICIOUS MISCHIE...|17789|
| TRAFFIC VIOLATION|16471|
|PETTY THEFT OF PR...|16196|
|MALICIOUS MISCHIE...|15957|
|THREATS AGAINST LIFE|14716|
| FOUND PROPERTY|12146|
|ENROUTE TO OUTSID...|11470|
|GRAND THEFT OF PR...|11010|
|POSSESSION OF NAR...|10050|
|PETTY THEFT FROM ...|10029|
|PETTY THEFT SHOPL...| 9571|
+--------------------+-----+
Our pipeline will follow the followed structure.
- Tokenization
- Remove Stop Words
- Count vectors
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import RegexTokenizer, StopWordsRemover, CountVectorizer
regexTokenizer = RegexTokenizer(inputCol="Descript", outputCol="words", pattern="\\W")
add_stopwords = ["http", "https", "amp", "rt", "t", "c", "the", "is", "a", "an", "and"]
stopwordsRemover = StopWordsRemover(inputCol="words", outputCol="filtered").setStopWords(add_stopwords)
countVectors = CountVectorizer(inputCol="filtered", outputCol="features", vocabSize=10000, minDF=5)
label_string_Index = StringIndexer(inputCol="Category", outputCol="label")
pipeline = Pipeline(stages=[regexTokenizer, stopwordsRemover, countVectors, label_string_Index])
# Fit the pipeline to training data.
pipeline_fit = pipeline.fit(data)
dataset = pipeline_fit.transform(data)
dataset.show()
The data has to be splitted into 80/20 percent. That means 80% of the whole data are into train data part and the rest, namely, 20% for the testing part.
(trainingData, testData) = dataset.randomSplit([0.8, 0.2], seed=100)
print(f"Training Dataset length: {trainingData.count()} texts")
print(f"Test Dataset length: {testData.count()} texts")
Training Dataset length: 702085 texts
Test Dataset length: 175964 texts
from pyspark.ml.classification import LogisticRegression
trainingData, testData = self.split_data_for_training()
lRegression = LogisticRegression(maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lRegression.fit(trainingData)
predictionsData = lrModel.transform(testData)
predictionsData.filter(predictionsData['prediction'] == 0) \
.select("Descript", "Category", "probability", "label", "prediction") \
.orderBy("probability", ascending=False) \
.show(n=20, truncate=30)
From the array can been shown the label and the prediction class. Can see the label of the text and the prediction of the model of the given text.
+------------------------------+-------------+------------------------------+-----+----------+
| Descript| Category| probability|label|prediction|
+------------------------------+-------------+------------------------------+-----+----------+
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...| 0.0| 0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...| 0.0| 0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...| 0.0| 0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...| 0.0| 0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...| 0.0| 0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...| 0.0| 0.0|
|THEFT, BICYCLE, <$50, NO SE...|LARCENY/THEFT|[0.8746706663597197,0.01980...| 0.0| 0.0|
|THEFT, BICYCLE, <$50, SERIA...|LARCENY/THEFT|[0.8746702602899782,0.01980...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
| PETTY THEFT FROM LOCKED AUTO|LARCENY/THEFT|[0.8666860596592011,0.01853...| 0.0| 0.0|
+------------------------------+-------------+------------------------------+-----+----------+
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
predictions = self.model()
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction")
print(f"The Accuracy is: {format(evaluator.evaluate(predictions)*100, '.2f')} %")
The Accuracy is: 97.22 %