PySpark - Big Data Final Project

Hello! I am Annie, and this is my final project demonstrating the skills we learned in the Big Data course.

Author

Text Data Input

The input for this project is a plain-text file fetched by URL from Project Gutenberg. The data source is Spirits in Bondage by C.S. Lewis.

Tools & Languages Used

Language: Python Programming Language
Tools: Spark Processing Engine, PySpark API
Environment: Databricks Cloud Environment

Data Processing

Ingest Data into DBFS
Create an initial RDD using Spark Context
Flatmap to word tokens
Map to intermediate key, value pairs
Filter out stopwords
Reduce by Key to get your counts
Collect back into native Python for charting
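
The steps above can be tried locally, without a cluster, on a few lines of text. This is only a plain-Python sketch of the same flow, not the Spark code itself; the sample lines and the small stopword set are made up for illustration.

```python
import re
from collections import Counter
from itertools import chain

# Hypothetical sample lines and stopword set, for illustration only
lines = ["The night is dark", "the stars are cold and the night is long"]
stopwords = {"the", "is", "are", "and"}

# flatMap: flatten every line into one stream of tokens
tokens = chain.from_iterable(line.strip().split() for line in lines)
# clean: keep ASCII letters only and lowercase
cleaned = (re.sub(r"[^A-Za-z]", "", t).lower() for t in tokens)
# filter: drop empty strings and stopwords
kept = [w for w in cleaned if w and w not in stopwords]
# map + reduce by key: Counter adds 1 for each occurrence of a word
counts = Counter(kept)
# collect: take the most frequent words for charting
top = counts.most_common(2)
print(top)
```

Running a toy version like this first makes it easier to tell whether a surprising count comes from the data or from the pipeline.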

Code

Retrieving the data from a plain-text URL, storing it in the nod-new.txt file, and moving that file into DBFS.

import urllib.request

urllib.request.urlretrieve("https://www.gutenberg.org/files/2003/2003.txt", "/tmp/nod-new.txt")
dbutils.fs.mv("file:/tmp/nod-new.txt","dbfs:/data/nod-new.txt")

Creating the initial RDD using the Spark context.

nServers=6
# The second argument is minPartitions: a suggested minimum number of partitions
nodRDD=sc.textFile("dbfs:/data/nod-new.txt", nServers)

Flatmapping the data to word tokens.

# flatMap() each line to words
wordsRDD = nodRDD.flatMap(lambda line: line.strip().split(" "))
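
One detail worth seeing on a single string: `split(" ")` breaks on every individual space, so runs of spaces produce empty tokens, while `split()` with no argument breaks on any run of whitespace. A small plain-Python check, using a made-up line:

```python
# Hypothetical line with repeated spaces, for illustration only
line = "  Spirits   in  Bondage "

# split(" ") breaks on every single space, so runs of spaces yield empty strings
by_space = line.strip().split(" ")
# split() with no argument breaks on any run of whitespace and drops empties
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```

The empty tokens from `split(" ")` do no harm in this project, because empty strings are filtered out in a later step.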

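Cleaning and preprocessing the data: removing special characters, digits, and punctuation, then filtering out stopwords.

import re
from pyspark.ml.feature import StopWordsRemover

# Keep ASCII letters only and lowercase, so tokens match the lowercase stopword list
cleanRDD = wordsRDD.map(lambda w1: re.sub(r'[^A-Za-z]', '', w1).lower())
remover = StopWordsRemover()
# Default English stopwords, held in a set for fast membership tests
stopwords = set(remover.getStopWords())
newRDD = cleanRDD.filter(lambda word: word not in stopwords)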

Final RDD, with empty strings removed:

finalRDD = newRDD.filter(lambda x: x != "")
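
To see what the cleaning regular expression and the stopword test do to individual tokens, here is a small plain-Python check. The `clean` helper and the three-word stopword set are hypothetical stand-ins; the real list comes from `StopWordsRemover.getStopWords()`.

```python
import re

# Hypothetical stand-in for the list returned by StopWordsRemover.getStopWords()
stopwords = {"the", "and", "of"}

def clean(token):
    """Strip every character that is not an ASCII letter."""
    return re.sub(r"[^A-Za-z]", "", token)

raw = ["The", "night,", "don't", "1919", "and"]
cleaned = [clean(t) for t in raw]
print(cleaned)

# Case-sensitive test: "The" survives because only "the" is in the set
case_sensitive = [w for w in cleaned if w and w not in stopwords]
# Comparing the lowercased word removes it as well
case_insensitive = [w for w in cleaned if w and w.lower() not in stopwords]
print(case_sensitive)
print(case_insensitive)
```

Tokens made only of digits or punctuation become empty strings, which is why the empty-string filter is needed. The stopword comparison is case-sensitive, so capitalized stopwords are only removed if the words are lowercased before the test.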

Mapping the words to intermediate key-value pairs.

# map() to intermediate key-value pairs (word, 1) 

IKVPairsRDD = finalRDD.map(lambda word: (word,1))

Reducing by key to get the word counts and displaying all the values.

# reduceByKey() to word counts (word, sum)
resultsRDD = IKVPairsRDD.reduceByKey(lambda acc, value: acc+value)
# RDDs are lazy: this prints the RDD object's description, not its contents
print("Reduced key values: ", resultsRDD)
# collect() back into a native Python list
results = resultsRDD.collect()
print("Collections: ", results)
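
As a rough mental model of what reducing by key computes, the same result can be produced in plain Python by grouping the values for each key and folding them with the same lambda. This is a conceptual sketch with made-up pairs; Spark itself also combines partial sums within each partition before moving data between nodes.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (word, 1) pairs, for illustration only
pairs = [("night", 1), ("dark", 1), ("night", 1), ("stars", 1), ("night", 1)]

# Group the values that share a key
grouped = defaultdict(list)
for word, one in pairs:
    grouped[word].append(one)

# Fold each group's values with the same lambda passed to reduceByKey
results = {word: reduce(lambda acc, value: acc + value, ones)
           for word, ones in grouped.items()}
print(results)
```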

Displaying only the 10 most frequent words.

# swap to (count, word), sort by count in descending order, take the top 10

results10 = resultsRDD.map(lambda x: (x[1], x[0])).sortByKey(False).take(10)
print(results10)
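
The swap-and-sort step can be checked on a small list in plain Python. The pairs below are made up. The second form selects by count without swapping, which keeps the tuples in (word, count) order; on an RDD, `takeOrdered` with a `key` function should be a comparable alternative, though it is not what this notebook uses.

```python
import heapq

# Hypothetical (word, count) pairs standing in for the collected results
results = [("night", 3), ("dark", 1), ("stars", 2), ("cold", 5)]

# Same idea as map(swap).sortByKey(False).take(n): order by count, descending
swapped_top = sorted(((c, w) for w, c in results), reverse=True)[:2]
print(swapped_top)

# Selecting by count without swapping keeps the (word, count) order
top = heapq.nlargest(2, results, key=lambda pair: pair[1])
print(top)
```

The swapped form yields (count, word) tuples, which matters later when naming the DataFrame columns for the chart.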

Charting the values using pandas, matplotlib, and seaborn for color.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

source = 'Project Gutenberg: Spirits in Bondage, by C. S. Lewis (AKA Clive Hamilton)'

title = 'Top Words in ' + source
xlabel = 'Words'
ylabel = 'Count'

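# results10 holds (count, word) tuples, so name the columns in that order
df = pd.DataFrame.from_records(results10, columns=[ylabel, xlabel])
plt.figure(figsize=(20,4))
# Pass x and y by keyword; recent seaborn releases do not accept them positionally
sns.barplot(x=xlabel, y=ylabel, data=df, palette="YlOrBr").set_title(title)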

Link to the Databricks Notebook

Chart for 10 values
