Hello! I am Annie, and this is my final project, demonstrating the skills we have learnt in the Big Data course.
I used a text file as input for this project and fetched the data via URL. The data source is: Spirits in Bondage by C.S. Lewis.
- Language: Python Programming Language
- Tools: Spark Processing Engine, PySpark API
- Environment: Databricks Cloud Environment
- Ingest Data into DBFS
- Create an initial RDD using Spark Context
- Flatmap to word tokens
- Map to intermediate key, value pairs
- Filter out stopwords
- Reduce by Key to get your counts
- Collect back into native Python for charting
- Retrieving the data from the plain-text URL and storing it in the nod-new.txt file.
import urllib.request
urllib.request.urlretrieve("https://www.gutenberg.org/files/2003/2003.txt", "/tmp/nod-new.txt")
dbutils.fs.mv("file:/tmp/nod-new.txt","dbfs:/data/nod-new.txt")
- Creating initial RDD using Spark.
# the second argument to textFile() is the minimum number of partitions
nServers=6
nodRDD=sc.textFile("dbfs:/data/nod-new.txt", nServers)
- Flatmapping the data to word tokens.
# flatMap() each line to words
wordsRDD = nodRDD.flatMap(lambda line: line.strip().split(" "))
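To see what the flatMap() step does without a cluster, the sketch below mimics it in plain Python on two made-up lines. The sample lines and the helper name `flat_map` are assumptions for illustration and are not part of the project code.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to each item and flatten the resulting lists into one list."""
    return list(chain.from_iterable(func(item) for item in items))

# two hypothetical lines standing in for the text file
lines = ["  the night is dark ", "the stars  are cold"]

# same lambda as the Spark step: strip the line, then split on single spaces
words = flat_map(lambda line: line.strip().split(" "), lines)
print(words)
# ['the', 'night', 'is', 'dark', 'the', 'stars', '', 'are', 'cold']
```

Note the empty string produced by the double space in the second line: splitting on a single space leaves such tokens behind, so they have to be filtered out at some point.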
- Cleaning and preprocessing the data: removing special characters, spaces, and punctuation, then filtering out stopwords.
import re
from pyspark.ml.feature import StopWordsRemover
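# strip non-letter characters and lowercase, so words match the lowercase stopword list
cleanRDD = wordsRDD.map(lambda w1: re.sub(r'[^A-Za-z]', '', w1).lower())
remover = StopWordsRemover()
# a set gives fast membership checks inside the filter
stopwords = set(remover.getStopWords())
newRDD = cleanRDD.filter(lambda word: word not in stopwords)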
Final RDD, with empty strings removed:
finalRDD = newRDD.filter(lambda x: x != "")
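The cleaning and filtering logic can be tried out in plain Python as a rough sketch. The tiny `stopwords` set and the sample tokens are hypothetical; the project itself takes its list from StopWordsRemover. Membership tests against a stopword list are case-sensitive, so whether tokens are lowercased first changes the result.

```python
import re

# hypothetical mini stopword list (lowercase, like typical stopword lists)
stopwords = {"the", "is", "are"}

def clean(token):
    """Drop every character that is not an ASCII letter."""
    return re.sub(r'[^A-Za-z]', '', token)

# made-up tokens, including punctuation, digits, and an empty string
tokens = ["The", "night,", "is", "dark;", "", "1919", "stars"]
cleaned = [clean(t) for t in tokens]

# without lowercasing, the capitalized "The" slips past the stopword check
kept = [w for w in cleaned if w != "" and w not in stopwords]
print(kept)        # ['The', 'night', 'dark', 'stars']

# lowercasing first lets it match the stopword list
kept_lower = [w.lower() for w in cleaned if w != "" and w.lower() not in stopwords]
print(kept_lower)  # ['night', 'dark', 'stars']
```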
- Mapping the values to Intermediate Key-value pairs.
# map() to intermediate key-value pairs (word, 1)
IKVPairsRDD = finalRDD.map(lambda word: (word,1))
- Reducing by key to get the word counts and displaying all the values.
# reduceByKey() to get (word, count) pairs
resultsRDD = IKVPairsRDD.reduceByKey(lambda acc, value: acc+value)
# printing the RDD itself shows only its description, not the counts
print("Reduced key values: ", resultsRDD)
# collect() back into Python DS
results = resultsRDD.collect()
print("Collections: ", results)
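As a way to reason about the map and reduceByKey steps, here is a minimal plain-Python analogue. The helper `reduce_by_key` and the sample words are assumptions made for this sketch; Spark performs the same grouping and folding, but distributed across partitions.

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(func, pairs):
    """Group values by key, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

# made-up words standing in for the cleaned tokens
words = ["night", "dark", "night", "stars", "night", "dark"]

# map step: one (word, 1) pair per token
pairs = [(w, 1) for w in words]

# reduce step: the same lambda as in the Spark code
counts = reduce_by_key(lambda acc, value: acc + value, pairs)
print(counts)  # {'night': 3, 'dark': 2, 'stars': 1}
```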
Displaying only the 10 most frequent words.
# sort by count, highest first, and keep (word, count) order for charting
results10 = resultsRDD.sortBy(lambda x: x[1], ascending=False).take(10)
print(results10)
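Once the counts are collected into a native Python list, the top entries can also be picked locally. The sketch below uses `heapq.nlargest` from the standard library on hypothetical (word, count) pairs; it is an illustration of the idea, not the method used in this project.

```python
import heapq

# hypothetical (word, count) pairs, shaped like the collected results
results = [("night", 3), ("dark", 2), ("stars", 1), ("dream", 5), ("sea", 4)]

# keep the 3 pairs with the largest counts, highest first
top3 = heapq.nlargest(3, results, key=lambda pair: pair[1])
print(top3)  # [('dream', 5), ('sea', 4), ('night', 3)]
```

For large results it is usually cheaper to select the top entries on the RDD before collecting, so that only a handful of pairs travel back to the driver.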
- Charting the values using pandas, matplotlib, and seaborn for color.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
source = 'Project Gutenberg: Spirits in Bondage, by C. S. Lewis (AKA Clive Hamilton)'
title = 'Top Words in ' + source
xlabel = 'Words'
ylabel = 'Count'
df = pd.DataFrame.from_records(results10, columns =[xlabel, ylabel])
plt.figure(figsize=(20,4))
# pass x and y as keywords; recent seaborn versions reject positional column names
sns.barplot(x=xlabel, y=ylabel, data=df, palette="YlOrBr").set_title(title)