Author : Tarun Sarpanjeri
- Python
- Databricks Notebook
- PySpark
- Pandas
- Spark processing engine
Here is the Link.
To pull the data into the notebook, we use urllib.request to download it from a URL and store it in a temporary text file named 'tarun.txt'. The data I used is saved as a text file in my GitHub repo.
# Import the library for processing url request.
import urllib.request
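# Retrieve the data and store it in a temporary file.
# Note: the raw.githubusercontent.com form of the URL is used here, because the
# github.com/.../blob/... page returns HTML rather than the plain text.
urllib.request.urlretrieve("https://raw.githubusercontent.com/dexterstr/Tarun-Bigdata-Project/main/The_Great_Gatsby.txt", "/tmp/tarun.txt")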
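GitHub serves a file in two forms: the 'blob' page is HTML, while the plain file contents come from raw.githubusercontent.com. The sketch below shows a hypothetical helper, `to_raw_url` (not part of the original notebook), that rewrites one form into the other. It assumes the usual `github.com/<user>/<repo>/blob/<branch>/<path>` layout.

```python
from urllib.parse import urlparse

def to_raw_url(blob_url: str) -> str:
    """Rewrite a github.com 'blob' URL to its raw.githubusercontent.com form.

    Hypothetical helper for illustration; URLs that do not match the
    expected layout are returned unchanged.
    """
    parts = urlparse(blob_url)
    segments = parts.path.strip("/").split("/")
    # Expected layout: <user>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return blob_url
    user, repo, _, branch, *path = segments
    return "https://raw.githubusercontent.com/" + "/".join([user, repo, branch, *path])

raw_url = to_raw_url(
    "https://github.com/dexterstr/Tarun-Bigdata-Project/blob/main/The_Great_Gatsby.txt"
)
print(raw_url)
```

Opening the downloaded file and checking that it starts with text rather than HTML markup is a quick sanity check before moving on.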
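To make the file available to the notebook, we use the method 'dbutils.fs.mv', which takes two arguments, the source and the destination, and moves the file from one location to the other. Here it moves the file from the driver's local file system to DBFS.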
dbutils.fs.mv("file:/tmp/tarun.txt","dbfs:/data/tarun.txt")
Spark holds data in RDDs, i.e. Resilient Distributed Datasets, so we load the file into an RDD.
tarunRDD = sc.textFile("dbfs:/data/tarun.txt")
The raw text contains punctuation, full sentences, empty lines and stop words. We start cleaning it by converting each line to lower case, stripping surrounding whitespace and splitting it on spaces, so that every sentence is broken into individual words.
cleanRDD = tarunRDD.flatMap(lambda line: line.lower().strip().split(" "))
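To see what this flatMap step does without a cluster, here is a plain-Python sketch of the same idea. It uses `itertools.chain` in place of Spark's `flatMap`, and the sample lines are made up for illustration.

```python
from itertools import chain

# Made-up sample lines standing in for the text file (assumption).
lines = [
    "In my younger and more vulnerable years",
    "",
    "  my father gave me some advice  ",
]

def tokenize(line):
    # Per-line logic: lower-case, trim, split on single spaces.
    return line.lower().strip().split(" ")

# A plain map would give one list per line; chaining the lists flattens
# them into a single sequence of tokens, which is what flatMap does.
tokens = list(chain.from_iterable(tokenize(line) for line in lines))
print(tokens)
```

Note that the empty line yields an empty-string token, so empty tokens still need to be filtered out at some point in the pipeline.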
All punctuation can be removed with a regular expression that matches every character except letters. To use regular expressions, the library 're' must be imported.
import re
cleanTokensRDD = cleanRDD.map(lambda w: re.sub(r'[^a-zA-Z]','',w))
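A quick way to check the pattern is to run it on a few tokens in plain Python. The sample tokens below are made up; note that a token with no letters at all collapses to an empty string.

```python
import re

# Made-up tokens for illustration (assumption, not taken from the book).
sample = ["gatsby,", "don't", "east-egg", "1922", "--"]

# [^a-zA-Z] matches any single character that is not an ASCII letter,
# so every such character is replaced with nothing.
cleaned = [re.sub(r"[^a-zA-Z]", "", w) for w in sample]
print(cleaned)
```

The pattern also strips apostrophes and hyphens, so "don't" becomes "dont" and hyphenated words are joined.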
Finally, the stop words must be removed. PySpark ships with a default list of English stop words, so we just need to import StopWordsRemover, take its list and filter those words out.
from pyspark.ml.feature import StopWordsRemover
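remover = StopWordsRemover()
# Default English stop-word list; a set makes the membership test fast.
stopwords = set(remover.getStopWords())
# Drop stop words and the empty strings left behind by the earlier steps.
cleanwordRDD = cleanTokensRDD.filter(lambda w: w != "" and w not in stopwords)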
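The filter can be tried out in plain Python as well. The sketch below uses a tiny hand-picked stop-word set, which is an assumption for illustration; the list returned by `getStopWords()` is much longer. It also drops empty strings, which the punctuation step can produce.

```python
# A tiny hand-picked stop-word set for illustration (assumption).
stopwords = {"the", "and", "a", "of", "in", "my"}

tokens = ["in", "my", "younger", "", "and", "more", "vulnerable", "years"]

# Keep a token only if it is non-empty and not a stop word.
kept = [w for w in tokens if w and w not in stopwords]
print(kept)
```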
Next, we map the words into key-value pairs: each word becomes the key and is paired with the value 1, giving pairs in the format (word, 1).
KeyValuePairsRDD = cleanwordRDD.map(lambda word: (word, 1))
Reduce by key: the key is the word. Whenever a word repeats, reduceByKey merges the pairs that share that key by adding their values, so each word ends up with a single total count.
wordCountRDD = KeyValuePairsRDD.reduceByKey(lambda acc, value: acc+value)
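The map and reduce steps can be imitated in plain Python to make the mechanics concrete. This is only a sketch: `reduce_by_key` is a hypothetical helper that mimics what reduceByKey does on a single machine, and the word list is made up.

```python
# Made-up words for illustration (assumption).
words = ["gatsby", "daisy", "gatsby", "tom", "daisy", "gatsby"]

# Step 1: pair every word with the count 1, as in the map step.
pairs = [(word, 1) for word in words]

def reduce_by_key(pairs, func):
    """Combine the values of pairs that share a key using func."""
    result = {}
    for key, value in pairs:
        # First occurrence stores the value; later ones are merged into it.
        result[key] = func(result[key], value) if key in result else value
    return result

# Step 2: merge values per key with the same function used above.
counts = reduce_by_key(pairs, lambda acc, value: acc + value)
print(counts)
```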
To retrieve all elements we use the collect() method, which brings the results back to the driver, and the print() function to display them.
results = wordCountRDD.collect()
print(results)
We swap each pair to (count, word), use the sortByKey method to sort by count in descending order, and print the top 11 results for 'The Great Gatsby'.
sort_results = wordCountRDD.map(lambda x: (x[1], x[0])).sortByKey(False).take(11)
print(sort_results)
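The swap-then-sort idea looks like this in plain Python. The counts are made up, and `sorted` with a slice stands in for `sortByKey(False).take(n)`.

```python
# Made-up (word, count) pairs for illustration (assumption).
counts = [("gatsby", 3), ("daisy", 2), ("tom", 1), ("nick", 5)]

# Swap to (count, word) so the count becomes the sort key.
swapped = [(count, word) for word, count in counts]

# Sort by the key only, in descending order, then keep the first n entries.
n = 2
top = sorted(swapped, key=lambda pair: pair[0], reverse=True)[:n]
print(top)
```

PySpark's RDD API also offers takeOrdered, which may be worth a look as a way to get the top entries without the swap; whether it suits this notebook depends on the pair order expected by the plotting step.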
We will use pandas, Matplotlib and seaborn to plot the results as a bar chart.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
source = 'Great Gatsby'
title = 'The ' + source
xlabel = 'Count'
ylabel = 'Words'
df = pd.DataFrame.from_records(sort_results, columns =[xlabel, ylabel])
plt.figure(figsize=(10,3))
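# Pass x and y as keyword arguments; recent seaborn versions no longer accept them positionally.
sns.barplot(x=xlabel, y=ylabel, data=df, color="black").set_title(title)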