Tech-with-Vidhya / apache-spark-rdd-computations-E2E-implementation-with-transformations-and-actions-gutenberg-data




This project was delivered as part of my Master's in Big Data Science (MSc BDS) programme, for the module "Big Data Processing" at Queen Mary University of London (QMUL), London, United Kingdom.

It covers the development of Spark RDD computations from scratch, using Python's pyspark package and regular-expression functions, on a private "Gutenberg" dataset: hundreds of books in different languages downloaded from Project Gutenberg.

The solution applies basic transformations (namely flatMap, map, and reduceByKey) and actions to the RDDs, and the Spark jobs were submitted to a cluster.

Implemented solutions to the following questions and scenarios:

  1. Counting the total number of words
  2. Counting the number of occurrences of each unique word
  3. Computing the top 10 words using Spark's 'takeOrdered' action
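Since the assignment code itself is not published (see the note below), the sketch here only illustrates what such a PySpark pipeline typically looks like. The input path, the app name, and the apostrophe-aware tokenising regex are all assumptions, not details from the original solution.

```python
import re

# Regex for splitting text into lowercase word tokens; treating apostrophes
# as word characters is an assumption, not part of the original assignment.
WORD_RE = re.compile(r"[a-z']+")

def tokenize(line):
    """Split one line of text into lowercase word tokens."""
    return WORD_RE.findall(line.lower())

if __name__ == "__main__":
    # pyspark is imported lazily so tokenize() can be reused without Spark.
    from pyspark import SparkContext

    sc = SparkContext(appName="GutenbergWordCount")

    # "books/*.txt" is a placeholder path; the real dataset is private.
    lines = sc.textFile("books/*.txt")

    words = lines.flatMap(tokenize)               # transformation
    print("Total words:", words.count())          # action (question 1)

    # Question 2: occurrences of each unique word.
    counts = words.map(lambda w: (w, 1)) \
                  .reduceByKey(lambda a, b: a + b)

    # Question 3: top 10 most frequent words, highest count first.
    for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
        print(word, n)

    sc.stop()
```

A script like this would be submitted to the cluster with spark-submit, e.g. `spark-submit word_count.py`.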

NOTE: To comply with QMUL's data-privacy and data-protection policies, which students must adhere to, the datasets and the solution code are not published in this public GitHub profile.
