P7h / Coursera__UCSD__Big_Data_Specialization

"Big Data" Specialization -- University of California, San Diego and Coursera

The University of California, San Diego created a "Big Data" specialization on Coursera: https://www.coursera.org/specializations/big-data

This repo contains solutions in Scala on Spark 2.x (and, where possible, on Spark 1.6.x as well) for the Spark-specific questions in the "Big Data" Specialization.

All solutions are written in Scala and provided as Apache Toree Jupyter notebooks.

Big Data Integration and Processing

Final Project

The analytics below are computed from two CSV files: one containing tweets and another containing countries.

  1. As a Sports Analyst, you are interested in how many different countries are mentioned in the tweets. Use Spark to calculate this number. Note that regardless of how many times a single country is mentioned, this country only contributes 1 to the total.
  2. Next, compute the total number of times any country is mentioned. This is different from the previous question since in this calculation, if a country is mentioned three times, then it contributes 3 to the total.
  3. Your next task is to determine the most popular countries. You can do this by finding the three countries mentioned the most.
  4. After exploring the dataset, you are now interested in how many times specific countries are mentioned. For example, how many times was France mentioned?
  5. Which country has the most mentions: Kenya, Wales, or Netherlands?
  6. Finally, what is the average number of times a country is mentioned?
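
The questions above can be sketched in Scala with the Spark Dataset API roughly as follows. This is a minimal sketch and not the notebook's actual solution: the layout of the two CSV files is not described here, so the example uses small in-memory samples instead, and the names `CountryMentions`, `Summary`, `analyze`, `sampleTweets`, and `sampleCountries` are all hypothetical. It also assumes a country counts as mentioned when a whitespace-separated word of a tweet matches a country name exactly, which ignores punctuation, case, and multi-word country names. The average is taken over the countries mentioned at least once; dividing by every country in the list is another possible reading of question 6.

```scala
import org.apache.spark.sql.SparkSession

object CountryMentions {
  // Hypothetical result holder, one field per question in the list above.
  case class Summary(
      distinctCountries: Long,          // question 1
      totalMentions: Long,              // question 2
      top3: Seq[(String, Long)],        // question 3
      franceMentions: Long,             // question 4
      topOfThree: Option[String],       // question 5
      averageMentions: Double)          // question 6

  // Hypothetical sample data standing in for the two CSV files.
  val sampleTweets: Seq[String] = Seq(
    "France beat Kenya today",
    "Wales and France drew",
    "Kenya fans cheer for France",
    "No country here")
  val sampleCountries: Seq[String] = Seq("France", "Kenya", "Wales", "Netherlands")

  def analyze(spark: SparkSession, tweets: Seq[String], countries: Seq[String]): Summary = {
    import spark.implicits._

    // One row per whitespace-separated word of every tweet.
    val words = tweets.toDS().flatMap(line => line.split("\\s+")).toDF("word")
    val countryNames = countries.toDF("country")

    // Inner join keeps only words that are country names: one row per mention.
    val mentions = words.join(countryNames, $"word" === $"country").select("country")

    // Mentions per country; only countries mentioned at least once appear here.
    val perCountry = mentions.groupBy("country").count().as[(String, Long)].collect().toSeq

    val distinct = perCountry.size.toLong
    val total = perCountry.map(_._2).sum
    // Sort by count descending, then by name so that ties are deterministic.
    val ranked = perCountry.sortBy { case (name, n) => (-n, name) }
    val counts = perCountry.toMap.withDefaultValue(0L)
    val candidates = Seq("Kenya", "Wales", "Netherlands")

    Summary(
      distinctCountries = distinct,
      totalMentions = total,
      top3 = ranked.take(3),
      franceMentions = counts("France"),
      topOfThree = ranked.map(_._1).find(candidates.contains),
      averageMentions = if (distinct == 0) 0.0 else total.toDouble / distinct)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CountryMentions").master("local[*]").getOrCreate()
    try println(analyze(spark, sampleTweets, sampleCountries))
    finally spark.stop()
  }
}
```

In a Toree notebook a `spark` session is typically already available, so `analyze(spark, ...)` could be called directly. With the real data, the two `Seq` arguments would be replaced by whatever is read from the tweet and country CSV files.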

Hadoop Platform and Application Framework

Spark Joins

License:Apache License 2.0


Languages

Language:Jupyter Notebook 100.0%