Capstone Spark project

Data

Steps to generate input data for your project:

Copy the bigdata-input-generator folder in the root directory of your exam repo
Install the required python packages and run the main script from the root directory of the exam repo to generate the datasets (you can use virtualenv for this step):

pip install -r bigdata-input-generator/requirements.txt
python bigdata-input-generator/main.py

Add capstone-dataset (generated folder) to .gitignore in your exam repo

Data structure

Mobile App clickstream projection

Schema:

userId: String
eventId: String
eventTime: Timestamp
eventType: String
attributes: Map[String, String]

There could be events of the following types that form a user engagement session:

app_open
search_product
view_product_details
purchase
app_close

Events of app_open type may contain the attributes relevant to the marketing analysis:

campaign_id
channel_id

Events of purchase type contain purchase_id attribute.

Sample: mobile-app-clickstream_sample

Generated dataset:

Purchases projection

Schema:
- purchaseId: String
- purchaseTime: Timestamp
- billingCost: Double
- isConfirmed: Boolean

Tasks

General requirements

The requirements that apply to all tasks:

Use Spark version 2.4 or higher
The logic should be covered with unit tests
The output should be saved as PARQUET files.
Will be a plus: README file in the project documenting the solution.
Will be a plus: Integrational tests that cover the main method.

1. Build Purchases Attribution Projection

The projection is dedicated to enabling a subsequent analysis of marketing campaigns and channels. The target schema:

purchaseId: String
purchaseTime: Timestamp
billingCost: Double
isConfirmed: Boolean

// a session starts with app_open event and finishes with app_close

sessionId: String
campaignId: String // derived from app_open#attributes#campaign_id
channelIid: String // derived from app_open#attributes#channel_id

Requirements for implementation of the projection building logic:

Implement it by utilizing default Spark SQL capabilities.
Implement it by using a custom UDF.

2. Calculate Marketing Campaigns And Channels Statistics

Use the purchases-attribution projection to build aggregates that provide the following insights: Top Campaigns:

What are the Top 10 marketing campaigns that bring the biggest revenue (based on billingCost of confirmed purchases)?

Channels engagement performance:

What is the most popular (i.e. Top) channel that drives the highest amount of unique sessions (engagements) with the App in each campaign?

Requirements for task #2:

Should be implemented by using plain SQL on top of Spark DataFrame API
Will be a plus: an additional alternative implementation of the same tasks by using Spark DataFrame API only (without plain SQL)

Solution

create virtualenv
clone
cd to progect
install requirements
export PYTHONPATH=${PWD} pwd
run like python jobs/main.py

1.1

Using standard DataFrame API implemented a function with custom creation of sessionId field based on event types. Cleared attributes field to be in the same format as the rest of the dataset. Retrieved needed columns from it.

1.2

2.1

Using both plain SQL and DataFrame API got the names of top marketing campaigns hat bring the biggest revenue.

2.2

Using both plain SQL and DataFrame API got the names of the channels that drives the highest amount of unique sessions in each campaign.

ashedow / gridu-SPARK