
Azure-ML-vs-AWS-SageMaker-

Overview

In this tutorial, you will go through various ways of importing, transforming, analyzing, and exporting data with SageMaker. It walks you through Data Wrangler in Amazon SageMaker Studio and through SageMaker Canvas.

Data Wrangler

Step 0: Before You Start

  • AWS Account - Sign up for a free AWS account here

Step 1: Download Dataset and Upload to S3 Bucket

  • Download the Titanic dataset from the GitHub page - grab the .csv file
  • Navigate to the AWS portal and type S3 in the search bar
    • Hit Create bucket: name your bucket and leave all other settings at their defaults.

(screenshot p1)

  • Upload the .csv file into the bucket you created using the Upload button (a boto3 sketch of the same upload is shown below)
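    If you prefer to script the upload instead of using the console, a minimal boto3 sketch looks like the following. The bucket name my-titanic-bucket and the local file name titanic.csv are placeholders for whatever you chose above.

    import boto3

    # Placeholders - replace with your bucket name and the path to the downloaded CSV.
    bucket_name = "my-titanic-bucket"
    local_file = "titanic.csv"

    # Upload the local CSV into the bucket under the key "titanic.csv".
    s3 = boto3.client("s3")
    s3.upload_file(local_file, bucket_name, "titanic.csv")
    print(f"Uploaded {local_file} to s3://{bucket_name}/titanic.csv")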

Step 2: Set up Studio

  • Navigate to the AWS portal and type SageMaker in the search bar
  • Create a SageMaker Domain (the console quick setup is the easiest path; a programmatic sketch is shown below)

(screenshots p2 and p3)
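    If you would rather create the Domain programmatically than through the console quick setup, a rough boto3 sketch is below. The execution role ARN, VPC ID, and subnet ID are placeholders you would need to replace with values from your own account.

    import boto3

    sm = boto3.client("sagemaker")

    # All identifiers below are placeholders for values from your own account.
    response = sm.create_domain(
        DomainName="data-wrangler-tutorial",
        AuthMode="IAM",
        DefaultUserSettings={"ExecutionRole": "arn:aws:iam::123456789012:role/SageMakerExecutionRole"},
        VpcId="vpc-0123456789abcdef0",
        SubnetIds=["subnet-0123456789abcdef0"],
    )
    print(response["DomainArn"])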

Step 3: Get Started with Data Wrangler

  • Hit Launch App -> Studio (screenshot p4)

  • Create a Data Wrangler Flow by going to File -> New -> Data Wrangler Flow (screenshot p5)

  • Import S3 Data

    • Choose your S3 bucket and click the .csv file you uploaded. You should see a preview of your data at the very bottom. Then click Import at the very top. A rough pandas equivalent of this import is sketched below. (screenshots p6 and p7)
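    Under the hood this step just reads the CSV out of S3. A rough pandas equivalent, assuming the s3fs package is installed, your AWS credentials are configured, and my-titanic-bucket is the bucket you created, is:

    import pandas as pd

    # Reading directly from S3 requires the s3fs package and valid AWS credentials.
    df = pd.read_csv("s3://my-titanic-bucket/titanic.csv")
    print(df.head())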

Step 4: Begin adding to the Flow

  • In the Data Flow, you should see just two steps - the data from S3 and the data types. Data Wrangler has already inferred the data types for you.

  • Analyze: Now let's start analyzing the data. To get a feel for the data, click + Add Analysis. For Analysis Type, select Table Summary from the drop-down. As the output shows, there are many missing values in the cabin, embarked, and home.dest columns, and there are outliers as well. You can play around with other pre-built analysis features, such as creating a Histogram. A pandas sketch of a similar summary follows this step. (screenshots p8 and p9)
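    If you want to reproduce a similar summary outside Data Wrangler, a quick pandas sketch (using the df from the import sketch above) is:

    # Summary statistics, roughly what Table Summary reports.
    print(df.describe(include="all"))

    # Count missing values per column - cabin, embarked, and home.dest stand out.
    print(df.isnull().sum())

    # A histogram of a numeric column, similar to the built-in Histogram analysis
    # (uses matplotlib under the hood).
    df["age"].hist(bins=20)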

  • Transform: Now let's transform the data. This is where you clean it up. Click + Add Transform, then + Add Step at the top right of the screen.

    • Drop Columns: Here we want to use the built-in option to drop the columns that are unnecessary. Under Transform, select the Drop column option. For the columns to drop, select: cabin, ticket, name, sibsp, parch, home.dest, boat, and body. Then hit Preview, and finally click Add. A pandas equivalent of this drop is sketched below. (screenshot p10)
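    For reference, the same drop expressed in pandas (on the df from the import sketch above) would look roughly like:

    # Drop the columns that won't be used for modeling.
    df = df.drop(columns=["cabin", "ticket", "name", "sibsp", "parch",
                          "home.dest", "boat", "body"])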

    • Missing Data: Click + Add Step again. Choose the built-in Handle Missing option. Under Transform, click Drop missing. For Input columns, select age. Hit Preview, then click Add. Repeat this step for the fare column. A pandas equivalent appears after the screenshot below.

    (screenshot p11)
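    The pandas equivalent of these two Handle Missing steps is roughly:

    # Drop rows where age or fare is missing, mirroring the two Drop missing steps.
    df = df.dropna(subset=["age"])
    df = df.dropna(subset=["fare"])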

    • Let's take a look at the Custom Transform option. Select Python (Pandas) and enter this query: df.info(). You will see 1,045 entries remaining after all the transformations. We do not have to save this. Click Custom transform again and select Python (Pandas).

Insert this code:

import pandas as pd

# Inside a custom Pandas transform, Data Wrangler supplies the current
# data frame as `df`. One-hot encode the categorical columns.
dummies = []
cols = ['pclass', 'sex', 'embarked']
for col in cols:
    dummies.append(pd.get_dummies(df[col]))

# Append the encoded columns to the original data frame.
encoded = pd.concat(dummies, axis=1)
df = pd.concat([df, encoded], axis=1)

Hit preview and click Add. You should see some new columns on the right.

  • SQL: Hit Custom Transform again. Click SQL (PySpark SQL). Here we want to select the columns we want to keep.

    Insert this code:

    SELECT survived, age, fare, `1`, `2`, `3`, female, male, C, Q, S FROM df;

    (The one-hot columns created from pclass are named 1, 2, and 3, so they are wrapped in backticks; without them Spark SQL would treat 1, 2, and 3 as numeric literals rather than column names.)
    
  • You're done with the Data Flow! Now we want to Export. Let's explore different options by going to +Export.

Step 5: Clean up!

  • Make sure all and any S3 buckets are deleted!
  • In the Studio: on the left, click Running Terminals and Kernels. Next to Running Apps, click the X. This closes out all your application sessions. I also close all my notebooks.
  • Delete all Domains open for Studio and Canvas (Part 2)

Canvas

Tutorial - https://docs.aws.amazon.com/sagemaker/latest/dg/canvas-getting-started.html
