Amazon SageMaker Spark Strata tutorial at the 2018 O'Reilly Strata NYC Conference
This repository contains supporting material for the Amazon SageMaker tutorial at the 2018 O'Reilly Strata NYC Conference.
Setup
- Log in to your AWS account
- Select EMR from services and create a new cluster:
  - Go to advanced options
  - Select only Spark and Livy
  - Click next through the rest (feel free to give a custom name, etc.)
- Select SageMaker from services and create a SageMaker notebook instance:
  - Create a new IAM role with access to any S3 bucket
  - Use the same VPC as your EMR cluster
  - Take note of the notebook instance's security group
- Return to your EMR cluster:
  - Take note of the master node's private IP address
  - Click on the security group for your master node
  - Add an inbound rule for Custom TCP on port 8998 (the Livy port) with the notebook security group as the source
- Select IAM from services:
  - Select Roles
  - Select EMR_EC2_DefaultRole
  - Attach the AmazonSageMakerFullAccess policy
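- Open your SageMaker notebook instance, start a new terminal, and run the following, replacing <emr-master-private-ip> with the master node's private IP address noted earlier: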
echo '{"kernel_python_credentials" : {"url": "http://<emr-master-private-ip>:8998/"}, "session_configs": {"executorMemory": "2g","executorCores": 2,"numExecutors":4}}' > ~/.sparkmagic/config.json
curl <emr-master-private-ip>:8998/sessions
git clone https://github.com/djarpin/strata-sagemaker-spark.git
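
The sparkmagic configuration and the Livy connectivity check above can also be done from Python, which makes it easier to catch a malformed JSON string or a missing `~/.sparkmagic` directory. The following is a minimal sketch, not part of the tutorial material: the function names (`build_sparkmagic_config`, `write_sparkmagic_config`, `list_livy_sessions`) are hypothetical helpers introduced here for illustration, and the default values simply mirror the `echo` command.

```python
import json
import urllib.request
from pathlib import Path

LIVY_PORT = 8998  # the port opened in the EMR master security group


def build_sparkmagic_config(emr_master_ip, executor_memory="2g",
                            executor_cores=2, num_executors=4):
    """Return the same settings as the echo command above, as a dict."""
    return {
        "kernel_python_credentials": {
            "url": f"http://{emr_master_ip}:{LIVY_PORT}/"
        },
        "session_configs": {
            "executorMemory": executor_memory,
            "executorCores": executor_cores,
            "numExecutors": num_executors,
        },
    }


def write_sparkmagic_config(config, path=None):
    """Write the config as JSON, creating the parent directory if needed."""
    if path is None:
        path = Path.home() / ".sparkmagic" / "config.json"
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


def list_livy_sessions(emr_master_ip, timeout=5):
    """Equivalent of the curl check: GET /sessions on the Livy endpoint.

    Raises an error from urllib if the endpoint is unreachable, which
    usually points at the security group rule or the VPC settings.
    """
    url = f"http://{emr_master_ip}:{LIVY_PORT}/sessions"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `write_sparkmagic_config(build_sparkmagic_config("<emr-master-private-ip>"))` with the real address writes the file to the default location, and `list_livy_sessions("<emr-master-private-ip>")` returns the parsed response from Livy if the notebook instance can reach the cluster.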
Contents
- XGBoost with EMR - A light modification of the existing XGBoost example notebook that runs Spark on an external EMR cluster.
- BYO - A modification of the existing custom estimator example notebook that trains a convolutional neural network in a bring-your-own PyTorch container.