newfront / spark-intro-to-ml

A Gentle Introduction to Machine Learning with Apache Spark

Workshop Material: Introduction to Machine Learning with Apache Spark and Redis

About the Speaker

Find me on Twitter: @newfront

Find me on Medium: @newfrontcreative

About Twilio: Twilio

Runtime Requirements

  1. Docker (at least 2 CPU cores and 8 GB of RAM)
  2. System Terminal (iTerm, Terminal, etc)
  3. Working Web Browser (Chrome or Firefox)

Docker

Install Docker Desktop (https://www.docker.com/products/docker-desktop)

Additional Docker Resources:

Docker Runtime Recommendations

  1. 2 or more CPU cores.
  2. 8 GB of RAM or more.

Configuration

  1. The Apache Spark configuration is stored in /install/spark-defaults.conf. You can update those settings to match the configuration of your Docker setup.

The Spark defaults are shown below.

spark.cores.max 4
spark.executor.memory 8g
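As a rough sketch of how these two settings could be matched to a Docker setup, the helper below prints candidate values from a CPU count and a memory size in bytes. The function name suggest_spark_conf and the round-down rule are assumptions made for illustration; they are not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workshop scripts):
# print spark-defaults.conf lines that mirror the resources given to Docker.
# Usage: suggest_spark_conf <cpu_count> <memory_in_bytes>
suggest_spark_conf() {
  local cpus mem_bytes mem_gb
  cpus="$1"
  mem_bytes="$2"
  # Convert bytes to whole GiB; integer division rounds down.
  mem_gb=$(( mem_bytes / 1073741824 ))
  # Assumption: never suggest less than 1g of executor memory.
  if [ "$mem_gb" -lt 1 ]; then
    mem_gb=1
  fi
  printf 'spark.cores.max %s\n' "$cpus"
  printf 'spark.executor.memory %sg\n' "$mem_gb"
}
```

With Docker running, the inputs can be read from docker info, for example: suggest_spark_conf "$(docker info --format '{{.NCPU}}')" "$(docker info --format '{{.MemTotal}}')". Treat the output as a starting point and copy the values into /install/spark-defaults.conf by hand.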

Installation

  1. Install Docker (See Docker above)
  2. Once Docker is installed, open your terminal application and change into the repository's docker directory: cd /spark-intro-to-ml/docker
  3. ./run.sh install
  4. ./run.sh start
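Before running the steps above, it can help to confirm that the docker command is actually on your PATH. The following is a minimal sketch; require_cmd is a hypothetical helper name, not something provided by run.sh.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of run.sh):
# succeed if the named command is on PATH, otherwise print a hint and fail.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  printf 'Missing required command: %s\n' "$1" >&2
  return 1
}
```

One possible use, from inside the docker directory: require_cmd docker && ./run.sh install && ./run.sh start. If Docker is missing, the chain stops before the install script runs.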

Checking Zeppelin and Updating Zeppelin

  1. The main application (Zeppelin) should now be running at http://localhost:8080/
  2. Running docker exec -it redis5 redis-cli should show the prompt 127.0.0.1:6379>. This should be a fresh install. Try entering info to see the redis-server configuration.
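The containers may need a little time after ./run.sh start before they respond. The two checks above can be wrapped in a small retry loop, sketched below. The retry function is a hypothetical helper, and the attempt counts and delays in the usage note are arbitrary choices.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts delay i
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Succeed as soon as the probed command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Pause between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$(( i + 1 ))
  done
  return 1
}
```

For example, retry 30 2 curl -fsS -o /dev/null http://localhost:8080/ waits for Zeppelin to answer, and retry 30 2 docker exec redis5 redis-cli ping waits for Redis (redis-cli ping replies PONG when the server is reachable).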

Monitoring Redis as you run the Workshop Material

The following command lets you view every command that hits Redis during the workshop.

docker exec -it redis5 redis-cli monitor
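The monitor stream can get noisy. As a sketch, the filter below keeps only the lines for one Redis command. It assumes the usual monitor line layout (timestamp, a bracketed database and client address, then the quoted command name), and only_command is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read redis-cli monitor output on stdin and print
# only the lines whose command name matches the argument (case-insensitive).
# Usage: ... | only_command <redis_command>
only_command() {
  local want
  want="$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
  awk -v want="$want" 'NF >= 4 {
    # Field 4 is the quoted command name, e.g. "set".
    cmd = toupper($4)
    gsub(/"/, "", cmd)
    if (cmd == want) print
  }'
}
```

For example, docker exec redis5 redis-cli monitor | only_command set would show only SET calls. The -it flags are left out in this usage because the output is piped rather than read interactively.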

The Workshop

  1. Open your browser at http://localhost:8080 and you should see the Zeppelin home screen.
  2. Click on the notebook named 1-LoadAndQuery. When it loads, select the spark and md interpreters to attach to the notebook, then press the Run All Paragraphs button at the top.
  3. This first notebook will take you through to 2-LoadTransformAndCluster and finally to 3-ReloadAndPredictLogistically.

Datasets and Technology Used: No need to download these. The install script will do it all for you.

Technologies Used

  1. Apache Zeppelin
  2. Apache Spark
  3. Redis

Spark 2.4.5

Redis Docker Hub (v5.0.7)

https://hub.docker.com/_/redis/

Spark Redis (v2.4.0)

https://github.com/RedisLabs/spark-redis

Datasets

Full Training Videos (Free)

  1. Getting Ready - Intro to Zeppelin
  2. Part 1 - Spark Basics
  3. Part 2 - Exploratory Data Analysis
  4. Part 3 - Feature Engineering
  5. Part 4 - Regression Techniques with Linear and Logistic Regression
  6. Part 5 - Streaming Predictions

About

A Gentle Introduction to Machine Learning with Apache Spark

License: Apache License 2.0


Languages

Scala 85.6%, Shell 14.4%