huseinzol05 / Pyspark-ML

Gathers data science and machine learning problem solving using PySpark and Hadoop.

Pyspark-ML

Covered

  1. Test PySpark
  2. Text classification IMDB dataset using logistic regression
  3. Text classification IMDB dataset using multinomial
  4. Topic Modelling TFIDF + LDA
  5. Word Vector
  6. Read Iris csv from Hadoop DFS
  7. PCA on Iris dataset
  8. MNIST feed-forward sparkflow
  9. MNIST CNN sparkflow
  10. MNIST RNN-LSTM sparkflow
  11. Fashion-MNIST Inception v1 sparkflow

How-to Notebook

  1. Run docker compose:

     compose/build

     Or choose cluster mode:

     docker-compose -f docker-compose-cluster.yml up --build --remove-orphans

  2. Visit localhost:8089 for the passwordless Jupyter notebook.

How-to Hadoop

Check Hadoop health, localhost:9870

Hadoop DFS Web UI, localhost:9870/explorer.html#/

Hadoop Node Manager, localhost:8042/node
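
The three Hadoop pages above can also be checked from a script rather than a browser. The following is a minimal sketch using only the Python standard library. The names `ENDPOINTS`, `is_up`, and `check_all` are hypothetical helpers invented for this example, and the `http://` scheme is an assumption, since the README lists only host, port, and path.

```python
from http.client import HTTPException
from urllib.request import urlopen

# URLs taken from this README; the dictionary keys are labels chosen for
# this sketch, and the http:// scheme is an assumption.
ENDPOINTS = {
    "hadoop-health": "http://localhost:9870",
    "hadoop-dfs-web-ui": "http://localhost:9870/explorer.html",
    "hadoop-node-manager": "http://localhost:8042/node",
}


def is_up(url, timeout=3.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (OSError, HTTPException, ValueError):
        # OSError covers refused connections, timeouts, and HTTP error
        # statuses (urllib raises those as HTTPError, an OSError subclass);
        # ValueError covers malformed URLs.
        return False


def check_all(endpoints, timeout=3.0):
    """Probe every endpoint and map its label to True (up) or False (down)."""
    return {name: is_up(url, timeout) for name, url in endpoints.items()}
```

Calling `check_all(ENDPOINTS)` after the containers have started returns a dictionary mapping each label to `True` or `False`. A `False` value only means the page did not answer within the timeout, so a service that is still starting may need a second attempt.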

How-to Spark-cluster

If cluster mode starts successfully, each worker logs a registration message:

slave_2   | 2018-11-18 07:57:59 INFO  Worker:54 - Successfully registered with master spark://192.168.128.2:7077
slave_1   | 2018-11-18 07:58:10 INFO  Worker:54 - Successfully registered with master spark://192.168.128.2:7077

Check Spark health, localhost:8080
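
To confirm registration without reading the compose output by eye, the worker log lines shown earlier in this section can be parsed. This is a minimal sketch that assumes the log lines keep the exact format of the two sample lines. The names `REGISTERED` and `registered_workers` are hypothetical and not part of this repository.

```python
import re

# Matches lines such as:
# "slave_2   | 2018-11-18 07:57:59 INFO  Worker:54 - Successfully registered
#  with master spark://192.168.128.2:7077"
# The pattern is an assumption based on the two sample lines in this README.
REGISTERED = re.compile(
    r"^(?P<container>\S+)\s*\|\s*"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+INFO\s+"
    r"Worker:\d+ - Successfully registered with master "
    r"(?P<master>spark://\S+)"
)


def registered_workers(log_lines):
    """Map each container name to the master URL it registered with."""
    workers = {}
    for line in log_lines:
        match = REGISTERED.match(line.strip())
        if match:
            workers[match["container"]] = match["master"]
    return workers
```

Feeding it the saved output of the compose command, for example `registered_workers(open("compose.log"))` with a hypothetical `compose.log` file, yields one entry per worker that reached the master. Lines that do not match the pattern are ignored.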

License: MIT License

Languages

Jupyter Notebook 88.8%, Python 9.8%, Dockerfile 1.0%, Shell 0.4%