ipython-spark-docker

Please see the accompanying blog post, Using Docker to Build an IPython-driven Spark Deployment, for the technical details and motivation behind this project. This repo provides Docker containers to run:

  • Spark master and worker(s) on dedicated hosts
  • IPython user interface within a dedicated Spark client

Architecture

Docker containers provide a portable and repeatable method for deploying the cluster:

[Architecture diagram: hadoop-docker-client connections]

CDH5 Tools and Libraries

HDFS, HBase, Hive, Oozie, Pig, Hue

Python Packages and Modules

Pattern, NLTK, Pandas, NumPy, SciPy, SymPy, Seaborn, Cython, Numba, Biopython, Rmagic, 0MQ, Matplotlib, Scikit-Learn, Statsmodels, Beautiful Soup, NetworkX, LLVM, Bokeh, Vincent, MDP
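
To verify that this Python stack made it into the client image, a quick spot-check along the following lines can help. This is only a sketch: it assumes the lab41/spark-client-ipython image exposes a python interpreter on its PATH and allows the default command to be overridden.

    # Hypothetical spot-check of a few bundled packages inside the client image.
    # Assumes python is on the image's PATH and the default command can be overridden.
    docker run --rm lab41/spark-client-ipython \
        python -c "import pandas, numpy, sklearn; print('python stack OK')"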

Usage

Installation and Deployment - Build each Docker image and run each on its own dedicated host.

Tip: Build a common/shared host image with all necessary configurations and pre-built containers, which you can then use to deploy each node. When starting each node, you can pass the container run scripts as user data to initialize that container at boot time.
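
For example, a worker host's user data could simply invoke the corresponding run script at boot. The snippet below is only a hypothetical sketch: the checkout path /opt/ipython-spark-docker and the master FQDN are placeholders for whatever your shared host image and environment actually use.

    #!/bin/bash
    # Hypothetical user-data script for a worker host. Assumes the repo is already
    # baked into the shared host image at /opt/ipython-spark-docker and that the
    # Spark master's FQDN is known ahead of time; adjust both for your environment.
    cd /opt/ipython-spark-docker
    ./3-run-spark-worker.sh spark://spark-master.domain.com:7077
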
  1. Prerequisites
  • Deploy a Hadoop/HDFS cluster. Spark uses a cluster to distribute analysis of data pulled from multiple sources, including the Hadoop Distributed File System (HDFS). The ephemeral nature of Docker containers makes them ill-suited for persisting long-term data in a cluster, so rather than storing data within the Docker containers' HDFS nodes or mounting host volumes, it is recommended you point this cluster at an external Hadoop deployment. Cloudera provides complete resources for installing and configuring its distribution of Hadoop (CDH). This repo has been tested using CDH5.
  2. Build and configure hosts

  1. Install Docker v1.5+, the jq JSON processor, and iptables. For example, on an Ubuntu host:

    ./0-prepare-host.sh

  2. Update the Hadoop configuration files in runtime/cdh5/<hadoop|hive>/<multiple-files> with the correct hostnames for your Hadoop cluster. Use grep FIXME -R . to find hostnames to change.

  3. Generate a new SSH keypair (config/ssh/id_rsa and config/ssh/id_rsa.pub), adding the public key to config/ssh/authorized_keys.

  4. (optional) Update the SPARK_WORKER_CONFIG environment variable for Spark-specific options such as executor cores. Update the variable via a shell export command or by editing config/sv/spark-client-ipython/ipython/run.

  5. (optional) Comment out any unwanted packages in the base image Dockerfile, dockerfiles/lab41/spark-base.dockerfile.

  6. Get Docker images:

    Option A: If you prefer to pull from Docker Hub:

      docker pull lab41/spark-master
      docker pull lab41/spark-worker
      docker pull lab41/spark-client-ipython

    Option B: If you prefer to build from scratch yourself:

      ./1-build.sh
If you are creating common/shared host images, this would be the point to snapshot the host image for replication.
  3. Deploy cluster nodes
Ensure each host has a fully qualified domain name (e.g. master.domain.com, worker1.domain.com, ipython.domain.com) so the Spark nodes can properly associate with one another. A post-deployment sanity check is sketched after these steps.
  1. Run the master container on the master host:

    ./2-run-spark-master.sh

  2. Run worker container(s) on worker host(s) (replace 'spark-master-fqdn' below):

    ./3-run-spark-worker.sh spark://spark-master-fqdn:7077

  3. Run the client container on the client host (replace 'spark-master-fqdn' below):

    ./4-run-spark-client-ipython.sh spark://spark-master-fqdn:7077
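
Once all three containers are up, a quick post-deployment check along these lines can confirm that the worker(s) registered with the master and that the notebook server is reachable. The ports below are assumptions based on common defaults (8080 for the Spark standalone master web UI, 8888 for the IPython notebook); substitute your own FQDNs and adjust the ports if your containers publish different ones.

    # Sketch of a post-deployment sanity check (ports and FQDNs are assumptions).
    docker ps                                               # on each host: its container should be running
    curl -s http://spark-master-fqdn:8080 | grep -i worker  # master web UI should list the registered worker(s)
    curl -sI http://ipython-fqdn:8888                       # notebook server on the client host should respond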
