giffels / cobald-tardis-spark



Integration of Spark into the Cobald/Tardis system

Setup scripts and documentation to integrate Spark into the Cobald/Tardis system

Setup

  1. Clone this repository including the submodules

git clone --recursive https://github.com/stwunsch/cobald-tardis-spark

  2. Install the required software

The install.sh script installs the required Python and Java software.

cd cobald-tardis-spark/
./install.sh
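
The exact packages and versions installed by install.sh are not spelled out here, so as an optional sanity check you can confirm afterwards that a Java runtime and a Python interpreter are available (the interpreter name and versions may differ on your system):

# Optional sanity check after install.sh; versions depend on your setup
java -version
python3 --version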

  3. Set the configuration

Have a look at the config.sh file, set the correct configuration for your site and run the configure.sh script; a hypothetical sketch of such settings is shown after the command below.

./configure.sh
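
The settings in config.sh are site-specific and not documented here; the following is only a hypothetical sketch of the kind of values such a file may expose, with illustrative variable names that are not taken from the repository:

# Hypothetical sketch only -- variable names are illustrative, not from config.sh
MASTER_HOST=master.example.org        # machine that will run the Yarn resourcemanager
HADOOP_CONF_DIR=$PWD/hadoop-config    # directory with the Hadoop/Yarn configuration
SPARK_LOCAL_DIRS=/tmp/spark           # scratch space for Spark executors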

Test Spark running on Yarn

  1. Adapt the configuration in hadoop-config/yarn-site.xml: set yarn.nodemanager.resource.cpu-vcores to at least 2 and yarn.nodemanager.resource.memory-mb to at least 2500 (MB), as in the snippet below.
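
For reference, the two properties mentioned above sit inside the existing configuration element of hadoop-config/yarn-site.xml and would look roughly like this (the values are the minimums given above; raise them to match your machine):

<!-- minimum resources the nodemanager offers to Yarn; adjust to your machine -->
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>
</property>
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>2500</value>
</property>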

  2. Go to the machine which should act as the master (aka resourcemanager in Yarn) and run

./run-resourcemanager.sh

  3. Go to the machine which should act as the worker (aka nodemanager in Yarn) and run

./run-nodemanager.sh
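
Once both daemons are running, you can check from the master that the worker has registered. Assuming the Hadoop binaries installed by install.sh are on your PATH, the standard Yarn CLI lists the known nodemanagers:

# Run on the master; should report the nodemanager as RUNNING
yarn node -list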

  4. Run the test script

./test-spark.sh
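
The test script is the intended way to verify the setup. If you want an additional, standalone check that Spark can submit work to Yarn, the Pi example shipped with every Spark distribution can be used; this is a sketch that assumes SPARK_HOME points at the Spark installation and HADOOP_CONF_DIR at the hadoop-config directory:

# Standalone check: submit Spark's bundled Pi example to the Yarn cluster
export HADOOP_CONF_DIR=$PWD/hadoop-config   # so spark-submit can find the resourcemanager
spark-submit \
    --master yarn \
    --deploy-mode client \
    $SPARK_HOME/examples/src/main/python/pi.py 100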
