jsteinberg4 / icarus

I Could Actually Really Use Support (ICARUS): A custom implementation of MapReduce

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

ICARUS build

I Could Actually Really Use Support (ICARUS): A MapReduce Implementation

Getting Started

Setup

Dependencies: make, a Unix platform which supports C++14

From the repository root, run the following:

# To compile the binaries and setup intermediate folders
$ make all

# To discard previous runs output & old binaries
$ make clean all

The following are some additional useful make targets:

# Delete the binaries in bin/*. Careful, this also deletes the intermediate files generated by mapper and reducer. Final map reduce outputs are left alone.
make clean

# Only delete the intermediate files generated for map and reduce. Don't delete any binaries. Use if your disk is getting cluttered.
make reset-inputs

This will generate executables to bin/.

bin/
  master   # A MapReduce master node. Run on an isolated VM.
  worker   # MapReduce workers. Run one per virtual machine.
  mapper   # The Map binary. A word counting algorithm is implemented by default. Do not run directly.
  reducer  # The Reduce binary. A summation for the word counting algorithm is implemented by default. Do not run directly.

Optional Development Dependencies: python3, virtualenv, compiledb
make does not generate a compile_commands.json to help the clang LSP interpret code files. I use compiledb to generate these.

# 1) Make a virtual environment
python3 -m pip install virtualenv # (you may already have this installed)
python3 -m virtualenv venv

# 2) Install dev dependencies
source venv/bin/activate
pip install -r requirements-compile.txt

# 3) Generate compile_commands.json
make clean # Force a complete rebuild
compiledb make all

Usage

All run instructions assume the repository root to be your working directory. Each of the usage messages will be printed by running bin/master or bin/worker with no arguments.

Running the Master:

Note, the final result will be written to the folder mapReduceOutputs/. The master prints the absolute file path to stdout upon completion.

$ bin/master
Usage:
    bin/master [port] [root directory] [input path] [# mappers]

    port: Specify which TCP port to listen at
    root directory: Specify an absolute path as the working directory. All other filepaths internally will use this as a base. It will almost always be the repository root.
    input path: Specify the task's input file. Assumed relative to the root.
    mappers: Specify the number of map tasks to create from the input file

Running the Workers:

$ bin/worker
Usage:
  bin/worker <master IP> <master port> <num workers> [(optional) failure chance]

  master ip: the IP address used by bin/master

  master port: the open port specified as <port> when running bin/master

  num workers: Specify the size of the worker pool. A value of 0 will run a single worker instance which exits the whole program on errors. Any value 1...N will maintain a pool of N child processes, each of which independently connects to the master.

  failure chance: If provided, enables simulated worker failures by killing child processes with probability 1 in <failure chance>. For example, a value of 5 means workers will be killed with probability 1 in 5 (20%). Num workers must be at least 1. If not provided, failure simulation is skipped.

Examples:

To run the Word Counter benchmark with the master node listening on port 80080 and 100 map partitions. Run 4 virtual nodes per bin/worker execution with a 20% chance of simulated failures.

# On one virtual machine (or terminal):
$ bin/master 80080 $(pwd) inputs/triple_large.txt 100

# On different virtual machines/terminals
bin/worker <master_ip> 80080 4 5

To run the word counter algorithm, using the complete works of Shakespeare partitioned into 100 map tasks, using 4 workers and no simulated failures.

# Run the master at host 12.34.56.78:54321
$ bin/master 54321  $(pwd) inputs/complete_shakespeare.txt 100

# Run the worker pool
$ bin/worker 12.34.56.78 54321 4
# Equivalent: bin/worker 12.34.56.78 54321 4 0

Citations

Data sources

Original Authors

Jeffrey Dean and Sanjay Ghemawat. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA (2004), pp. 137-150. https://research.google/pubs/mapreduce-simplified-data-processing-on-large-clusters/

About

I Could Actually Really Use Support (ICARUS): A custom implementation of MapReduce

License:MIT License


Languages

Language:C++ 95.9%Language:Makefile 4.1%