pcrama / scala-map-reduce

Toy implementation of map-reduce idea in Scala (coding assignment)

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

MapReduce scale model

This implementation is a scale model of the MapReduce idea, based on the following types and this example.

  • The input data is a List[Inp], each element is passed to a Mapper, producing an association map from MapKey to Val.
  • These (MapKey, Val) pairs will be passed to a shuffle function to produce a new key (ShuffleKey). All Val associated with the same ShuffleKey are gathered in a list for the next step.
  • Then these (ShuffleKey, List[Val]) pairs are passed to a reducer function, giving (ShuffleKey, Result) associations.

Restrictions

No real parallelism

The work is dispatched sequentially to instances instead of parallelized on different workers.

In-memory data

All data is kept in RAM, effectively limiting the data size.

Shufflers must accumulate all data

The reducer works on a complete list. If we look at the reduce function (foldLeft), one element at a time can be passed in without having to store the complete list.

Second attempt

!!! This isn’t complete… see ActorMapReduce.scala !!!

Rearranging the types allows to have the shuffler forward the data immediately to the reducer:

Mapper
Inp => Iterator[(MapKey, Val)], using an Iterator instead of a Map obviates the need to store the data.
Shuffler
(MapKey, Val) => ShuffleKey
Reducer
ShuffleKey => Result => Val => Result

The proposed reducer isn’t equivalent to (ShuffleKey, List[V]) => (ShuffleKey, Result): assuming the reducer has a Result type of Either[Int, String] (let’s say in the WordCount example that if the count is even, we want the result as a number (e.g. Left(2)) and if the count is odd, we want a String (e.g. Right("3")).

To account for such cases, the reducer (with a new type ShuffleKey => TempResult => Val => TempResult) must be combined with a post-processing step ((ShuffleKey, TempResult) => Result).

This way, old-style reducer functions can be expressed with TempResult = List[Val], the new-style reducer function accumulating XXX…TODO…XXX and the old-style reducer function in the post-processing step, but more restricted reducer functions (and the WordCount example is one) can work with a trivial post-processing step (the identity function), but without building up the complete List[Val].

About

Toy implementation of map-reduce idea in Scala (coding assignment)


Languages

Language:Scala 100.0%