datasalt / pangool

Tuple MapReduce for Hadoop: Hadoop API made easy

Home Page: http://datasalt.github.io/pangool/

Pangool Streaming?

epalace opened this issue · comments

Is Pangool able to work with Hadoop Streaming?

The current parameters of Hadoop Streaming are:

Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-io <identifier> Optional.

The reduce script receives all the data ungrouped, so the script is responsible for detecting changes in the key and building the groups manually.
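To illustrate the manual grouping described above, here is a minimal sketch of a streaming reducer in Python. It assumes tab-separated `key<TAB>value` lines that arrive sorted by key (the usual Hadoop Streaming text convention); the function names and the count aggregation are hypothetical, chosen only for the example.

```python
from itertools import groupby


def parse(lines, sep="\t"):
    """Yield (key, value) pairs from separator-delimited input lines."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition(sep)
        yield key, value


def reduce_groups(lines):
    """Rebuild groups by detecting key changes in already-sorted input."""
    for key, records in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, [value for _, value in records]


# In-memory sample standing in for the reducer's standard input.
sample = ["a\t1\n", "a\t2\n", "b\t3\n"]

for key, values in reduce_groups(sample):
    # Hypothetical aggregation: number of values seen for each key.
    print(f"{key}\t{len(values)}")
```

In a real streaming job the script would pass `sys.stdin` to `reduce_groups` instead of `sample`.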

It seems we could configure the streaming job to accept group-by and sort-by options. The reduce and combiner scripts would then be called once per group. That could be inefficient, as the startup and shutdown times of the scripts can be significant. On the other hand, it may be useful.
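A rough in-memory sketch of that idea, not tied to any existing Pangool or Hadoop Streaming option: records are sorted by the sort-by fields, grouped by the group-by fields, and a reducer is called once per group. The `run_reduce` helper, the field positions and the sample records are all hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce(records, group_by, sort_by, reducer):
    """Sort by the sort_by fields, then call reducer once per group_by group."""
    # The sort order must start with the group-by fields,
    # otherwise records of one group would not be contiguous.
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("sort_by must start with the group_by fields")
    ordered = sorted(records, key=itemgetter(*sort_by))
    group_key = itemgetter(*group_by)
    return [reducer(key, list(rows)) for key, rows in groupby(ordered, key=group_key)]


# Hypothetical records: (user, timestamp, event)
records = [("bob", 2, "b"), ("alice", 3, "c"), ("alice", 1, "a"), ("bob", 1, "x")]

# Group by user (field 0), sort by user and then timestamp (fields 0 and 1).
result = run_reduce(
    records,
    group_by=[0],
    sort_by=[0, 1],
    reducer=lambda user, rows: (user, [event for _, _, event in rows]),
)
print(result)
```

Here the reducer is a plain function. With external scripts, each call would mean starting a process, which is where the startup cost mentioned above would come from.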

We could also allow an intermediate schema to be provided, so that text is translated to Tuples after the mapper. That allows:

  • Smaller serialization size: primitive types (int, double, etc.) are serialized as bytes, not strings
  • Improved sorting: sorting by numbers does not need padding
  • Sorting by fields in a different order than they have in the input record, without rewriting the record in the mapper
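The three points above can be illustrated with a small Python sketch. It uses the standard `struct` module only to show the size difference between text and a fixed-width binary integer; it is not Pangool's actual serialization format, and the sample schema `(lang, count, score)` is hypothetical.

```python
import struct
from operator import itemgetter

# 1. Serialization size: the same integer as text and as a 32-bit binary value.
n = 1234567890
as_text = str(n).encode("utf-8")   # one byte per digit
as_binary = struct.pack(">i", n)   # fixed-width big-endian int32
print(len(as_text), len(as_binary))

# 2. Sorting: text keys sort lexicographically unless they are zero-padded,
#    while typed numeric keys sort correctly as they are.
text_keys = ["9", "10", "100"]
lexicographic = sorted(text_keys)
padded = sorted(k.zfill(3) for k in text_keys)
numeric = sorted(int(k) for k in text_keys)
print(lexicographic, padded, numeric)

# 3. Field order: with typed fields, records can be sorted by (count, lang)
#    even though the record layout is (lang, count, score).
records = [("es", 3, 0.5), ("en", 1, 0.9), ("en", 2, 0.1)]
by_count_then_lang = sorted(records, key=itemgetter(1, 0))
print(by_count_then_lang)
```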

Sorry, I don't get what "Hadoop Streaming" has to do with Pangool.

In my mind, the reasons for using Hadoop Streaming are orthogonal to those for using Pangool or Java MapRed.

Unless you can elaborate on why this is useful, I don't see it. There are already very good APIs on top of Hadoop Streaming, such as the Python MapRed APIs.

It could make sense at some point to build some kind of "Hadoop Streaming"
on top of Pangool, making use of its power for managing schemas. It would
be more efficient than the default Hadoop Streaming in the sense that the
intermediate serialization would be much more compact. Also, results could
optionally be written to TupleFiles easily.

Anyway, I don't see that as a big priority, so I would close the ticket.

(In reply to Pere Ferrera's comment above, 2013/10/1.)

Iván de Prado
CEO & Co-founder
www.datasalt.com