Netflix / aegisthus

A Bulk Data Pipeline out of Cassandra

Using the new SSTable format

danchia opened this issue

@danielbwatson given that we're trying to deprecate the JSON output format, I wonder what's the best way for people to write downstream jobs that want to process the data row by row?

It seems to me that there are two options:

(1) Run the same Mapper and Reducer used in Aegisthus, but use a ChainReducer so that we can add a custom map stage after the reduce to do the application-specific processing.

(2) The SSTables output by Aegisthus are actually special, since it's guaranteed that rows are non-overlapping and the columns are sorted in the right order. I'm wondering if we could expose this to a mapper in some smart way (and avoid the reduce step).

What do you think?
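
As a rough illustration of option (2), here is a minimal sketch in plain Python. It is not Hadoop or Aegisthus code. It assumes each output file can be read as a list of (row key, columns) pairs, and that "sorted in the right order" means sorted by column name; the real ordering depends on the column family's comparator. The names `check_invariants`, `map_only`, and `count_columns` are made up for this example.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

Column = Tuple[str, str]        # (column name, value); a simplifying assumption
Row = Tuple[str, List[Column]]  # (row key, columns)
T = TypeVar("T")


def check_invariants(splits: Iterable[Iterable[Row]]) -> None:
    """Verify the two guarantees the map-only approach relies on."""
    seen = set()
    for split in splits:
        for key, columns in split:
            # Non-overlapping rows: a key shows up exactly once overall.
            if key in seen:
                raise ValueError(f"row {key!r} appears more than once")
            seen.add(key)
            # Sorted columns (assumed here: lexicographic by name).
            names = [name for name, _ in columns]
            if names != sorted(names):
                raise ValueError(f"columns of row {key!r} are not sorted")


def map_only(
    splits: Iterable[Iterable[Row]],
    process_row: Callable[[str, List[Column]], T],
) -> Iterator[T]:
    """Apply process_row to each row, one split at a time.

    Because every row is already complete within its split, no shuffle
    or reduce step is needed to bring a row's columns together.
    """
    for split in splits:
        for key, columns in split:
            yield process_row(key, columns)


def count_columns(key: str, columns: List[Column]) -> Tuple[str, int]:
    """Hypothetical application-specific processing."""
    return key, len(columns)


# Two hypothetical output files, each holding disjoint, complete rows.
splits = [
    [("user1", [("age", "30"), ("name", "Ann")])],
    [("user2", [("age", "41"), ("name", "Bob")])],
]

check_invariants(splits)
results = list(map_only(splits, count_columns))
print(results)
```

With raw Cassandra SSTables the same key can appear in several files, so `check_invariants` would fail and a merge (the reduce step) would be needed first. That is the difference option (2) is trying to exploit.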

I wanted to deprecate the old JSON format when it was created by the reducer, but now that it is actually an output format I don't mind supporting it. We will get rid of the JsonInputFormat. It will be a lot easier to support if we always consume SSTables.

As for downstream jobs, I think both of your ideas are good. I could see use cases for both of them.

I'm going to close this issue and add a reference to it in the Enhancement section of the README.