Using the new SSTable format
danchia opened this issue
@danielbwatson given that we're trying to deprecate the JSON output format, I wonder what the best way is for people to write downstream jobs that want to process the data row by row?
It seems to me that there are two options:
(1) Run the same Mapper and Reducer used in Aegisthus, but use a ChainReducer so that we can add a custom map stage afterwards to do the application-specific processing.
(2) The SSTables output by Aegisthus are actually special, since it's guaranteed that rows are non-overlapping and the columns are sorted in the right order. I'm wondering if we could expose this to a mapper in some smart way (and avoid the reduce step).
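To make option (2) a bit more concrete, here is a rough, Hadoop-free sketch of why the ordering guarantee matters. If the columns arrive sorted and each row's columns are contiguous, a single streaming pass can hand complete rows to application code with no shuffle or reduce. Everything here (`SortedRowGrouper`, `Column`, `forEachRow`) is a hypothetical name made up for illustration and is not part of the Aegisthus API. It also assumes the SSTable has already been decoded into flat (row key, column name, value) records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

// Hypothetical illustration only; not an Aegisthus class.
final class SortedRowGrouper {

    // Hypothetical flattened record: one column belonging to one row.
    static final class Column {
        final String rowKey;
        final String name;
        final String value;

        Column(String rowKey, String name, String value) {
            this.rowKey = rowKey;
            this.name = name;
            this.value = value;
        }
    }

    // Streams over columns that are assumed to be sorted so that all columns
    // of a row are adjacent, and calls the handler once per complete row.
    static void forEachRow(Iterable<Column> sortedColumns,
                           BiConsumer<String, List<Column>> handler) {
        String currentKey = null;
        List<Column> buffer = new ArrayList<>();
        for (Column c : sortedColumns) {
            // A change of row key means the previous row is complete.
            if (currentKey != null && !currentKey.equals(c.rowKey)) {
                handler.accept(currentKey, buffer);
                buffer = new ArrayList<>();
            }
            currentKey = c.rowKey;
            buffer.add(c);
        }
        // Emit the final row, if there was any input at all.
        if (currentKey != null) {
            handler.accept(currentKey, buffer);
        }
    }

    public static void main(String[] args) {
        List<Column> input = Arrays.asList(
                new Column("row1", "a", "1"),
                new Column("row1", "b", "2"),
                new Column("row2", "a", "3"));
        forEachRow(input, (key, cols) ->
                System.out.println(key + " has " + cols.size() + " column(s)"));
    }
}
```

Only one row is buffered at a time, which is the part that depends on the non-overlapping guarantee. If the same row key could show up in two different files, this would emit it twice and a reduce step would be needed again.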
What do you think?
I wanted to deprecate the old JSON format when it was created by the reducer, but now that it is actually an output format I don't mind supporting it. We will get rid of the JsonInputFormat. It will be a lot easier to support if we always consume SSTables.
As far as downstream jobs, I think both of your ideas are good. I could see use cases for both of them.
I'm going to close this issue and add a reference to it in the Enhancement section of the README.