Using the new SSTable format

Question

danchia opened this issue 10 years ago

@danielbwatson given that we're trying to deprecate the JSON output format, what's the best way for people to write downstream jobs that process the data row by row?

It seems to me that there are two options:

(1) Run the same Mapper and Reducer used in Aegisthus, but use a ChainReducer so that we can add a custom map stage after the reducer for the application-specific processing.

(2) The SSTables output by Aegisthus are actually special, since it's guaranteed that rows are non-overlapping and the columns are sorted in the right order. I'm wondering if we could expose this to a mapper in some smart way (and avoid the reduce step).
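To make option (2) a bit more concrete, here is a minimal plain-Java sketch of why the reduce step might be avoidable. It assumes the input is already sorted by row key with each row's columns contiguous, which is the guarantee described above. The `Cell` class, `forEachRow`, and `columnCounts` are hypothetical names for illustration only; they are not Aegisthus or Hadoop APIs, and a real mapper would read rows from an SSTable input format rather than an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiConsumer;

final class SortedRowScan {
    /** One cell of a row. Hypothetical type, not part of Aegisthus. */
    static final class Cell {
        final String rowKey;
        final String column;
        final String value;

        Cell(String rowKey, String column, String value) {
            this.rowKey = rowKey;
            this.column = column;
            this.value = value;
        }
    }

    /**
     * Streams over cells that are ASSUMED to be sorted by row key and
     * hands each complete row to the handler. Because rows do not overlap,
     * a change of key means the previous row is finished, so no shuffle or
     * reduce is needed to regroup it.
     */
    static void forEachRow(Iterable<Cell> sortedCells,
                           BiConsumer<String, List<Cell>> handler) {
        String currentKey = null;
        List<Cell> currentRow = new ArrayList<>();
        for (Cell cell : sortedCells) {
            if (currentKey != null && !currentKey.equals(cell.rowKey)) {
                handler.accept(currentKey, currentRow);
                currentRow = new ArrayList<>();
            }
            currentKey = cell.rowKey;
            currentRow.add(cell);
        }
        if (currentKey != null) {
            handler.accept(currentKey, currentRow); // flush the last row
        }
    }

    /** Example application-specific step: count the columns in each row. */
    static List<String> columnCounts(List<Cell> sortedCells) {
        List<String> out = new ArrayList<>();
        forEachRow(sortedCells, (key, row) -> out.add(key + ":" + row.size()));
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
                new Cell("user1", "a", "1"),
                new Cell("user1", "b", "2"),
                new Cell("user2", "a", "3"));
        System.out.println(columnCounts(cells)); // prints [user1:2, user2:1]
    }
}
```

The sketch only holds if the sortedness guarantee holds. If the same row key could show up in two different places in the input, it would be handed to the handler twice, and a reduce step would be needed again.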

What do you think?

Daniel Watson · Answer 1 · Wed Oct 22 2014 02:07:47 GMT+0800 (China Standard Time)

I wanted to deprecate the old JSON format when it was created by the reducer, but now that it is actually an output format I don't mind supporting it. We will get rid of the JsonInputFormat. It will be a lot easier to support if we always consume SSTables.

As for downstream jobs, I think both of your ideas are good. I could see use cases for both of them.

Daniel Watson · Answer 2 · Wed Jan 06 2016 08:24:46 GMT+0800 (China Standard Time)

I'm going to close this issue and add a reference to it in the Enhancement section of the README.