4mc codecs should implement SplittableCompressionCodec

Question

pradeepg26 opened this issue 7 years ago

The implementation of the Codec and InputFormat seems to follow the pattern from Elephantbird. However, this isn't a good pattern in my opinion. In the spirit of Hadoop, compression and file format should be decoupled. We should be able to change compression formats without needing to change the way those files are read.

Currently, if we change the compression from, e.g., gz to 4mc, we need to change the InputFormat that is used to read the files, and we can't change the compression again without changing the InputFormat again. To handle this gracefully, we would need to code defensively and dynamically pick the InputFormat based on what files are in the input location. I don't think this strategy would work if a directory contains files that have been compressed with different formats.

To support this type of flexibility, the 4mc codecs should implement the SplittableCompressionCodec interface. This lets existing input formats handle the new compression formats gracefully.
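To illustrate the decoupling argued for above, here is a minimal, self-contained sketch of the decision an input format can make per file: look up the codec from the file name, then check whether that codec is splittable. The types `Codec`, `SplittableCodec`, and `CodecRegistry` are hypothetical stand-ins written for this example; they are not the Hadoop classes, and treating ".4mc" as splittable reflects the proposal in this issue, not the current 4mc code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in types for illustration only. The real interfaces live in
// org.apache.hadoop.io.compress; these are simplified assumptions.
interface Codec {
    String defaultExtension();
}

// Marker for codecs whose streams can be read starting mid-file.
interface SplittableCodec extends Codec {
}

// Hypothetical registry that maps a file name to a codec by extension.
final class CodecRegistry {
    private final Map<String, Codec> byExtension = new LinkedHashMap<>();

    void register(Codec codec) {
        byExtension.put(codec.defaultExtension(), codec);
    }

    // Returns the codec whose extension matches, or null for plain files.
    Codec codecFor(String fileName) {
        for (Map.Entry<String, Codec> e : byExtension.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // A generic input format only needs this check; it never has to
    // know which compression format it is dealing with.
    boolean isSplittable(String fileName) {
        Codec codec = codecFor(fileName);
        return codec == null || codec instanceof SplittableCodec;
    }
}
```

With a check like this made per file, a directory mixing several compression formats needs no special handling in the input format: each file resolves to its own codec.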

Carlo Medas · Answer 1 · Tue Apr 18 2017 02:45:48 GMT+0800 (China Standard Time)

Hello there.

Is this a new interface that comes with a newer Hadoop version, or something like that?

Pradeep Gollakota · Answer 2 · Wed Apr 19 2017 10:46:14 GMT+0800 (China Standard Time)

Nope, it's been around for a while. Take a look at BZip2Codec for an example of how it's intended to be used.

Carlo Medas · Answer 3 · Wed Apr 19 2017 14:48:01 GMT+0800 (China Standard Time)

You say you would like to change the compression algorithm inside 4mc, but that's currently not supported.
As a matter of fact, to provide both lz4 and zstd I created two formats, 4mc and 4mz, one dedicated to each.
The good news is that a splittable compression format is now being discussed in zstandard itself, so it should be available upstream very soon.

Pradeep Gollakota · Answer 4 · Thu Apr 20 2017 00:17:45 GMT+0800 (China Standard Time)

Great to hear that zstd is working on a splittable compression format. I'll probably just wait for that.

In the meantime, I'm not proposing to change the compression algorithm inside 4mc, just a refactor of the code to move where the splits are adjusted. Currently the splits are adjusted in the getSplits method of FourMcInputFormat and FourMzInputFormat. If we adjusted the split boundaries inside the SplitCompressionInputStream instead, we wouldn't need the specialized input formats.
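As a rough sketch of what adjusting boundaries inside the stream could look like: given the byte range a generic input format hands over, the stream snaps both ends forward to the next compressed block start, so each block is read by exactly one split. `SplitAligner` and its methods are hypothetical names for this example, and the sorted array of block offsets is an assumption; how 4mc actually stores its block index is not described in this thread.

```java
import java.util.Arrays;

// Hypothetical helper, not part of 4mc or Hadoop.
final class SplitAligner {
    private final long[] blockOffsets; // sorted start offsets of compressed blocks
    private final long dataEnd;        // offset just past the last block

    SplitAligner(long[] blockOffsets, long dataEnd) {
        this.blockOffsets = blockOffsets.clone();
        this.dataEnd = dataEnd;
    }

    // First block start at or after pos, or dataEnd if there is none.
    long snapForward(long pos) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // binarySearch encodes the insertion point this way
        }
        return i < blockOffsets.length ? blockOffsets[i] : dataEnd;
    }

    // Returns {adjustedStart, adjustedEnd} for a requested range [start, end).
    // A split owns every block whose start offset falls inside its range.
    long[] adjust(long start, long end) {
        return new long[] { snapForward(start), snapForward(end) };
    }
}
```

Because both ends snap in the same direction, adjacent requested ranges produce adjacent adjusted ranges with no gaps or overlaps. In Hadoop, a SplitCompressionInputStream reports such values through getAdjustedStart() and getAdjustedEnd(), which is how a generic record reader learns the real boundaries.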

I'm working on a patch to implement this; it should be out soon.

Carlo Medas · Answer 5 · Thu Apr 20 2017 14:30:36 GMT+0800 (China Standard Time)

OK, perfect. Let me know.