fingltd / 4mc

4mc - splittable lz4 and zstd in hadoop/spark/flink

4mc codecs should implement SplittableCompressionCodec

pradeepg26 opened this issue · comments

The implementation of Codec and InputFormat seems to follow the pattern from Elephantbird. However, this isn't a good pattern in my opinion. In the spirit of Hadoop, compression and file format should be decoupled. We should be able to change compression formats without needing to change the way those files are read.

Currently, if we change the compression from e.g. gz to 4mc, we need to change the InputFormat that is used to read the files, and we wouldn't be able to change the compression again. To do this gracefully, we would need to code defensively and dynamically change the InputFormats based on what files are in the input location. I don't think this strategy would work if you have a directory that has files that have been compressed with different formats.

In order to support this type of flexibility, the 4mc codecs should implement the SplittableCompressionCodec interface. This provides existing formats the ability to gracefully handle the new compression formats.
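To illustrate the decoupling, here is a minimal, self-contained sketch of the pattern. It does not use the Hadoop classes themselves: `CodecLookup`, `Codec`, `SplittableCodec` and the registered suffixes are hypothetical stand-ins. The idea it mirrors is that the reader picks a codec from the file suffix and only asks whether that codec is splittable, so one input format can handle a directory with mixed compression.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins, not the Hadoop API: a codec is looked up by file
// suffix, and a marker interface says whether it can be split.
class CodecLookup {
    interface Codec { String extension(); }
    interface SplittableCodec extends Codec { }

    static final class PlainCodec implements Codec {
        private final String ext;
        PlainCodec(String ext) { this.ext = ext; }
        public String extension() { return ext; }
    }

    static final class BlockCodec implements SplittableCodec {
        private final String ext;
        BlockCodec(String ext) { this.ext = ext; }
        public String extension() { return ext; }
    }

    private final Map<String, Codec> bySuffix = new LinkedHashMap<>();

    void register(Codec codec) {
        bySuffix.put(codec.extension(), codec);
    }

    // Returns the codec whose suffix matches the path, or null if none does.
    Codec forPath(String path) {
        for (Map.Entry<String, Codec> e : bySuffix.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    // Uncompressed files are splittable; compressed files are splittable
    // only when their codec carries the marker interface.
    boolean isSplittable(String path) {
        Codec codec = forPath(path);
        return codec == null || codec instanceof SplittableCodec;
    }
}
```

With this shape, switching a dataset from gz to 4mc only means registering another codec; the code that reads the files stays the same.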

Hello there.

Is this a new interface coming with a new hadoop version or something like that?

Nope, it's been around for a while. Take a look at BZip2Codec for an example on how it's intended to be used.

You say you would like to change the compression algo inside 4mc, but that's currently not supported.
As a matter of fact, to provide both lz4 and zstd I created two formats, 4mc and 4mz, one dedicated to each.
The good news is that a splittable compression format is now being discussed in zstandard itself, so it should be available at the source very soon.

Great to hear that zstd is working on a splittable compression format. I'll probably just wait for that.

In the meantime, I'm not proposing to change the compression algo inside 4mc, just a refactor of the code to move where the splits are adjusted. Currently the splits are adjusted in the getSplits method of FourMcInputFormat and FourMzInputFormat. If we adjusted the split boundaries inside the SplitCompressionInputStream instead, we wouldn't need the specialized input formats.

I'm working on a patch to implement this, should be out soon.
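For reference, a rough sketch of the boundary adjustment being discussed. This is not the 4mc code and not the eventual patch: `SplitAlignment`, `nextBlockStart` and `adjust` are hypothetical names, and the sketch assumes the start offsets of the compressed blocks are already known as a sorted array. In Hadoop, the stream would report the resulting values through `getAdjustedStart()` and `getAdjustedEnd()` on `SplitCompressionInputStream`.

```java
import java.util.Arrays;

// Hypothetical helper: snaps a raw byte-range split to compressed block
// boundaries. A block belongs to the split in which its start offset falls.
class SplitAlignment {

    // First block offset >= pos, or fileEnd when no such block exists.
    // blockOffsets must be sorted in ascending order.
    static long nextBlockStart(long[] blockOffsets, long pos, long fileEnd) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // convert "not found" result to the insertion point
        }
        return i < blockOffsets.length ? blockOffsets[i] : fileEnd;
    }

    // Returns {adjustedStart, adjustedEnd} for the raw split [start, end).
    static long[] adjust(long[] blockOffsets, long start, long end, long fileEnd) {
        long adjustedStart = nextBlockStart(blockOffsets, start, fileEnd);
        long adjustedEnd = nextBlockStart(blockOffsets, end, fileEnd);
        return new long[] { adjustedStart, adjustedEnd };
    }
}
```

Because both ends are moved forward with the same rule, the adjusted end of one split equals the adjusted start of the next, so every block is read exactly once whatever split size the input format picked.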

OK perfect let me know.