Compression/decompression of multibyte characters fails
iconara opened this issue · comments
Using lz4-ruby 0.3.1:
```ruby
input = 'Ķ' * 100
output = LZ4.uncompress(LZ4.compress(input))
output.force_encoding(Encoding::UTF_8)
input.should == output # false
```
The exact characters don't matter; it just seems to matter that they are multibyte. Is there a `String#size` call somewhere that should have been a `String#bytesize`?
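For context on why this distinction matters: `String#size` counts characters, while `String#bytesize` counts bytes, and the two diverge for multibyte UTF-8 input. A minimal sketch of the discrepancy (plain Ruby, no lz4-ruby required):

```ruby
# 'Ķ' (U+0136) encodes to two bytes in UTF-8, so character count
# and byte count diverge for this string.
input = 'Ķ' * 100

puts input.size     # 100 characters
puts input.bytesize # 200 bytes
```

Passing the character count where the byte length is expected would truncate a buffer like this to half its real size, which is consistent with the failed round-trip above.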
So I see that a few of the last commits actually change calls from `#size` to `#bytesize`, so is this fixed? Will there be a release soon?
> So I see that a few of the last commits actually change calls from `#size` to `#bytesize`, so is this fixed? Will there be a release soon?
Yes, I will soon release a version that fixes this problem.
lz4-ruby 0.3.2 is released.
iconara, I realized that the issue I was having is actually the same multi-byte problem you were having; I mischaracterized it in the bug I originally filed. I now see that it comes down to multi-byte strings, because after I force-encode my string as UTF-8, the uncompressed size differs from that of the original input string.
Just tested the 0.3.2 fix out and it's working for me. Thanks, komiya-atsushi!