kspalaiologos / bzip3

A better and stronger spiritual successor to BZip2.

data integrity failure on truncated stream

kilobyte opened this issue · comments

Unlike other Unix compressors, bzip3 fails to notice data truncation if the compressed stream ends at a block boundary. There's no way to distinguish such a truncation, leading to silent data loss.

Furthermore, while compressed block boundaries land at effectively random offsets, the tool's timing pattern makes such truncations likely to occur naturally, with no malice involved: the library writes a series of blocks, spends a long while processing the next series, and only then resumes output. Thus any mishap in that window (a crash, power loss, a network failure, an ejected pendrive, a backup snapshot, OOM, a timeout, etc.) will very likely leave a file that appears correctly terminated. This is compounded by the tool forcing a flush at every block boundary: normally beneficial for cache locality, but here a block tail left sitting in stdio buffers would at least have made the error noisy.
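To illustrate why a cut at a block boundary is invisible, here is a minimal sketch that walks a bzip3-style container. The layout is an assumption for illustration (a 5-byte `BZ3v1` magic, a u32 LE max-block-size field, then blocks of `[u32 compressed length][u32 original length][payload]`); the synthetic payload bytes are placeholders, not real compressed data:

```python
import struct

def scan_blocks(data: bytes) -> int:
    """Walk an assumed bzip3-style container and return the number of
    whole blocks found; raise only if the data ends MID-block."""
    if data[:5] != b"BZ3v1":
        raise ValueError("bad magic")
    off, blocks = 9, 0  # skip magic (5) + max-block-size field (4)
    while off < len(data):
        if len(data) - off < 8:
            raise ValueError("truncated mid-header")  # detectable
        clen, _olen = struct.unpack_from("<II", data, off)
        off += 8
        if len(data) - off < clen:
            raise ValueError("truncated mid-block")   # detectable
        off += clen
        blocks += 1
    return blocks  # a cut exactly at a block boundary just looks like EOF

# Synthetic two-block stream (placeholder payloads):
full = b"BZ3v1" + struct.pack("<I", 1 << 20)
full += struct.pack("<II", 4, 4) + b"AAAA"
full += struct.pack("<II", 3, 3) + b"BBB"

print(scan_blocks(full))       # → 2
truncated = full[: 9 + 8 + 4]  # drop the second block entirely
print(scan_blocks(truncated))  # → 1, parses "cleanly": loss is silent
```

A cut inside a header or payload raises; a cut between blocks returns normally, which is exactly the failure mode described above.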

Alas, while it would be easy to add such a marker (a block header with length = 0, or a magic value above 511 MiB), any such change would alter the bytestream format, and thus break compatibility with current versions of the library.
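For concreteness, a sketch of how the proposed marker could work: the writer appends a hypothetical length-0 block header as an end-of-stream terminator, and a reader that reaches end of input without seeing it can report truncation. The container layout and the `EOS` encoding are assumptions, not the library's actual format:

```python
import struct

MAGIC = b"BZ3v1"
EOS = struct.pack("<II", 0, 0)  # hypothetical terminator: length-0 header

def write_stream(payloads, max_block=1 << 20):
    """Serialize placeholder payloads with an explicit end marker."""
    out = MAGIC + struct.pack("<I", max_block)
    for p in payloads:
        out += struct.pack("<II", len(p), len(p)) + p
    return out + EOS

def read_stream(data):
    """Return the payloads; raise if the stream ends without the marker."""
    if data[:5] != MAGIC:
        raise ValueError("bad magic")
    off, payloads = 9, []
    while True:
        header = data[off:off + 8]
        if len(header) < 8:
            # EOF before the terminator: truncation is now detectable,
            # even when the cut falls exactly on a block boundary.
            raise ValueError("truncated: no end-of-stream marker")
        clen, _olen = struct.unpack("<II", header)
        off += 8
        if clen == 0:  # terminator reached: stream is complete
            return payloads
        payloads.append(data[off:off + clen])
        off += clen

s = write_stream([b"AAAA", b"BBB"])
print(len(read_stream(s)))  # → 2
try:
    read_stream(s[:-8])     # cut at a block boundary: now detected
except ValueError as e:
    print(e)                # truncated: no end-of-stream marker
```

This is precisely the kind of change that would break existing decoders, which is why the issue treats it as desirable but not deployable.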

One way to prevent this situation would be to test the file immediately after compression with bzip3 -t, checking that the stream decompresses to the expected size.

There's no record of the decompressed size anywhere in the stream; in fact, there's not even a way to know it beforehand when the input comes from a pipe or /proc.