Manually set "incompressible data" threshold
lr4d opened this issue
When working with tarballs of media files and PDFs, lrzip sometimes gives me a fast 15% compression by compressing just 1 of 9 blocks, e.g. this truncated output of lrzip -i -vv ...:
Block Comp Percent Size
1 none 100.0% 10485760 / 10485760 Offset: 4589643704 Head: 10485799
2 none 100.0% 10485760 / 10485760 Offset: 4600129477 Head: 20971572
3 lzma 77.7% 8152171 / 10485760 Offset: 4610615250 Head: 29123756
4 none 100.0% 10485760 / 10485760 Offset: 4618767434 Head: 39609529
5 none 100.0% 10485760 / 10485760 Offset: 4629253207 Head: 50095302
6 none 100.0% 10485760 / 10485760 Offset: 4639738980 Head: 60581075
7 none 100.0% 10485760 / 10485760 Offset: 4650224753 Head: 71066848
8 none 100.0% 10485760 / 10485760 Offset: 4660710526 Head: 81567392
9 none 100.0% 4860321 / 4860321 Offset: 4671211070 Head: 0
Other times, it takes a lot longer and struggles a lot more to compress:
Block Comp Percent Size
1 none 100.0% 49603243 / 49603243 Offset: 5173302056 Head: 49603282
2 lzma 99.2% 49182901 / 49603243 Offset: 5222905312 Head: 98786196
3 lzma 99.3% 49236783 / 49603243 Offset: 5272088226 Head: 148022992
4 lzma 99.5% 49339373 / 49603243 Offset: 5321325022 Head: 197362378
5 lzma 99.0% 49131243 / 49603243 Offset: 5370664408 Head: 246493634
6 lzma 99.3% 49263853 / 49603243 Offset: 5419795664 Head: 295757500
7 lzma 98.5% 48863058 / 49603243 Offset: 5469059530 Head: 344620571
8 lzma 98.7% 48981972 / 49603243 Offset: 5517922601 Head: 393602556
9 lzma 99.4% 49286839 / 49603243 Offset: 5566904586 Head: 442889408
10 lzma 99.4% 49310169 / 49603243 Offset: 5616191438 Head: 492199590
11 lzma 99.4% 49295202 / 49603243 Offset: 5665501620 Head: 541494805
12 lzma 99.2% 49216341 / 49603243 Offset: 5714796835 Head: 590711159
13 lzma 99.4% 49310508 / 49603243 Offset: 5764013189 Head: 640021680
14 lzma 99.4% 49310012 / 49603243 Offset: 5813323710 Head: 689331705
15 lzma 99.3% 49260783 / 49603243 Offset: 5862633735 Head: 738592501
16 lzma 99.4% 49304015 / 49603243 Offset: 5911894531 Head: 787896529
17 lzma 99.3% 49272993 / 49603243 Offset: 5961198559 Head: 837169535
18 lzma 99.4% 49321970 / 49603243 Offset: 6010471565 Head: 886491518
19 lzma 99.4% 49315822 / 49603243 Offset: 6059793548 Head: 935807353
20 lzma 99.4% 49288914 / 49603243 Offset: 6109109383 Head: 985273666
21 lzma 90.2% 18692941 / 20716811 Offset: 6158575696 Head: 0
In the latter case, I'd much rather lrzip not compress any blocks when the lz4 test predicts that a block would stay at 95% or more of its original size, so that it runs faster.
Using lrzip --level=1 doesn't seem to make any difference in this regard.
Is it feasible to make the threshold that lrzip uses to decide when data is incompressible a CLI parameter, so I can set it manually?
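To put a number on how little those lzma passes buy, here is a rough sketch that parses block rows in the format of the listings above and reports the overall stored/original percentage, plus which compressed blocks ended up at or above a chosen threshold. The helper names (`parse_blocks`, `summarize`) and the 95% default are my own for illustration and are not part of lrzip; the sample rows are copied from the second listing.

```python
import re

# Matches one block row of the `lrzip -i -vv` listing quoted above, e.g.
# "3 lzma 77.7% 8152171 / 10485760 Offset: 4610615250 Head: 29123756"
ROW = re.compile(r"^\s*(\d+)\s+(\w+)\s+([\d.]+)%\s+(\d+)\s*/\s*(\d+)")


def parse_blocks(listing):
    """Return (block_no, method, stored_bytes, original_bytes) per row."""
    blocks = []
    for line in listing.splitlines():
        m = ROW.match(line)
        if m:  # silently skip header lines and anything else that is not a row
            blocks.append((int(m.group(1)), m.group(2),
                           int(m.group(4)), int(m.group(5))))
    return blocks


def summarize(blocks, threshold_pct=95.0):
    """Overall stored/original percentage, and the compressed blocks whose
    own percentage is at or above threshold_pct (hypothetical cutoff)."""
    stored = sum(s for _, _, s, _ in blocks)
    original = sum(o for _, _, _, o in blocks)
    barely_compressed = [
        n for n, method, s, o in blocks
        if method != "none" and 100.0 * s / o >= threshold_pct
    ]
    return 100.0 * stored / original, barely_compressed


# Three rows copied from the second listing above.
SAMPLE = """\
1 none 100.0% 49603243 / 49603243 Offset: 5173302056 Head: 49603282
2 lzma 99.2% 49182901 / 49603243 Offset: 5222905312 Head: 98786196
21 lzma 90.2% 18692941 / 20716811 Offset: 6158575696 Head: 0
"""

if __name__ == "__main__":
    overall, barely = summarize(parse_blocks(SAMPLE))
    print(f"overall: {overall:.1f}%  blocks at or above threshold: {barely}")
```

Feeding it the full listing instead of the three sample rows gives the totals for the whole archive.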
The -T | --threshold option was designed to take an optional argument to limit threshold testing to N%. Somehow that feature did not make it into lrzip. -T alone (an argument is not tested for) disables threshold testing entirely; it does not limit it. The feature is implemented in lrzip-next. -T95, for example, would test the lz4 compression against 95% and, if the result is above that, would not compress the block. It is good practice to use it in general: the time saved is worth more than the compression benefit. See this wiki article on it.
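The idea behind a percentage threshold can be sketched in a few lines. This is only an illustration, not the lrzip or lrzip-next code: it uses zlib at level 1 from the Python standard library as a stand-in for the lz4 trial, and the function name `should_compress`, the `sample_size` parameter, and the choice to test only a prefix of the block are all assumptions made for the example.

```python
import os
import zlib


def should_compress(block, threshold_pct=95.0, sample_size=1 << 20):
    """Decide whether a block is worth handing to the slow back end.

    A fast trial compression (zlib level 1 here, standing in for lz4) is run
    on a prefix of the block. If the trial output is above threshold_pct of
    the sample size, the block is treated as incompressible and stored as-is.
    """
    sample = block[:sample_size]
    if not sample:
        return False  # nothing to compress
    trial = zlib.compress(sample, 1)
    ratio_pct = 100.0 * len(trial) / len(sample)
    return ratio_pct <= threshold_pct


if __name__ == "__main__":
    # Highly repetitive data passes the test; random bytes do not.
    print(should_compress(b"a" * 100_000))
    print(should_compress(os.urandom(100_000)))
```

With a cutoff of 95, a block whose trial result is 99% of its input would be stored uncompressed, which is the behaviour described for -T95 above.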
I removed the optional percentage a while ago. You're the first person to request it be implemented. As it works now, however, the test aborts far too early to have any idea what the percentage will be by the end of the block; its point is to avoid compressing incompressible blocks entirely, and your request is a fairly unusual use case. It could be extended to do what you ask, but I'm not currently implementing new features.
I've decided this isn't worth implementing, apologies.