What happens if we embed the probability tables?
Yoric opened this issue · comments
In #293, we have a mechanism to embed probability tables in the prelude. The results seem to indicate that probability tables don't take that much space.
Could we possibly improve our compression results by giving up on the idea of shared probability tables and rather embedding the probability tables in the file?
Quick test with https://github.com/Yoric/binjs-ref/tree/entropy-0.4-embed and dictionary depth = 4
This is untested code.
File | raw | brotli | size vs master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 10492698 | 10390535 | 1.1541068761235 |
floats.content | 336910 | 110130 | 1 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 1136755 | 109247 | 0.109511689754136 |
identifier_names.prelude | 82185 | 51915 | 0.98087932435241 |
identifier_names_len.prelude | 25953 | 15604 | 0.987220043021637 |
interface_names.content | 583439 | 187726 | |
interface_names.prelude | 382683 | 125190 | |
interface_names_len.prelude | 20871 | 22379 | |
list_lengths.content | 1985830 | 550497 | 1 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 4638825 | 4641116 | 2.66085928528388 |
probabilities.prelude | 9157740 | 627889 | |
probabilities_len.prelude | 208540 | 122010 | |
property_keys.content | 2756485 | 230093 | 0.233311532592108 |
property_keys.prelude | 2899115 | 894790 | 0.885219887793069 |
property_keys_len.prelude | 201946 | 128929 | 0.891989124193136 |
string_enums.content | 28186 | 15093 | |
string_enums.prelude | 11260 | 12777 | |
string_enums_len.prelude | 5448 | 6273 | |
string_literals.content | 5468049 | 153240 | 0.174870421975885 |
string_literals.prelude | 5978408 | 1834340 | 0.928292334607095 |
string_literals_len.prelude | 364302 | 233533 | 0.931780186808495 |
unsigned_longs.content | 449125 | 91089 | 1 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
I'm tracking a bug that greatly increases the amount of data we write to *.prelude.
Latest version
File | raw | brotli | size vs master |
---|---|---|---|
js | 43134534 | 8016723 | 1 |
binjs | 8073786 | 8026568 | 0.891534201123702 |
floats.content | 363023 | 154625 | 1.40402251884137 |
floats.prelude | 126094 | 72487 | 1 |
identifier_names.content | 2524124 | 997583 | 1 |
identifier_names.prelude | 86304 | 52927 | 1 |
identifier_names_len.prelude | 26637 | 15806 | 1 |
interface_names.content | 770395 | 254665 | |
interface_names.prelude | 388429 | 126610 | |
interface_names_len.prelude | 21193 | 22707 | |
list_lengths.content | 1986159 | 549604 | 0.998377829488626 |
list_lengths.prelude | 10284 | 12132 | 1 |
main.entropy | 1678455 | 1680352 | 0.963384716465898 |
probabilities.prelude | 946975 | 292452 | |
probabilities_len.prelude | 175284 | 76101 | |
property_keys.content | 1630992 | 986205 | 1 |
property_keys.prelude | 3150015 | 1010811 | 1 |
property_keys_len.prelude | 220936 | 144541 | 1 |
string_enums.content | 35874 | 26162 | |
string_enums.prelude | 11977 | 13416 | |
string_enums_len.prelude | 5691 | 6459 | |
string_literals.content | 2461767 | 1580965 | 1.80412435838623 |
string_literals.prelude | 6205428 | 1976037 | 1 |
string_literals_len.prelude | 380515 | 250631 | 1 |
unsigned_longs.content | 449125 | 96130 | 1.05534147921264 |
unsigned_longs.prelude | 4489 | 6337 | 1 |
We still embed a lot of data that I'm pretty sure we don't need, but we're now within 1% of brotli. Roundtrip testing is still pending.
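For reference, the "within 1% of brotli" figure can be checked directly from the brotli column of the table above. This is only a quick arithmetic sketch: the two sizes are copied from the `js` and `binjs` rows, while the helper name `size_ratio` is made up for illustration and is not part of binjs-ref.

```python
# Brotli-compressed sizes, copied from the "Latest version" table above.
JS_BROTLI = 8_016_723      # row "js", column "brotli"
BINJS_BROTLI = 8_026_568   # row "binjs", column "brotli"


def size_ratio(candidate: int, baseline: int) -> float:
    """Return candidate size divided by baseline size (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return candidate / baseline


ratio = size_ratio(BINJS_BROTLI, JS_BROTLI)
overhead_percent = (ratio - 1.0) * 100.0

# A ratio above 1 means the binjs output is larger than brotli-compressed JS.
print(f"binjs/brotli: {ratio:.4f} ({overhead_percent:+.2f}%)")
# prints "binjs/brotli: 1.0012 (+0.12%)"
```

The same ratio, binjs size over brotli-compressed JS size, is presumably what the `binjs/brotli` lines in the later comments report.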
Latest version, depth 1, following the protocol of binast/binjs-fbssdc#2 as closely as possible.
Facebook sample set
$ cargo run --release --example sample_directory -- --in tests/data/facebook/single/ --sampling 0.2 --depth 1 --follow-symlinks false --min-size 0 --dictionary-threshold 0
binjs/brotli: 1.05
Real js samples
$ cargo run --release --example sample_directory -- --in ~/Downloads/scrap/ --sampling 0.2 --depth 1 --follow-symlinks false --min-size 0 --dictionary-threshold 0
binjs/brotli: 1.03