kriszyp / msgpackr

Ultra-fast MessagePack implementation with extension for record and structural cloning / msgpack.org[JavaScript/NodeJS]

[Enhancement] Allow custom string dictionary + use location of repeated strings

joetex opened this issue

I didn't know this package existed, and wrote my own, doh. But mine is scoped too tightly to my project. I want to switch to something a bit more flexible that has community support, and I would love to keep some of the reductions I implemented. For strings in msgpackr, I've been unable to get any boost from bundleStrings, which was odd.

Enhancements:

  1. Allow a custom string dictionary: an array of commonly used strings that is supplied identically to both Packr and Unpackr. Looking up a string in this table should take only two bytes per string for a dictionary of up to 255 entries.

  2. Store the location of repeated strings instead of encoding the same string twice. If "hello" gets encoded at byte position 53 and the serializer sees "hello" again later, it should just encode position 53 for the second "hello". Again, this takes only 2 bytes, or more if the distance is greater than 255.
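
The two requests above can be sketched with a toy string encoder. This is only an illustration under stated assumptions, not msgpackr's wire format or API: the tag values and the `encodeStrings` / `decodeStrings` helpers are hypothetical, strings are limited to 255 UTF-8 bytes, and back-references are only used when the earlier position fits in one byte.

```javascript
// Toy format (hypothetical, not MessagePack):
//   LITERAL  <len> <utf8 bytes>   a string written in full
//   DICT_REF <index>              2 bytes: entry in the shared dictionary
//   BACK_REF <offset>             2 bytes: byte offset of an earlier LITERAL
const LITERAL = 0x00;
const DICT_REF = 0x01;
const BACK_REF = 0x02;

function encodeStrings(strings, dictionary = []) {
  const out = [];
  const seen = new Map(); // string -> byte offset of its first LITERAL
  for (const s of strings) {
    const dictIndex = dictionary.indexOf(s);
    if (dictIndex !== -1 && dictIndex < 256) {
      out.push(DICT_REF, dictIndex);
      continue;
    }
    const offset = seen.get(s);
    if (offset !== undefined) {
      out.push(BACK_REF, offset);
      continue;
    }
    const bytes = Buffer.from(s, 'utf8');
    if (bytes.length > 255) throw new RangeError('toy format: string too long');
    // Only remember offsets that fit in one byte; later repeats of other
    // strings are simply written out again as literals.
    if (out.length <= 255) seen.set(s, out.length);
    out.push(LITERAL, bytes.length, ...bytes);
  }
  return Buffer.from(out);
}

function decodeStrings(buf, dictionary = []) {
  const result = [];
  // Read the LITERAL that starts at byte offset `pos`.
  const readLiteral = (pos) =>
    buf.toString('utf8', pos + 2, pos + 2 + buf[pos + 1]);
  let pos = 0;
  while (pos < buf.length) {
    const tag = buf[pos];
    if (tag === DICT_REF) {
      result.push(dictionary[buf[pos + 1]]);
      pos += 2;
    } else if (tag === BACK_REF) {
      result.push(readLiteral(buf[pos + 1]));
      pos += 2;
    } else if (tag === LITERAL) {
      result.push(readLiteral(pos));
      pos += 2 + buf[pos + 1];
    } else {
      throw new Error(`unknown tag ${tag}`);
    }
  }
  return result;
}

// The same dictionary must be given to both sides.
const dictionary = ['type', 'player'];
const input = ['type', 'hello', 'player', 'hello'];
const packed = encodeStrings(input, dictionary);
const unpacked = decodeStrings(packed, dictionary);
console.log(packed.length, unpacked);
```

In this sketch the two dictionary hits and the repeated "hello" each cost 2 bytes, while the first "hello" costs 7 (tag, length, 5 bytes of text). A real format would need wider offsets and indexes once positions or dictionary sizes exceed one byte.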

Feel free to see my own awful implementation, acos-json-encoder.
Edit: link goes to line where I implemented

I am looking for this enhancement too.
@kriszyp can you check this request?

I think bundleStrings could be optimized by not saving the same string multiple times.

You might consider using CBOR packing, which was designed for this purpose:
https://github.com/kriszyp/cbor-x?tab=readme-ov-file#cbor-packing
However, this will only find exact string value duplicates (no duplicates within a string; it won't do any compression of {foo: 'hello', bar: 'hello world'}). For more general string deduping, that is kind of the whole point of RLE compression, and there are plenty of great compression formats and tools which are much better than anything msgpack could offer.
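
To illustrate the distinction drawn above, here is a minimal sketch using Node's built-in `zlib` module as a stand-in for "a general compression tool". The payload is made up for the example: every value shares the substring "hello" but no two values are exactly equal, so exact-value deduplication has nothing to find while a general-purpose compressor still shrinks the data.

```javascript
const zlib = require('node:zlib');

// Hypothetical payload: values overlap as substrings but never match exactly.
const records = Array.from({ length: 200 }, (_, i) => ({
  foo: `hello ${i}`,
  bar: `hello world ${i}`,
}));

// Exact-duplicate detection: count distinct string values.
const values = records.flatMap((r) => [r.foo, r.bar]);
const distinctValues = new Set(values).size;

// General-purpose compression applied to the serialized bytes.
const raw = Buffer.from(JSON.stringify(records), 'utf8');
const deflated = zlib.deflateSync(raw);
const restored = JSON.parse(zlib.inflateSync(deflated).toString('utf8'));

console.log(`values: ${values.length}, distinct: ${distinctValues}`);
console.log(`raw: ${raw.length} bytes, deflated: ${deflated.length} bytes`);
```

JSON is used here only to keep the sketch dependency-free; the same compression step could be applied to the output of any serializer.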

@kriszyp Thank you very much!
This is exactly what I was looking for. I only need to find exact string duplicates.

Great results!
original size: 386275 bytes
msgpackr (with useRecords): 101464 bytes
cbor (with useRecords and pack): 61865 bytes
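
For reference, the relative sizes implied by the figures above can be worked out with a few lines of arithmetic. The `percentOf` helper is just a name chosen for this sketch; the byte counts are the ones reported in the comment.

```javascript
// Byte counts as reported above.
const sizes = { original: 386275, msgpackr: 101464, cbor: 61865 };

// Size of `part` as a percentage of `whole`, to one decimal place.
const percentOf = (part, whole) => ((part / whole) * 100).toFixed(1);

console.log(`msgpackr vs original: ${percentOf(sizes.msgpackr, sizes.original)}%`); // 26.3%
console.log(`cbor vs original:     ${percentOf(sizes.cbor, sizes.original)}%`);     // 16.0%
console.log(`cbor vs msgpackr:     ${percentOf(sizes.cbor, sizes.msgpackr)}%`);     // 61.0%
```

So for this particular data set, the packed CBOR output is about 39% smaller than the msgpackr output.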