rotemdan / lzutf8.js

A high-performance Javascript string compression library

Compatible with gzip?

burtonator opened this issue

Are the compressed strings compatible with gzip? That would make interop easy, since the raw data could then be worked with using external tools.

lz-utf8 was mostly designed and developed in the summer of 2014, before asm.js (introduced in 2013) and the later WebAssembly (introduced in 2017) were widely supported by browsers. Back then JavaScript ports of gzip did exist, but they were very slow and impractical for most uses.

At the time I thought it was a good/"cool" idea to design a compression format that was simple and fast enough to run within the browser. Since it was designed only for strings (not general binary data), I thought it would be nice to make it compatible with UTF-8: a plain UTF-8 string is itself a valid form of lz-utf8 binary data, so you could, for example, send plain UTF-8 strings from a server and decode them with the lz-utf8 decoder with no issues. Internally, the format differentiates its own sequences from UTF-8 by modifying a single bit. In more detail:

A UTF-8 encoded code point takes one of these forms:

  • 0xxxxxxx
  • 110xxxxx 10xxxxxx
  • 1110xxxx 10xxxxxx 10xxxxxx
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

So in lz-utf8 I used these two non-conflicting patterns, whose second byte starts with 0 rather than 10, to mark the format's own sequences without breaking UTF-8 (see the technical paper for more details):

  • 110lllll 0ddddddd
  • 111lllll 0ddddddd dddddddd
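To illustrate how a decoder could tell the two cases apart, here is a minimal sketch (not the library's actual implementation). It assumes that `l` is a match length and `d` is a backward distance in bytes into the already-decoded output, which is what the letters suggest; the function name `decodeSketch` is made up for this example.

```javascript
// Sketch: decode a byte stream that mixes plain UTF-8 with the two
// pointer patterns above.
// Assumption: l = match length, d = backward distance in bytes.
function decodeSketch(input) {
  const out = [];
  let i = 0;
  while (i < input.length) {
    const lead = input[i];
    const next = input[i + 1];
    // A UTF-8 lead byte (11xxxxxx) is always followed by 10xxxxxx.
    // A pointer has the same lead bits, but its second byte starts with 0.
    const isPointer =
      (lead & 0xc0) === 0xc0 && next !== undefined && (next & 0x80) === 0;
    if (!isPointer) {
      out.push(lead); // plain UTF-8 byte, copied through unchanged
      i += 1;
      continue;
    }
    const length = lead & 0x1f; // the five l bits
    let distance;
    if ((lead & 0x20) === 0) {
      distance = next; // 110lllll 0ddddddd
      i += 2;
    } else {
      if (i + 2 >= input.length) throw new RangeError("truncated pointer");
      distance = (next << 8) | input[i + 2]; // 111lllll 0ddddddd dddddddd
      i += 3;
    }
    if (distance === 0 || distance > out.length) {
      throw new RangeError("pointer reaches outside the decoded output");
    }
    // Copy byte by byte so that overlapping matches also work.
    const start = out.length - distance;
    for (let k = 0; k < length; k++) out.push(out[start + k]);
  }
  return Uint8Array.from(out);
}
```

For example, the bytes `61 62 63 C3 03` ("abc" followed by a pointer with length 3 and distance 3) decode to "abcabc" under these assumptions, while `C3 A9` (the UTF-8 encoding of "é") passes through untouched because its second byte starts with `10`.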

There's no compatibility with any other known format; it's a unique design. It also neither includes nor supports any headers, so there's zero overhead: an incompressible string comes out at exactly the length of the original. This also means it cannot be made compatible with any format that has headers (like gzip).

Today with WebAssembly of course you could just run a port of zlib (or use some wrapper library) in the browser with near-native performance.

And of course the core of lzutf8 could (probably "should" at this point) be rewritten in C/Rust and compiled to WebAssembly (to get native [possibly up to 10-20x] performance in the browser and on the server, plus proper command line tools), but I don't have the time or the incentive to do so.

So it's basically a bit of a "novelty" format that I made a while back because I thought it was somehow cool. I'm not especially excited about developing it further, since that probably wouldn't give me any real benefits or value.

Thanks. This was my thinking as well. I'm trying to find a gzip impl written in WebAssembly (ideally) that works with a web worker. Haven't found one yet :-/

Nowadays you can use the Compression Streams API both on pages and in web workers. However, you'll have to encode the binary output using some text-friendly encoding if you want to store the output as a string. So this library is still worth looking at, since it utilises all UTF-8 characters.
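As a rough sketch of that approach, the following gzips a string with `CompressionStream` and encodes the result as Base64 so it can be stored as text. The helper names `gzipToBase64` and `gunzipFromBase64` are made up for this example, and Base64 is just one possible text-friendly encoding. It assumes an environment where `CompressionStream`, `Blob`, `Response`, `btoa` and `atob` are available as globals (current browsers, web workers, and recent Node.js versions).

```javascript
// Compress a string with gzip and return the result as a Base64 string.
async function gzipToBase64(text) {
  const stream = new Blob([text])
    .stream()
    .pipeThrough(new CompressionStream("gzip"));
  const bytes = new Uint8Array(await new Response(stream).arrayBuffer());
  // btoa expects a "binary string" with one character per byte.
  let binary = "";
  for (const byte of bytes) binary += String.fromCharCode(byte);
  return btoa(binary);
}

// Reverse the steps above: Base64 -> gzip bytes -> original string.
async function gunzipFromBase64(base64) {
  const bytes = Uint8Array.from(atob(base64), (ch) => ch.charCodeAt(0));
  const stream = new Blob([bytes])
    .stream()
    .pipeThrough(new DecompressionStream("gzip"));
  return new Response(stream).text();
}
```

Unlike lz-utf8 output, the decoded Base64 bytes are a regular gzip stream, so external tools can read them. The trade-off is the gzip header and trailer plus roughly one third extra size from Base64.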