antonylesuisse / lzwjs

lzwjs, LZW designed to encode javascript Strings for storage in UTF-8 encoded files or strings.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

lzwjs

LZW designed to encode javascript Strings for storage in UTF-8 encoded files or strings (such as HTML or XML) in a subset of ascii.

We use a code formed by 3 base characters. The UTF-8 encoding restrict us to characters below 0x80 which have 85.5% density (0x00-0x7F encoding: 0xxxxxxx 7/8 = 87.5%), as the density decrease above (i.e. 0x080-0x7FF encoding is 110xxxxx 10xxxxxx thus only 11/16 = 68.75%).

By default base=64 uses [A-Z] [a-z] [0-9] "-_", so that it is humanly readable, printable, pastable and usable in xml attributes or URLs without any escape.

The waste is only 14.2% log(64)/log(128), 1bit unused per byte, if you take into account the UTF-8 restriction or 25% log(64)/log(256), 2bit unused per byte if you compare it to raw binary files.

It also supports bases above 64:

  • 86 default + "!#%()*+,./:;=?@[]^{|}~" safe extra printable ascii
  • 87 above + " " space char
  • 89 above + "$`" unsafe for js template string
  • 120 above + [\x01\x1f] non null control chars
  • 121 above + "\x1f" delete char, safe for xml text and attributes
  • 124 above + "&'>" safe for javascript strings in html
  • 125 above + "\00" null still safe for javascript strings in html
  • 128 above + '<"\' all ascii

Long enough strings are needed for the algorithm to be efficient as we need to fill the code size which is base**3, (around 4MB for base64, around 30MB for base=120, around 50MB for base 128)

Benchmarks relative to gzip/bzip2 confirms this. It's on par with gzip/bzip2 for once you take into account the base encoding waste log(base)/log(256).

Deeply Inspired by https://gist.github.com/revolunet/843889 lzw_encoder.js

https://github.com/antonylesuisse/lzwjs

Public domain. Antony Lesuisse. 2021.

About

lzwjs, LZW designed to encode javascript Strings for storage in UTF-8 encoded files or strings.


Languages

Language:JavaScript 100.0%