jstrieb / urlpages

Create and view web pages stored entirely in the URL

Home Page: http://jstrieb.github.io/urlpages

Switch to base65536

paulmillr opened this issue

Hey — Great idea you have there!

I suggest compressing the data more by using base65536: https://www.npmjs.com/package/base65536

That won't work well... URLs don't allow Unicode characters, so the characters would need to be URL encoded, which would defeat the purpose: the encoded data would almost certainly end up longer than the original data. You can get the shortest URLs only by using characters that do not need URL encoding.

You'd have more luck using an actual compression algorithm that produces URL-friendly output; for example, the REPL on the Babel site uses lz-string: https://github.com/babel/website/blob/21ca7a26f47dbfe1cae8a2300d3a60b2c39f5c8e/js/repl/UriUtils.js#L31-L42
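The size penalty from percent-encoding is easy to demonstrate (a quick Node sketch; the Hangul character is just an arbitrary example):

const char = '문';
// One character, three UTF-8 bytes...
console.log(Buffer.byteLength(char, 'utf-8'));  // 3
// ...but nine ASCII characters once percent-encoded for a URL.
console.log(encodeURIComponent(char));          // '%EB%AC%B8'
console.log(encodeURIComponent(char).length);   // 9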

URLs don't allow Unicode characters

That may have been correct in the past, but I don't think it still holds up today. See:

  1. https://ko.wikipedia.org/wiki/위키백과:대문 — copy-paste this, no encoding is done
  2. https://url.spec.whatwg.org/

Ah, good catch, I didn't realise that Punycode wasn't required for URLs any more.

In any case, UTF-8 characters can be up to four bytes. That base65536 encoding isn't actually compressing the data, just encoding it differently. It still takes up the same amount of storage space, just with fewer characters, which has arguably minimal benefit (you're not saving any bandwidth, for example). It seems useful for cramming lots of data into a Tweet by exploiting the way that Twitter counts the length of tweets, but not for anything else :P

You'll still see better results from an actual compression algorithm, as then the data will actually be compressed and take up less space.
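As a minimal sketch of why re-packing bytes into larger code points doesn't shrink anything (assuming the data ends up stored or transmitted as UTF-8):

// Pack two one-byte chars into a single code point, base65536-style.
const input = 'hi';                                  // 2 chars, 2 bytes
const packed = String.fromCodePoint(
  (input.charCodeAt(0) << 8) | input.charCodeAt(1)   // 0x6869
);
console.log(packed.length);                          // 1 character
console.log(Buffer.byteLength(packed, 'utf-8'));     // 3 bytes: no saving

Half the characters, but more bytes once encoded.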

I did a couple of tests with pretty large pages I had in CodePen; the actual reduction was only from 9 KB to 8 KB in the largest (and not noticeable in the smallest). There's probably a use case, though, where this does more to actually compress a string, but it seems to perform similarly to atob().

Yeah, it should obviously work, because that's like the whole idea of base65k. Take a look here: I've created my own implementation of base65k. It's very naïve, but I didn't want to dig through the package's code if it didn't work for you. Create b65k.js in the urlpages root directory and execute node b65k.js.

The result is 2.2x reduction:

file size 506
btoa size 676
65k size 253

const fs = require('fs').promises;

function btoa(text) {
  // node version of the browser function.
  return Buffer.from(text).toString('base64');
}

function b65k(text) {
  // Check that every char fits in a single byte.
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 255) {
      console.log('non-ASCII');
      return text;
    }
  }

  // Every char fits in one byte.
  // Combine two chars into one code point.
  // Max combined value: 0xFFFF = 65535.
  let compressed = '';
  for (let i = 0; i < text.length; i += 2) {
    const fst = text.codePointAt(i).toString(16).padStart(2, '0');
    // An odd-length string leaves a lone final char with no partner byte.
    const snd = ((i + 1) < text.length)
      ? text.codePointAt(i + 1).toString(16).padStart(2, '0')
      : '';
    compressed += String.fromCodePoint(parseInt(fst + snd, 16));
  }
  return compressed;
}

function deb65k(compressed) {
  let text = '';
  for (let i = 0; i < compressed.length; i++) {
    const code = compressed.codePointAt(i);
    // A value below 256 came from a lone trailing byte, not a packed pair.
    if (code < 256) {
      text += String.fromCodePoint(code);
      continue;
    }
    const pair = code.toString(16).padStart(4, '0');
    text += String.fromCodePoint(parseInt(pair[0] + pair[1], 16));
    text += String.fromCodePoint(parseInt(pair[2] + pair[3], 16));
  }
  return text;
}

(async () => {
  const text = await fs.readFile('editor/main.css', 'utf-8');
  console.log('file size', text.length);
  console.log('btoa size', btoa(text).length);
  console.log('65k size', b65k(text).length);
  // Identical
  // console.log(deb65k(b65k(text)));
})();

@paulmillr - String.prototype.length measures the number of UTF-16 code units, not the byte length of the string. '문'.length returns 1 as it's just one character, even though that character actually takes 3 bytes of space in UTF-8. The space taken is what we care about here.
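A quick way to see the difference in Node:

console.log('문'.length);                        // 1 code unit
console.log(Buffer.byteLength('문', 'utf-8'));   // 3 bytes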

Changing your script to output the number of bytes in the string:

console.log('file size', Buffer.byteLength(text, 'utf-8'));
console.log('btoa size', Buffer.byteLength(btoa(text), 'utf-8'));
console.log('65k size', Buffer.byteLength(b65k(text), 'utf-8'));

This shows that b65k has a clear disadvantage:

λ node c:\temp\b65k.js
(node:6268) ExperimentalWarning: The fs.promises API is experimental
file size 506
btoa size 676
65k size 759
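That result is consistent with the arithmetic: the 506 input bytes get packed into 253 code points, and a code point in the U+0800–U+FFFF range takes three bytes in UTF-8, so 253 × 3 = 759 bytes.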

And LZString actually helps:

const LZString = require('lz-string');
...
console.log(
  'LZString size',
  Buffer.byteLength(LZString.compressToBase64(text), "utf-8")
);

results in:

LZString size 440
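If the target is specifically a URL, lz-string also has compressToEncodedURIComponent, whose output is meant to be safe to drop into a URL as-is, without further percent-encoding. A rough sketch of using it on the same file:

const fs = require('fs').promises;
const LZString = require('lz-string');

(async () => {
  const text = await fs.readFile('editor/main.css', 'utf-8');
  // URI-safe output alphabet, so nothing gets expanded by percent-encoding.
  const compressed = LZString.compressToEncodedURIComponent(text);
  console.log('LZString URI size', Buffer.byteLength(compressed, 'utf-8'));
  // Round trip back to the original text.
  const restored = LZString.decompressFromEncodedURIComponent(compressed);
  console.log('round trip ok', restored === text);
})();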

because that's like the whole idea of base65k.

No, the whole idea of base65k is to take advantage of Twitter counting characters (Unicode code points) rather than raw bytes. It's not really useful as an actual compression or transport strategy otherwise.

I'm talking here about browser limitations on URL parsing and URL parsing performance. I'm not talking about whether the result is fewer bytes; the result definitely has fewer characters, so it should be twice as fast to parse.

One use case of the project is having long URLs that can be stored by link shorteners with upper limits on URL size. With that use case in mind, what matters most is whether the shorteners care about bytes or characters when measuring URL size, and how they handle Unicode URLs. My guess is that it varies, but it still merits implementing a base65536 encoding feature.

Tangential note: I'm currently looking into integrating Brotli for real URL compression since it was designed for exactly this type of data and will hopefully be able to get URL lengths considerably shorter.

With that use case in mind, what matters most is whether the shorteners care about bytes or characters when measuring URL size

@jstrieb - Shorteners usually care about bytes, given they need to store the data as bytes in their database. Regardless of whether you have four characters that take one byte each or one character that takes four bytes, that's still going to consume four bytes in a database. I ran a URL shortener for many years.

Tangential note: I'm currently looking into integrating Brotli for real URL compression since it was designed for exactly this type of data

I'd be interested in hearing how Brotli compares to LZString for this data! Looking forward to seeing the results of that 😃
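A rough way to run that comparison (a sketch only, using Node's built-in zlib Brotli bindings and base64 as the transport, which is not necessarily how urlpages would integrate it):

const fs = require('fs').promises;
const zlib = require('zlib');
const LZString = require('lz-string');

(async () => {
  const text = await fs.readFile('editor/main.css', 'utf-8');

  // Brotli-compress, then base64-encode so the result is URL-transportable,
  // matching what compressToBase64 produces for LZString.
  const brotli = zlib.brotliCompressSync(Buffer.from(text)).toString('base64');
  const lz = LZString.compressToBase64(text);

  console.log('Brotli + base64 size', Buffer.byteLength(brotli, 'utf-8'));
  console.log('LZString base64 size', Buffer.byteLength(lz, 'utf-8'));
})();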