Is there a compression algorithm without ZERO width char ?

Question

finscn opened this issue 6 years ago · comments

In some Android browsers, LZString.decompress does not run correctly when the string includes zero-width characters (e.g. 0x80, 0x86, ...).

Is there an LZString method that compresses without producing zero-width characters?
Thanks.

pieroxy · Answer 1 · Fri Jan 18 2019 21:34:57 GMT+0800 (China Standard Time)

Try LZString.compressToUTF16

finscn · Answer 2 · Sat Jan 19 2019 13:49:01 GMT+0800 (China Standard Time)

@pieroxy, I tried it, but it didn't help.
I also found that the key problem is not the zero-width characters.
It's \u2028 and \u2029.
If the string includes these two characters, JSON.parse runs into problems.

Is there any way to make the result of LZString.compress free of \u2028 and \u2029?

Test case (in some Android browsers; as you know, in China there are many brands of mobile phones, and they use many custom browsers):

var compressed = LZString.compressToUTF16(str);
var json = {
    "data": compressed
};

var jsonStr = JSON.stringify(json);

// CAN'T parse
var obj = JSON.parse(jsonStr);

If the string contains no \u2028 or \u2029, everything is OK.
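One workaround that does not touch the compression step is to escape the two separators after stringifying. The sketch below is only an illustration: `stringifyWithoutSeparators` is a hypothetical helper name, and it assumes the affected parsers handle the `\u2028` / `\u2029` escape sequences even when they fail on the raw characters, which has not been verified on those browsers.

```javascript
// Hypothetical helper: stringify, then replace the raw separators with their
// \uXXXX escape sequences. Raw U+2028/U+2029 can only appear inside JSON
// string literals, where the escaped form is equivalent.
function stringifyWithoutSeparators(value) {
  return JSON.stringify(value)
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

// Stand-in for a compressed string that happens to contain both separators.
const sample = { data: 'a\u2028b\u2029c' };
const jsonStr = stringifyWithoutSeparators(sample);
const parsed = JSON.parse(jsonStr);

console.log(jsonStr.includes('\u2028')); // false
console.log(parsed.data === sample.data); // true
```

JSON.parse turns the escape sequences back into the original characters, so no extra decoding step is needed on the reading side.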

pieroxy · Answer 3 · Fri Feb 01 2019 20:42:58 GMT+0800 (China Standard Time)

If you look at compressToUTF16, it is easy to adapt to your use case:

compressToUTF16 : function (input) {
    if (input == null) return "";
    return LZString._compress(input, 15, function(a){return f(a+32);}) + " ";
  }

You can add a simple check in the lambda provided ( return f(a+32) ) to map both characters you want to avoid to other characters (for example 70001 and 70002).
You will have to implement the reverse lambda in the corresponding decompress method.
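A minimal sketch of the two lambdas, under some assumptions: the `15` in the call above is taken to be bits per character, so the unmodified lambda only ever emits code units 0x20 to 0x801F; the names `toSafeChar`, `fromSafeCode`, `REMAP` and `UNMAP` are made up for this example; and the replacement targets 0x9028 / 0x9029 are chosen here (instead of 70001 / 70002) because String.fromCharCode keeps only the low 16 bits of its argument.

```javascript
const f = String.fromCharCode;

// Hypothetical mapping: move the two separators to code units above 0x801F,
// which the unmodified lambda never produces (assumption: 15 bits per char).
const REMAP = { 0x2028: 0x9028, 0x2029: 0x9029 };
const UNMAP = { 0x9028: 0x2028, 0x9029: 0x2029 };

// Forward lambda: would take the place of function(a){return f(a+32);}
function toSafeChar(a) {
  const code = a + 32;
  return f(REMAP[code] || code);
}

// Reverse lambda: takes a code unit read from the compressed string and
// returns the original 15-bit value for the decompressor.
function fromSafeCode(code) {
  return (UNMAP[code] || code) - 32;
}

// Quick self-check over every 15-bit value.
let roundTripOk = true;
for (let a = 0; a < 32768; a++) {
  const ch = toSafeChar(a);
  if (ch === '\u2028' || ch === '\u2029') roundTripOk = false;
  if (fromSafeCode(ch.charCodeAt(0)) !== a) roundTripOk = false;
}
console.log(roundTripOk); // true
```

The corresponding decompress function would then apply `fromSafeCode` to each code unit it reads, in place of the plain subtraction of 32.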

Bart Lanen · Answer 4 · Wed Apr 03 2019 18:45:58 GMT+0800 (China Standard Time)

compressToBase64

Job van der Zwan · Answer 5 · Thu Apr 04 2019 22:44:56 GMT+0800 (China Standard Time)

Here you go: https://gist.github.com/JobLeonard/7a49b8e5adf17d9a3783ffcfa21eec3f

I also removed the string-characters, so:

`
"
'

So copying/pasting a compressed string won't suddenly lead to strange broken string behavior.

Did a quick test round here: https://observablehq.com/d/d223e05380aa85c9/

cyfung1031 · Answer 6 · Tue Jan 09 2024 01:35:12 GMT+0800 (China Standard Time)

You can do some easy tricks.

For example, add a fixed 4-char prefix M to the beginning of every compressed string, chosen so that it does not occur in the compressed string itself.
Then change your problem chars \u2028 and \u2029 to M\u1E28 and M\u1E29.
You can then decode by turning M\u1E28 and M\u1E29 back into \u2028 and \u2029.

Choice of chars in M:

above 0xFF
drawn from a large random range
excluded from [\u2000-\u200A\u202F\u205F\u3000\u200B-\u200D\u2060\uFEFF\u180E\u2800\u3164]
do not conflict with the offset

So \u3165 - \uFEFE should be a good range.
With a negative offset of 0x200, that becomes \u3165 - \uFCFE.

// Random integer in [a, b], inclusive.
function rndInt(a, b) {
  return Math.floor(Math.random() * (b - a + 1)) + a;
}

function compress(s) {
  let z = LZString.compress(s);
  let m = '';
  do {
    // generate a fixed-length (4-char) random prefix
    m = Array.from({ length: 4 }, () => String.fromCharCode(rndInt(0x3165, 0xFCFE))).join('');
  } while (z.includes(m)); // ensure it does not occur in the compressed output
  // breaking: \u2028\u2029
  // zero-width: \u200B-\u200D\u2060\uFEFF
  // problem chars: \u180E\u2800\u3164
  // other white spaces: \u2000-\u200A\u202F\u205F\u3000
  // replace each listed char with m followed by the char shifted down by 0x200
  z = z.replace(/[\u2028\u2029\u200B-\u200D\u2060\uFEFF\u180E\u2800\u3164\u2000-\u200A\u202F\u205F\u3000]/g, e => m + String.fromCharCode(e.charCodeAt(0) - 0x200));
  return m + z;
}

function decompress(z) {
  let m = z.substring(0, 4);
  z = z.substring(4);
  z = z.replace(new RegExp(`${m}(.)`, 'g'), (e, f) => String.fromCharCode(f.charCodeAt() + 0x200)); // offset +0x200
  const s = LZString.decompress(z);
  return s;
}

Note:

Before offset: ["2028","2029","200B","200D","2060","FEFF","180E","2800","3164","2000","200A","202F","205F","3000"]
After offset: ["1E28","1E29","1E0B","1E0D","1E60","FCFF","160E","2600","2F64","1E00","1E0A","1E2F","1E5F","2E00"]

Another approach

This works like escape/unescape, but instead of a backslash it uses the least frequent char t as the escape char, to keep the output as small as possible.

find the least frequent char t
replace every occurrence of t with t0
replace your unwanted chars with t1 t2 t3 ... t9 ta tb ... etc.
put the char t in the prefix

Then you can reverse the steps to decode: the first char is t; turn t1 t2 t3 ... back into the unwanted chars, and at the end replace t0 with t.

This should be very efficient and reliable.

function findLeastFrequentChar(text) {
    const charCount = {};
    let minCount = Infinity;
    let leastFrequentChar = '';

    // Count the occurrences of each character
    for (let char of text) {
        if (!charCount[char]) {
            charCount[char] = 0;
        }
        charCount[char]++;
    }

    // Find the character with the least occurrences
    for (let char in charCount) {
        if (charCount[char] < minCount) {
            minCount = charCount[char];
            leastFrequentChar = char;
        }
    }

    return leastFrequentChar;
}
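The remaining steps (escape and unescape) could look roughly like the sketch below. It is an illustration under stated assumptions, not a tested drop-in: the names `UNWANTED`, `DIGITS`, `pickEscapeChar`, `escapeUnwanted` and `unescapeUnwanted` are made up here, the unwanted list is limited to \u2028 and \u2029, and the block carries its own counting helper that works on UTF-16 code units and never picks an unwanted char as t.

```javascript
// Hypothetical list of chars to keep out of the output; extend as needed
// (DIGITS below allows up to 35 entries).
const UNWANTED = ['\u2028', '\u2029'];
// Char written after t: '0' means "literal t", '1', '2', ... select UNWANTED[0], UNWANTED[1], ...
const DIGITS = '0123456789abcdefghijklmnopqrstuvwxyz';

// Least frequent code unit that is not itself unwanted ('~' if there is none).
function pickEscapeChar(text) {
  const counts = new Map();
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (!UNWANTED.includes(c)) counts.set(c, (counts.get(c) || 0) + 1);
  }
  let best = '~';
  let min = Infinity;
  for (const [c, n] of counts) {
    if (n < min) { min = n; best = c; }
  }
  return best;
}

function escapeUnwanted(text) {
  const t = pickEscapeChar(text);
  let out = t; // the prefix tells the decoder which char is t
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    const k = UNWANTED.indexOf(c);
    if (c === t) out += t + DIGITS[0];
    else if (k >= 0) out += t + DIGITS[k + 1];
    else out += c;
  }
  return out;
}

function unescapeUnwanted(escaped) {
  const t = escaped[0];
  let out = '';
  for (let i = 1; i < escaped.length; i++) {
    const c = escaped[i];
    if (c !== t) { out += c; continue; }
    const k = DIGITS.indexOf(escaped[++i]); // char after t selects the original
    out += k === 0 ? t : UNWANTED[k - 1];
  }
  return out;
}
```

A single left-to-right pass is used for both directions instead of chained replace calls, so an escaped t0 can never be confused with t1, t2, and so on. In use, the input to `escapeUnwanted` would be the compressed string, and `unescapeUnwanted` would run before decompression.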