pieroxy / lz-string

LZ-based compression algorithm for JavaScript

Is there a compression algorithm without ZERO width char ?

finscn opened this issue

On some Android browsers, LZString.decompress does not run correctly when the string includes ZERO WIDTH characters (e.g. 0x80, 0x86, ...).

Is there a way to make LZString compress without producing ZERO WIDTH characters?
Thanks

Try LZString.compressToUTF16

@pieroxy, I tried it, but it did not help.
I also found that the key point is not the ZERO WIDTH characters.
It is \u2028 and \u2029.
If the string includes these two characters, JSON.parse runs into problems.

Is there a way to make the result of LZString.compress free of \u2028 and \u2029?

Test case (on some Android browsers; as you know, there are many mobile phone brands in China, and they ship many custom browsers):

// str is the original, uncompressed input string
var compressed = LZString.compressToUTF16(str);
var json = {
    "data": compressed
};

var jsonStr = JSON.stringify(json);

// Fails to parse on the affected browsers
var obj = JSON.parse(jsonStr);

Without \u2028 and \u2029, everything works.
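One possible workaround that does not touch the compression at all is to escape the two characters after JSON.stringify, so the JSON text never contains them raw. This is only a sketch: the name stringifySafe is hypothetical, and it assumes the affected parsers fail on the raw characters but accept the escaped forms \u2028 and \u2029.

```javascript
// Hypothetical helper, not part of lz-string.
// JSON.stringify leaves U+2028 / U+2029 unescaped inside strings,
// so replace them with their six-character JSON escape sequences.
function stringifySafe(value) {
  return JSON.stringify(value)
    .replace(/\u2028/g, '\\u2028')
    .replace(/\u2029/g, '\\u2029');
}

// JSON.parse turns the escape sequences back into the original characters.
const sample = { data: 'a\u2028b\u2029c' };
const safeJson = stringifySafe(sample);
const parsed = JSON.parse(safeJson);
```

The cost is five extra characters per occurrence, which is usually small compared with the compressed payload.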

If you look at compressToUTF16, it is easy to adapt to your use case:

compressToUTF16 : function (input) {
    if (input == null) return "";
    return LZString._compress(input, 15, function(a){return f(a+32);}) + " ";
  }

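You can add a simple check in the lambda provided (return f(a+32)) to map the two characters you want to avoid to other characters. Since f is String.fromCharCode, the replacements must fit in 16 bits and lie outside the range the lambda already produces (char codes 32 to 32799); for example 40001 and 40002.
You will have to implement the reverse mapping in the lambda of the corresponding decompress method.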
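The pair of lambdas described above could look roughly like this. It is a sketch under assumptions: encodeUnit and decodeUnit are hypothetical names, and the replacement codes 40001 and 40002 are chosen here because String.fromCharCode keeps only the low 16 bits and the 15-bit output of a+32 never exceeds 32799, so these two values cannot collide with regular output.

```javascript
// Hypothetical helpers, not part of lz-string.
const FORWARD = new Map([[0x2028, 40001], [0x2029, 40002]]);
const REVERSE = new Map([[40001, 0x2028], [40002, 0x2029]]);

// Replacement for function(a){return f(a+32);} on the compress side.
function encodeUnit(a) {
  const code = a + 32;
  return String.fromCharCode(FORWARD.has(code) ? FORWARD.get(code) : code);
}

// Reverse mapping for the decompress side: char code in, 15-bit value out.
function decodeUnit(charCode) {
  const code = REVERSE.has(charCode) ? REVERSE.get(charCode) : charCode;
  return code - 32;
}

// Assumed wiring, to be checked against your copy of the library:
//   LZString._compress(input, 15, encodeUnit) + " "
//   LZString._decompress(s.length, 16384, i => decodeUnit(s.charCodeAt(i)))
```

The same pattern extends to more characters by adding entries to both maps, as long as every replacement stays above 32799 and below 0xD800 (to avoid surrogates).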

Try compressToBase64

Here you go: https://gist.github.com/JobLeonard/7a49b8e5adf17d9a3783ffcfa21eec3f

I also removed the string delimiter characters, namely:

`
"
'

so that copying and pasting a compressed string won't suddenly lead to strange broken-string behavior.

Did a quick test round here: https://observablehq.com/d/d223e05380aa85c9/

You can use a few easy tricks.

For example, prepend a fixed 4-character marker M to every compressed string, chosen so that it does not occur in the compressed string.
Then replace your problem characters \u2028 and \u2029 with M\u1E28 and M\u1E29.
You can decode by turning M\u1E28 and M\u1E29 back into \u2028 and \u2029.

Choice of characters in M:

  • above 0xFF
  • drawn at random from a large range
  • excluded from [\u2000-\u200A\u202F\u205F\u3000\u200B-\u200D\u2060\uFEFF\u180E\u2800\u3164]
  • do not conflict with the offset

So \u3165 - \uFEFE should be a good range.
Taking the negative offset 0x200 into account, use \u3165 - \uFCFE.

// Random integer in [a, b], inclusive
function rndInt(a, b) {
  return Math.floor(Math.random() * (b - a + 1)) + a;
}

function compress(s) {
  let z = LZString.compress(s);
  let m = '';
  do {
    // generate a fixed-length (4-char) marker
    m = Array.from({ length: 4 }, () => String.fromCharCode(rndInt(0x3165, 0xFCFE))).join('');
  } while (z.includes(m)); // ensure the marker does not occur in the compressed output
  // breaking: \u2028\u2029
  // zero-width: \u200B-\u200D\u2060\uFEFF
  // problem chars: \u180E\u2800\u3164
  // other white spaces: \u2000-\u200A\u202F\u205F\u3000
  z = z.replace(/[\u2028\u2029\u200B-\u200D\u2060\uFEFF\u180E\u2800\u3164\u2000-\u200A\u202F\u205F\u3000]/g, e => m + String.fromCharCode(e.charCodeAt(0) - 0x200)); // offset -0x200
  return m + z;
}

function decompress(z) {
  const m = z.substring(0, 4); // marker chosen by compress
  z = z.substring(4);
  z = z.replace(new RegExp(`${m}(.)`, 'g'), (e, f) => String.fromCharCode(f.charCodeAt(0) + 0x200)); // offset +0x200
  return LZString.decompress(z);
}

Note:

  • Before offset (single characters and range endpoints): ['2028', '2029', '200B', '200D', '2060', 'FEFF', '180E', '2800', '3164', '2000', '200A', '202F', '205F', '3000']
  • After offset: ["1E28","1E29","1E0B","1E0D","1E60","FCFF","160E","2600","2F64","1E00","1E0A","1E2F","1E5F","2E00"]
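The two lists in the note can be reproduced with a few lines of arithmetic, which is handy if you change the offset or the character set. The names before, after, and hex are only for illustration.

```javascript
// Code points from the note (single characters and range endpoints).
const before = [0x2028, 0x2029, 0x200B, 0x200D, 0x2060, 0xFEFF, 0x180E,
                0x2800, 0x3164, 0x2000, 0x200A, 0x202F, 0x205F, 0x3000];

// Format a number as 4 uppercase hex digits.
const hex = n => n.toString(16).toUpperCase().padStart(4, '0');

// Apply the -0x200 offset used by compress().
const after = before.map(n => hex(n - 0x200));
```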

Another approach

This works like escape/unescape, but instead of a backslash it uses the least frequent character t as the escape character, to keep the output as small as possible.

  1. Find the least frequent character t.

  2. Replace every occurrence of t with t0.

  3. Replace your unwanted characters with t1, t2, t3, ... t9, ta, tb, etc.

  4. Put the character t in the prefix.

To decode, reverse the steps: the first character is t; map t1, t2, t3, ... back to the unwanted characters, and at the end replace t0 with t.

This should be very efficient and reliable.

function findLeastFrequentChar(text) {
    const charCount = {};
    let minCount = Infinity;
    let leastFrequentChar = '';

    // Count the occurrences of each character
    for (let char of text) {
        if (!charCount[char]) {
            charCount[char] = 0;
        }
        charCount[char]++;
    }

    // Find the character with the least occurrences
    for (let char in charCount) {
        if (charCount[char] < minCount) {
            minCount = charCount[char];
            leastFrequentChar = char;
        }
    }

    return leastFrequentChar;
}
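Putting the four steps together might look like the sketch below. All names here (SUFFIXES, pickEscapeChar, escapeChars, unescapeChars) are hypothetical, and a few details are assumptions rather than part of the description above: characters are counted as UTF-16 code units, the escape character is never taken from the unwanted set, '~' is the fallback when the input has no usable character, and decoding scans left to right instead of running successive replaces so that it stays unambiguous even when t is a digit.

```javascript
// Suffix symbols for unwanted characters: t1, t2, ..., t9, ta, tb, ...
const SUFFIXES = '123456789abcdefghijklmnopqrstuvwxyz';

// Step 1: least frequent code unit in `text` that is not itself unwanted.
function pickEscapeChar(text, unwanted) {
  const counts = new Map();
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (!unwanted.includes(c)) counts.set(c, (counts.get(c) || 0) + 1);
  }
  let best = '~'; // fallback (assumes '~' is not in `unwanted`)
  let min = Infinity;
  for (const [c, n] of counts) {
    if (n < min) { min = n; best = c; }
  }
  return best;
}

// `unwanted` is an array of single-character strings.
function escapeChars(text, unwanted) {
  if (unwanted.length > SUFFIXES.length) throw new Error('too many unwanted characters');
  const t = pickEscapeChar(text, unwanted);
  let out = t; // step 4: the escape character goes in the prefix
  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    const k = unwanted.indexOf(c);
    if (c === t) out += t + '0';             // step 2: t -> t0
    else if (k >= 0) out += t + SUFFIXES[k]; // step 3: unwanted -> t1, t2, ...
    else out += c;
  }
  return out;
}

function unescapeChars(escaped, unwanted) {
  const t = escaped[0]; // the prefix tells us the escape character
  let out = '';
  for (let i = 1; i < escaped.length; i++) {
    const c = escaped[i];
    if (c !== t) { out += c; continue; }
    const s = escaped[++i]; // symbol following the escape character
    out += s === '0' ? t : unwanted[SUFFIXES.indexOf(s)];
  }
  return out;
}
```

For example, escapeChars(compressed, ['\u2028', '\u2029']) would be applied to the output of LZString.compress, and unescapeChars with the same list before LZString.decompress.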