how to setup a compressUTF8() variant

Question

nielsnl68 opened this issue 3 years ago

Hi, could you give me some hints on how to change compressToUTF16() into a compressToUTF8()?
I use another lib to encrypt the string, but it can't handle UTF-16 at the moment.

pieroxy · Answer 1 · Thu Aug 05 2021 19:40:23 GMT+0800 (China Standard Time)

Can you tell us why you need a utf8 implementation? It makes little to no sense at all.

You talk about encryption, but UTF (8 or 16) is about encoding. Can you tell us more about what you’re trying to do?

NP v/d Spek · Answer 2 · Wed Aug 11 2021 16:51:41 GMT+0800 (China Standard Time)

Hello @pieroxy,

Thanks for your reply. In my case I use our compression module to talk between the server and the clients, over ajax calls and in compressed URLs that activate mailbox connections.
At the moment all communication between them is based on UTF-8 encoding. I tried using your compressToUTF16 solution, but I found that the data was not sent correctly; or rather, the receiving side saw some end-of-file bytes and ditched the remainder of the data.

At the moment I use the base64 solution, but now the packages are 2x as big.

So I was hoping that we could convert the encrypted data into a UTF-8 string, like you do with UTF-16.

Ryc O'Chet · Answer 3 · Wed Aug 11 2021 20:03:33 GMT+0800 (China Standard Time)

Just to note - JavaScript strings are not UTF-8. The language stores all text as UTF-16, so the data can change when converting (one UTF-8 character might be more than one UTF-16 code unit, etc.).

Having said that, if you need to send as UTF-8, then something like https://gist.github.com/joni/3760795/8f0c1a608b7f0c8b3978db68105c5b1d741d0446 might be a good starting point for how to convert - you'll still need to send it as raw binary data from the array. Decoding is another matter, as it'll need converting to binary without allowing JS itself to convert to UTF-16 (which will break it entirely).

Are you sure it won't be easier to put a UTF-16 support library on the backend instead?
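
The gap between UTF-16 code units and UTF-8 bytes can be seen directly with the standard TextEncoder and TextDecoder APIs. This is a minimal sketch; the sample string is chosen only for illustration and does not come from the thread.

```javascript
// Sketch: compare UTF-16 code units (what a JS string holds) with the
// UTF-8 bytes that usually travel over the wire.
// The sample text is an illustrative assumption: "héllo" plus one emoji.
const text = "h\u00e9llo \u{1F600}";

const utf16Units = text.length;                    // count of UTF-16 code units
const utf8Bytes = new TextEncoder().encode(text);  // Uint8Array of UTF-8 bytes
const roundTrip = new TextDecoder("utf-8").decode(utf8Bytes);

console.log(utf16Units);         // 8
console.log(utf8Bytes.length);   // 11
console.log(roundTrip === text); // true
```

The byte array, not the string, is what should be handed to the transport if the other side expects UTF-8.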

NP v/d Spek · Answer 4 · Wed Aug 11 2021 21:10:36 GMT+0800 (China Standard Time)

I did not know that JavaScript did not support UTF-8; this is the first time I have read about it.

That gist example does look perfect.

Ryc O'Chet · Answer 5 · Wed Aug 11 2021 21:14:24 GMT+0800 (China Standard Time)

> I did not know that JavaScript did not support UTF-8; this is the first time I have read about it.

Pretty good writeup including references - https://flaviocopes.com/javascript-unicode/

NP v/d Spek · Answer 6 · Wed Aug 11 2021 22:31:19 GMT+0800 (China Standard Time)

Okay, then it makes me wonder why the ajax calls are breaking.
I will investigate that soon.

It does make sense now to change

Ryc O'Chet · Answer 7 · Wed Aug 11 2021 22:38:57 GMT+0800 (China Standard Time)

Binary is always a safer transport mechanism than some encoding ;-)

NP v/d Spek · Answer 8 · Fri Aug 20 2021 00:22:44 GMT+0800 (China Standard Time)

Okay, I created my own fromUTF8() and toUTF8() functions, so that they produce a readable string from a byte array.

I understand how your _compress() function works, like: _compress(uncompressed, 8, function (a) { return f(a); });
But I have no clue how I should set the _decompress() function's second parameter.

Could you help with that?
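
On the second parameter of _decompress(): in the library's own encoders it appears to be the highest bit of one output unit, that is 2^(bitsPerChar - 1). The pairs used there are 6 bits with 32, 15 bits with 16384, and 16 bits with 32768, so 8 bits would pair with 128. The sketch below assumes that relation holds; the helper names (resetValueFor, compressTo8Bit, decompressFrom8Bit) are hypothetical, and the LZString object is passed in as `lz` so the snippet does not depend on how the library is loaded. It is worth checking against the version of lz-string actually in use.

```javascript
// Assumption: resetValue is the top bit of one output unit,
// i.e. 2^(bitsPerChar - 1). This matches the pairs seen in lz-string's
// own encoders: 6 -> 32, 15 -> 16384, 16 -> 32768.
function resetValueFor(bitsPerChar) {
  return 1 << (bitsPerChar - 1);
}

// Hypothetical helper: compress to a string whose char codes are 0..255.
// `lz` is the LZString object supplied by the caller.
function compressTo8Bit(lz, input) {
  return lz._compress(input, 8, (a) => String.fromCharCode(a));
}

// Hypothetical helper: the matching decompress call for 8 bits per char.
function decompressFrom8Bit(lz, compressed) {
  return lz._decompress(
    compressed.length,
    resetValueFor(8), // 128
    (index) => compressed.charCodeAt(index)
  );
}

console.log(resetValueFor(8)); // 128
```

With a custom mapping such as toUTF8() / fromUTF8(), the third argument would instead read values back from the decoded byte array.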

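// 7-bit values that get a dedicated code point (0xE0 + index) instead of the
// default offset mapping used in char() below.
// Note: 0x93 is above the 7-bit range (0x00-0x7F), so it never matches.
const UTF8convert = [
    0x01, 0x06, 0x3D, 0x3E, 0x5A,
    0x60, 0x62, 0x64, 0x68, 0x69, 0x6B, 0x70, 0x73, 0x93
];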

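// Packs an array of bytes into a string of 7-bit values, one character each.
function toUTF8(data) {
    // Maps one 7-bit value (0x00-0x7F) to a character.
    const char = (value) => {
        let q = UTF8convert.indexOf(value);
        if (q >= 0) {
            return String.fromCharCode(q + 0xE0);       // remapped: 0xE0..0xED
        } else if (value < 0x5d) {
            return String.fromCharCode(value + 0x21);   // 0x21..0x7D
        } else {
            return String.fromCharCode(value + 0x44);   // 0xA1..0xC3
        }
    }

    var str = '',
        shift = 0,   // bits carried over from the previous byte
        i;

    for (i = 0; i < data.length; i++) {
        var value = (data[i] << (i % 7)) + shift;
        shift = value >> 7;
        value = value & 0x7f;
        str += char(value);
        if (i % 7 === 6) {
            // After every 7th byte the carry is a full 7-bit value of its own.
            str += char(shift);
            shift = 0;
        }
    }
    str += char(shift);   // trailing carry; fromUTF8() relies on this last character
    return str;
}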

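// Reverses toUTF8(): rebuilds the byte array from the packed string.
function fromUTF8(str) {
    var utf8 = [], shift = 0, x = 0;
    for (var i = 0; i < str.length; i++) {
        // Undo the character mapping to get the 7-bit value back.
        var charcode = str.charCodeAt(i);
        if (charcode >= 0xe0) {
            charcode = UTF8convert.at(charcode - 0xe0);
        } else if (charcode >= 0xA1) {
            charcode = charcode - (0x44);
        } else {
            charcode = charcode - (0x21);
        }
        // The low bits of this value complete the previous byte.
        if ((i % 8) > 0) {
            shift = (charcode << (8 - (i % 8))) & 0xff;
            utf8[utf8.length - 1] += shift;
        }
        // The remaining bits start a new byte, except for every 8th character
        // and the final character, which only carry leftover bits.
        if ((i == 0 || ((i % 8) != 7)) && (i < str.length - 1)) {
            charcode = charcode >> (x % 7);
            x++;
            utf8.push(charcode);
        }
    }
    return utf8;
}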

Erin Rivas · Answer 9 · Mon Jan 22 2024 13:36:14 GMT+0800 (China Standard Time)

Due to the listed situation with JavaScript using UTF-16 and not UTF-8, I think the best solution, if UTF-8 is truly desired, would be to call compressToUint8Array() and encode the resulting output with some implementation-specific function.

This can probably be closed as a "won't implement".
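
As one possible shape for such an implementation-specific function, the sketch below turns a byte array into a URL-safe base64 string and back. The output is plain ASCII, so it is identical in UTF-8. The byte values stand in for the output of compressToUint8Array() and are made up for illustration, and the function names are hypothetical. Base64 costs about 4 characters per 3 bytes, which is the size trade-off already raised earlier in the thread.

```javascript
// Stand-in for the output of compressToUint8Array(); values are made up.
const compressedBytes = new Uint8Array([0, 255, 128, 10, 13, 26, 66]);

// Hypothetical helper: bytes -> URL-safe base64 text (ASCII only).
function bytesToBase64Url(bytes) {
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary)
    .replace(/\+/g, "-")
    .replace(/\//g, "_")
    .replace(/=+$/, ""); // drop padding so the text is safe inside a URL
}

// Hypothetical helper: URL-safe base64 text -> bytes.
function base64UrlToBytes(text) {
  let b64 = text.replace(/-/g, "+").replace(/_/g, "/");
  b64 += "=".repeat((4 - (b64.length % 4)) % 4); // restore padding
  const binary = atob(b64);
  return Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
}

const encoded = bytesToBase64Url(compressedBytes);
const decoded = base64UrlToBytes(encoded);
console.log(encoded.length); // 10 characters for 7 bytes
console.log(decoded.length); // 7
```

Where the transport allows it, sending the Uint8Array itself as a binary body avoids the text encoding step altogether.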

Ryc O'Chet · Answer 10 · Mon Jan 22 2024 17:16:34 GMT+0800 (China Standard Time)

Not sure - NodeJS is supposed to support it better, but we need compatibility between front-end and back-end - there is TextEncoder and TextDecoder to play with once everything else is updated properly - holding off closing until we've had a chance :-P