pieroxy / lz-string

LZ-based compression algorithm for JavaScript

Can utf-8 codes be avoided in the output?

marwan-nwh opened this issue

Hi,

I am using this library in an ERP system called NetSuite.

When I get output from LZString.compress such as 㚆ࡠ왐¸ὀ阁㌀룀☁᠄ᩈ恅唐聛ō瀿\x18ţ怂Ƅސȭ쀌먈\x8D퀁띌鴣Î\uDC71큣\x0E넕챞劜⑩逄켿搀ⴘɳ쏆, NetSuite saves it as 㚆ࡠ왐¸ὀ阁㌀룀☁᠄ᩈ恅唐聛ō瀿ţ怂Ƅސȭ쀌먈퀁띌鴣Î?큣넕챞劜⑩逄켿搀ⴘɳ..., and that makes the decompress function return null.

Notice how the code \x8D, for example, was converted to ?
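
A minimal sketch of the failure mode; the substitution regex is my guess at what a lossy store like NetSuite does to code units it cannot represent, not its actual rule:

```js
const LZString = require("lz-string"); // npm install lz-string

const compressed = LZString.compress('{"name": "Adam", "age": 30}');

// Simulate a store that replaces control characters and lone surrogates
// with "?" (an assumption about NetSuite's behavior):
const mangled = compressed.replace(/[\x00-\x1F\x7F-\x9F\uD800-\uDFFF]/g, "?");

console.log(LZString.decompress(compressed)); // '{"name": "Adam", "age": 30}'
// If any code unit was replaced, the stream is corrupted and decompress
// typically returns null:
console.log(LZString.decompress(mangled));
```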

Is there a way to avoid this other than using other methods that produce longer text?

Thanks

Short answer: no

Longer answer: in order to compress something, the library also needs to be able to decompress it again. If the output had to avoid "special" codes such as those in UTF-8, the library would need to understand all of them on both sides. Unfortunately, Unicode is an evolving standard, so that would mean the scheme works at the time of coding (probably!) and then breaks at some point in the future, when new codes are added and given special meaning (or, conversely, when unknown codes that currently get swapped with ? gain a glyph and suddenly start behaving differently).

If your data is getting zlib-compressed in transit (i.e., browser compression), then using base64 won't make much of a difference in size.
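
To make the "binary blob" nature concrete, here is a small sketch that dumps the raw 16-bit code units compress emits; any value can appear, including lone surrogates and control characters that plain-text storage may reject:

```js
const LZString = require("lz-string");

const out = LZString.compress('{"name": "Adam", "age": 30}');

// Each output character is just 16 bits of packed compression data. Values
// in the surrogate range (0xD800-0xDFFF) or the control ranges are fine as
// packed bits, but invalid as text, which is what trips up UTF-8 storage.
for (let i = 0; i < out.length; i++) {
  console.log("0x" + out.charCodeAt(i).toString(16).padStart(4, "0"));
}
```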

I tried compressToUTF16, and in the cases I ran through it, it didn't produce codes like that. But that doesn't mean it will never produce UTF-8 codes, right?

What if I replace the \x codes with something unique before saving, then change them back before decompressing? I think this would work if there are characters that are guaranteed NOT to be generated in the compression output by the library. Are there such characters?
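
One way to test the reserved-character idea empirically (a rough sketch, not a proof; it only shows the trend):

```js
const LZString = require("lz-string");

// Compress many random inputs and record every code unit that appears in
// the output. If some characters were guaranteed never to occur, the set
// would plateau well below 65536; in practice it keeps growing.
const seen = new Set();
for (let i = 0; i < 10000; i++) {
  const input = Math.random().toString(36).slice(2).repeat(20);
  const out = LZString.compress(input);
  for (let j = 0; j < out.length; j++) seen.add(out.charCodeAt(j));
}
console.log(seen.size, "distinct code units observed");
```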

It's a plain encoded binary data blob, so every character is possible, and that means whatever you replace it with is also potentially already there. I don't know NetSuite, but given that most backend technologies use UTF-8 rather than UTF-16, it definitely sounds like a development headache trying to account for these things.

Got it.

I will use compressToBase64 or compressToEncodedURIComponent, and try to be satisfied with the amount of compression 😢

Thanks for your explanation!
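
For reference, a sketch of the two safe variants round-tripping (compressToUTF16 is a third option when the store reliably preserves UTF-16):

```js
const LZString = require("lz-string");

const input = '{"name": "Adam", "age": 30}';

// Both variants restrict output to characters that survive plain-text
// storage, trading some density for safety:
const b64 = LZString.compressToBase64(input);
const uri = LZString.compressToEncodedURIComponent(input);

console.log(LZString.decompressFromBase64(b64) === input);              // true
console.log(LZString.decompressFromEncodedURIComponent(uri) === input); // true
console.log(b64.length, uri.length); // base64 is padded, so often slightly longer
```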

compressToCustom and decompressFromCustom should work for your usecase, if you can provide the string of allowed characters.

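A hedged sketch of that API, going by the two-argument signature used later in this thread; these functions are experimental, so check the current source for the exact signatures:

```js
const LZString = require("lz-string");

// Any set of characters known to survive the target store can serve as the
// dictionary; larger dictionaries pack more bits per output character.
const dictionary =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

const packed = LZString.compressToCustom('{"name": "Adam"}', dictionary);
const restored = LZString.decompressFromCustom(packed, dictionary);
console.log(restored === '{"name": "Adam"}'); // true, given matching dictionaries
```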

I tried it, but the length of the output is larger than compressToEncodedURIComponent's output. I thought it would be as small as compress, with the UTF-8 codes replaced by the characters I am passing in.

Can you provide the code you used? Since compressToEncodedURIComponent only uses 65 characters, a compressToCustom with almost all UTF-8 characters should be significantly smaller...
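
Back-of-the-envelope, assuming the encoder packs a whole number of bits per output character (floor(log2(k)) for a k-character dictionary):

```js
// Approximate bits carried by each output character for a dictionary of
// size k, assuming whole bits per character:
const bitsPerChar = (k) => Math.floor(Math.log2(k));

console.log(bitsPerChar(65));    // 6,  the EncodedURIComponent alphabet
console.log(bitsPerChar(62));    // 5,  a 62-character alphanumeric dictionary
console.log(bitsPerChar(40000)); // 15, a large custom UTF-16-safe dictionary
```

If that assumption holds, a 62-character dictionary carries only 5 bits per character and can actually produce longer output than the 6-bit URI encoding, which would be consistent with the result reported below.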

I identified an issue with the custom encoders (using base64 as a proof of concept) and am marking them as experimental until this is fixed. It's an awkward one, but they're only implemented here, and as long as the output isn't going into something that stores the encoded text, they can safely decompress their own compressed content.

Sorry for the late reply.
I used ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789

This is a simple example:

```js
LZString.compress(`{"name": "Adam", "age": 30}`)
// => '㞂⁶ࡠ똊戅쀂တ䀦턀榑Ι쒀찀》'

LZString.compressToEncodedURIComponent(`{"name": "Adam", "age": 30}`)
// => 'N4IgdghgtgpiBcACEBBAJtEAaZEDmcSAzAAwC+QA'

LZString.compressToCustom(`{"name": "Adam", "age": 30}`, `ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`)
// => 'DrMCKCAikMHoGgtMwyBEUEQ2N48HB3AO1NFWNkUDMXPLa'
```

@marwan-nwh Check #200. The short answer is that it's deterministic, but not the expected output when using a dictionary whose length is a perfect binary length (a power of two). Your comparison is also not valid, as EncodedURIComponent does extra processing; you should compare against plain base64.
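
For what it's worth, a sketch of the suggested apples-to-apples check; compressToCustom is experimental, and per #200 a 64-character dictionary hits the power-of-two quirk, so treat the output as illustrative:

```js
const LZString = require("lz-string");

const input = '{"name": "Adam", "age": 30}';
const base64Alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Compare the custom encoder against plain base64 rather than the URI-safe
// variant, which applies extra processing on top:
console.log(LZString.compressToBase64(input).length);
console.log(LZString.compressToCustom(input, base64Alphabet).length);
```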