pieroxy / lz-string

LZ-based compression algorithm for JavaScript

Can utf-8 codes be avoided in the output?

marwan-nwh opened this issue

Hi,

I am using this library in an ERP system called NetSuite.

When I get output from LZString.compress such as 㚆ࡠ왐¸ὀ阁㌀룀☁᠄ᩈ恅唐聛ō瀿\x18ţ怂Ƅސȭ쀌먈\x8D퀁띌鴣Î\uDC71큣\x0E넕챞劜⑩逄켿搀ⴘɳ쏆, NetSuite saves it as 㚆ࡠ왐¸ὀ阁㌀룀☁᠄ᩈ恅唐聛ō瀿ţ怂Ƅސȭ쀌먈퀁띌鴣Î?큣넕챞劜⑩逄켿搀ⴘɳ..., and that makes the decompress function return null.

Notice how the code \x8D, for example, was converted to ?
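
A minimal sketch of the failure mode; the substitution regex is my guess at what a lossy store like NetSuite does to code units it cannot represent, not its actual rule:

```js
const LZString = require("lz-string"); // npm install lz-string

const compressed = LZString.compress('{"name": "Adam", "age": 30}');

// Simulate a store that replaces control characters and lone surrogates
// with "?" (an assumption about NetSuite's behavior):
const mangled = compressed.replace(/[\x00-\x1F\x7F-\x9F\uD800-\uDFFF]/g, "?");

console.log(LZString.decompress(compressed)); // '{"name": "Adam", "age": 30}'
// If any code unit was replaced, the stream is corrupted and decompress
// typically returns null:
console.log(LZString.decompress(mangled));
```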

Is there a way to avoid this other than using other methods that produce longer text?

Thanks

Short answer: no

Longer answer: in order to compress something, the library also needs to be able to decompress it again. If the output had to avoid "special" codes such as those in UTF-8, the library would need to understand all of them on both sides. Unfortunately, Unicode is an evolving standard, so that would mean the scheme works at the time of coding (probably!) and then breaks at some point in the future, when new codes are added and given special meaning (or, conversely, when unknown codes that currently get swapped with ? gain a glyph and suddenly start behaving differently).

If your data is getting zlib-compressed in transit (i.e., browser compression), then using base64 won't make much of a difference in size.
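
To make the "binary blob" nature concrete, here is a small sketch that dumps the raw 16-bit code units compress emits; any value can appear, including lone surrogates and control characters that plain-text storage may reject:

```js
const LZString = require("lz-string");

const out = LZString.compress('{"name": "Adam", "age": 30}');

// Each output character is just 16 bits of packed compression data. Values
// in the surrogate range (0xD800-0xDFFF) or the control ranges are fine as
// packed bits, but invalid as text, which is what trips up UTF-8 storage.
for (let i = 0; i < out.length; i++) {
  console.log("0x" + out.charCodeAt(i).toString(16).padStart(4, "0"));
}
```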

I tried compressToUTF16, and in the cases I ran through it, it didn't produce codes like that. But that doesn't mean it will never produce UTF-8 codes, right?

What if I replace the \x codes with something unique before saving, then change them back before decompressing? I think this would work if there are characters that are guaranteed NOT to be generated in the compression output by the library. Are there such characters?
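
One way to test the reserved-character idea empirically (a rough sketch, not a proof; it only shows the trend):

```js
const LZString = require("lz-string");

// Compress many random inputs and record every code unit that appears in
// the output. If some characters were guaranteed never to occur, the set
// would plateau well below 65536; in practice it keeps growing.
const seen = new Set();
for (let i = 0; i < 10000; i++) {
  const input = Math.random().toString(36).slice(2).repeat(20);
  const out = LZString.compress(input);
  for (let j = 0; j < out.length; j++) seen.add(out.charCodeAt(j));
}
console.log(seen.size, "distinct code units observed");
```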

It's a plain encoded binary data blob, so every character is possible, and that means whatever you replace it with is also potentially already there. I don't know NetSuite, but given that most backend technologies use UTF-8 rather than UTF-16, it definitely sounds like a development headache trying to account for these things.

Got it.

I will use compressToBase64 or compressToEncodedURIComponent, and try to be satisfied with the amount of compression 😢

Thanks for your explanation!
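
For reference, a sketch of the two safe variants round-tripping (compressToUTF16 is a third option when the store reliably preserves UTF-16):

```js
const LZString = require("lz-string");

const input = '{"name": "Adam", "age": 30}';

// Both variants restrict output to characters that survive plain-text
// storage, trading some density for safety:
const b64 = LZString.compressToBase64(input);
const uri = LZString.compressToEncodedURIComponent(input);

console.log(LZString.decompressFromBase64(b64) === input);              // true
console.log(LZString.decompressFromEncodedURIComponent(uri) === input); // true
console.log(b64.length, uri.length); // base64 is padded, so often slightly longer
```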

compressToCustom and decompressFromCustom should work for your usecase, if you can provide the string of allowed characters.

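A hedged sketch of that API, going by the two-argument signature used later in this thread; these functions are experimental, so check the current source for the exact signatures:

```js
const LZString = require("lz-string");

// Any set of characters known to survive the target store can serve as the
// dictionary; larger dictionaries pack more bits per output character.
const dictionary =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

const packed = LZString.compressToCustom('{"name": "Adam"}', dictionary);
const restored = LZString.decompressFromCustom(packed, dictionary);
console.log(restored === '{"name": "Adam"}'); // true, given matching dictionaries
```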

I tried it, but the length of the output is larger than compressToEncodedURIComponent's output. I thought it would be as small as compress, with the UTF-8 codes replaced by the characters I am passing in.

Can you provide the code you used? Since compressToEncodedURIComponent only uses 65 characters, a compressToCustom with almost all UTF-8 characters should be significantly smaller...
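
Back-of-the-envelope, assuming the encoder packs a whole number of bits per output character (floor(log2(k)) for a k-character dictionary):

```js
// Approximate bits carried by each output character for a dictionary of
// size k, assuming whole bits per character:
const bitsPerChar = (k) => Math.floor(Math.log2(k));

console.log(bitsPerChar(65));    // 6,  the EncodedURIComponent alphabet
console.log(bitsPerChar(62));    // 5,  a 62-character alphanumeric dictionary
console.log(bitsPerChar(40000)); // 15, a large custom UTF-16-safe dictionary
```

If that assumption holds, a 62-character dictionary carries only 5 bits per character and can actually produce longer output than the 6-bit URI encoding, which would be consistent with the result reported below.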

I identified an issue with the custom encoders (using base64 as a proof of concept) and am marking them as experimental until this is fixed. It's an awkward one, but they're only implemented here, and as long as the output isn't going into something that stores the encoded text, they can safely decompress their own compressed content.

Sorry for the late reply.
I used ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789

This is a simple example:

```js
LZString.compress(`{"name": "Adam", "age": 30}`)
// => '㞂⁶ࡠ똊戅쀂တ䀦턀榑Ι쒀찀》'

LZString.compressToEncodedURIComponent(`{"name": "Adam", "age": 30}`)
// => 'N4IgdghgtgpiBcACEBBAJtEAaZEDmcSAzAAwC+QA'

LZString.compressToCustom(`{"name": "Adam", "age": 30}`, `ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789`)
// => 'DrMCKCAikMHoGgtMwyBEUEQ2N48HB3AO1NFWNkUDMXPLa'
```

@marwan-nwh Check #200. The short answer is that it's deterministic, but not the expected output when using a dictionary whose length is a perfect binary length (a power of two). Your comparison is also not valid, as EncodedURIComponent does extra processing; you should compare against plain base64.
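
For what it's worth, a sketch of the suggested apples-to-apples check; compressToCustom is experimental, and per #200 a 64-character dictionary hits the power-of-two quirk, so treat the output as illustrative:

```js
const LZString = require("lz-string");

const input = '{"name": "Adam", "age": 30}';
const base64Alphabet =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Compare the custom encoder against plain base64 rather than the URI-safe
// variant, which applies extra processing on top:
console.log(LZString.compressToBase64(input).length);
console.log(LZString.compressToCustom(input, base64Alphabet).length);
```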