canonicalize modifies an unencoded string

Question

krinsane opened this issue 12 years ago · comments

Krimy Amichandwala commented 12 years ago

In other words, it treats a string as encoded when it actually is not. As a result, if I do something like

$.encoder.encodeForHTML($.encoder.canonicalize(string))

it gives me back a different string.

The string in question is something like this: "sdf\sdf\sdf"

Canonicalize transforms it into this: sdf�sdf�sdf

Chris Schmidt · Answer 1 · Fri Mar 02 2012 22:45:59 GMT+0800 (China Standard Time)

I'm not sure there is a way around this problem. \s will be considered a control character and will be decoded by canonicalization. Even if you were to escape the \ in \s, it would still be normalized and decoded on the subsequent pass. Is there any other character that could be used in place of the backslash, which is the control character marker for most programming languages? Changing the encoder would allow an attacker to pass control characters using multiple-encoding attacks, which is less than ideal.
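To illustrate the "subsequent pass" point, here is a minimal sketch of a canonicalizer that keeps applying its decoders until the string stops changing. The names decodePercent, decodeBackslash and canonicalizeSketch are hypothetical, and the decoders are simplified stand-ins rather than the library's actual codecs, so the exact output of the real canonicalize may differ.

```javascript
// Simplified stand-ins for illustration only; NOT the library's codecs.

// Decode %XX percent-encoded sequences.
function decodePercent(s) {
  return s.replace(/%([0-9a-fA-F]{2})/g, function (match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Decode CSS-style escapes: a backslash followed by 1-6 hex digits becomes
// that code point; a backslash before any other character is dropped.
function decodeBackslash(s) {
  return s.replace(/\\([0-9a-fA-F]{1,6}|[\s\S])/g, function (match, body) {
    if (!/^[0-9a-fA-F]+$/.test(body)) {
      return body;
    }
    var codePoint = parseInt(body, 16);
    // Guard against values outside the Unicode range.
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : "\uFFFD";
  });
}

// Apply every decoder repeatedly until a full pass changes nothing.
function canonicalizeSketch(input) {
  var decoders = [decodePercent, decodeBackslash];
  var current = input;
  var previous;
  do {
    previous = current;
    decoders.forEach(function (decode) {
      current = decode(current);
    });
  } while (current !== previous);
  return current;
}

// A literal backslash before "e" is read as the hex escape \e (U+000E).
var plain = canonicalizeSketch("c:\\ext");
// Percent-encoding the backslash, once or twice, ends up in the same place.
var escapedOnce = canonicalizeSketch("c:%5Cext");
var escapedTwice = canonicalizeSketch("c:%255Cext");
console.log(plain === escapedOnce && plain === escapedTwice); // true
```

In this sketch all three inputs collapse to the same string containing a control character, which is why escaping the backslash on the way in does not help.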

Krimy Amichandwala · Answer 2 · Sat Mar 03 2012 08:51:34 GMT+0800 (China Standard Time)

I think a way around this is to provide a second canonicalize API intended for code. My use case is a regex typed into an input field. People could then choose which canonicalize function to use for values where code is expected. The same encoder is fine.

Krimy Amichandwala · Answer 3 · Sat Mar 03 2012 08:55:38 GMT+0800 (China Standard Time)

To continue on my last comment:

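Let's say I have a wrapper function encodeForCode.

It would contain the following:

function encodeForCode(string) {
    // canonicalizeForCode is the proposed new API; it does not exist yet
    return $.encoder.encodeForHTML($.encoder.canonicalizeForCode(string));
}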

Chris Schmidt · Answer 4 · Wed Mar 21 2012 09:10:00 GMT+0800 (China Standard Time)

By its nature, canonicalization is intended to reduce a string to its simplest form, that is, to replace any escaped characters with their character representations, so there is only one canonicalize function. I'm not sure I see a use for more than that. I can, however, see a use case for allowing customization of the codecs that are used for canonicalization.

So basically you would be able to customise the behavior of canonicalization and what it interprets as a control character.

Like this:

function encodeForCode(strInput) {
   // Proposed signature: canonicalize does not accept a codecs option yet
   return $.encoder.encodeForHTML($.encoder.canonicalize({input: strInput, codecs: [ new HTMLEntityCodec(), new PercentCodec() ]}));
}

This would stop the \ from being interpreted as a control character and canonicalized, since that is a CSS escaping syntax.

Chris Schmidt · Answer 5 · Sat Dec 12 2015 04:58:01 GMT+0800 (China Standard Time)

Hey @stuartf - trying to close the loop on some of these older issues. Does the suggested fix accommodate your requirements?

D. Stuart Freeman · Answer 6 · Tue Dec 15 2015 05:45:40 GMT+0800 (China Standard Time)

@nicolaasmatthijs @simong did we work around this somehow, or is it still a problem for oae?

desertdev · Answer 7 · Fri May 06 2016 02:09:41 GMT+0800 (China Standard Time)

I just ran into the exact same problem today when a legitimate user input string contained backslashes (an attempt to share a Windows file path, e.g. "c:\ext").

Going to look into the above suggestion by @chrisisbeef and will post the outcome.

desertdev · Answer 8 · Fri May 06 2016 03:11:25 GMT+0800 (China Standard Time)

Avoiding the CSSCodec in the canonicalize function worked for me.

Note to anyone else experiencing this problem:
The example code above by @chrisisbeef is an incomplete hypothetical customization. The current canonicalize function has the codecs var hard-coded to use all 3 codecs. If you want to pass in different codecs as in the example above, the canonicalize function also needs to be modified.
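As a rough sketch of what such a modification could look like, the function below accepts either a plain string or an options object with an optional codecs list and falls back to a default set otherwise. Everything here is an assumption for illustration: canonicalizeWith, the decode(string) codec interface and the three toy codecs are hypothetical stand-ins, not the library's real HTMLEntityCodec, PercentCodec or CSSCodec.

```javascript
// Hypothetical codec objects exposing decode(string) -> string.
// Toy stand-ins only; the library's real codec classes may look different.
var htmlEntityLike = {
  // Decodes decimal numeric entities such as &#60;
  decode: function (s) {
    return s.replace(/&#(\d+);/g, function (match, dec) {
      return String.fromCharCode(parseInt(dec, 10));
    });
  }
};

var percentLike = {
  // Decodes %XX sequences
  decode: function (s) {
    return s.replace(/%([0-9a-fA-F]{2})/g, function (match, hex) {
      return String.fromCharCode(parseInt(hex, 16));
    });
  }
};

var cssLike = {
  // Decodes a backslash followed by 1-4 hex digits
  decode: function (s) {
    return s.replace(/\\([0-9a-fA-F]{1,4})/g, function (match, hex) {
      return String.fromCharCode(parseInt(hex, 16));
    });
  }
};

// Default mirrors the "all 3 codecs" behaviour described above.
var DEFAULT_CODECS = [htmlEntityLike, percentLike, cssLike];

// Accepts a string, or { input: string, codecs: [codec, ...] }.
function canonicalizeWith(options) {
  var opts = typeof options === "string" ? { input: options } : options;
  var codecs = opts.codecs || DEFAULT_CODECS;
  var current = opts.input;
  var previous;
  do {
    previous = current;
    codecs.forEach(function (codec) {
      current = codec.decode(current);
    });
  } while (current !== previous);
  return current;
}

// With the default codecs the Windows path is altered ...
var altered = canonicalizeWith("c:\\ext");
// ... while leaving out the CSS-like codec keeps the backslash intact.
var kept = canonicalizeWith({ input: "c:\\ext", codecs: [htmlEntityLike, percentLike] });
console.log(kept === "c:\\ext"); // true
```

The trade-off raised earlier in the thread still applies: with the backslash codec left out, backslash-escaped input is passed through undecoded, so this only suits fields where such input is expected.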