whatwg / html

HTML Standard

Home Page:https://html.spec.whatwg.org/multipage/

Specify how document.cookie diverges from [COOKIES] RFC

domenic opened this issue

Currently the spec says

the user agent must act as it would when receiving a set-cookie-string for the document's address via a "non-HTTP" API, consisting of the new value encoded as UTF-8.

However, in the real world things like document.cookie = "foo" work and have an effect. There are probably many other possibilities; in general the RFC just has a grammar that things might not match, whereas I imagine browsers just accept anything and try to make sense of it, even if it fails to match the grammar.
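To make the gap concrete, here is a minimal sketch of the name-value-pair steps from RFC 6265 Section 5.2, next to a lenient variant. The names `trimWsp`, `parseNameValueStrict` and `parseNameValueLenient` are hypothetical, and attribute handling is omitted. The lenient rule (treat a string without `=` as a cookie with an empty name) is an assumption for illustration only, not a claim about what any particular browser does.

```javascript
// Strip leading/trailing WSP (space and tab), as the RFC algorithm does.
const trimWsp = (s) => s.replace(/^[ \t]+|[ \t]+$/g, "");

// Sketch of the name-value-pair steps in RFC 6265 Section 5.2.
// Returns null where the RFC says to ignore the set-cookie-string entirely.
function parseNameValueStrict(setCookieString) {
  // Everything before the first ";" is the name-value-pair.
  const pair = setCookieString.split(";")[0];
  const eq = pair.indexOf("=");
  if (eq === -1) return null; // no "=": ignore entirely
  const name = trimWsp(pair.slice(0, eq));
  const value = trimWsp(pair.slice(eq + 1));
  if (name === "") return null; // empty name: ignore entirely
  return { name, value };
}

// Hypothetical lenient variant (an assumption, not taken from any browser):
// a pair without "=" becomes a cookie with an empty name.
function parseNameValueLenient(setCookieString) {
  const pair = setCookieString.split(";")[0];
  const eq = pair.indexOf("=");
  const name = eq === -1 ? "" : trimWsp(pair.slice(0, eq));
  const value = trimWsp(eq === -1 ? pair : pair.slice(eq + 1));
  if (name === "" && value === "") return null;
  return { name, value };
}

console.log(parseNameValueStrict("foo"), parseNameValueLenient("foo"));
```

Under the strict steps, "foo" is dropped, while the lenient variant keeps it. That is the kind of divergence the issue asks to have specified.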

@bsittler noticed this while working on some service worker cookie stuff, and previously it has come up in the jsdom project and its related tough-cookie helper:

@Sebmaster and @inikulin led the charge for this in jsdom, so maybe they could help us spec the correct behavior for how document.cookie parses cookies? Alternately, looking at open-source browser code would get us pretty far.

This might be a compat issue if everyone hasn't managed to magically converge on a single behavior despite the lack of precise spec. Tentatively tagging as such for now.

I'd love to help!

Here is what we can do:

  • Create a test runner for the IETF test suite that produces output in a machine-readable format. Currently the suite can run only individual tests, or requires dev builds of the browsers in some cases, which renders it unusable for testing IE and Edge. It also can't produce machine-readable reports at the moment. We will need them to aggregate and analyze results across browsers later. I'm already working on it.
  • Run tests in all major browsers:
    • Chrome
    • Safari
    • Firefox
    • IE
    • Edge
  • Using the aggregated test failure info, we can build a table in this format:
    Test case / browser | Expected | Chrome | Firefox | Safari
    "foo="              | ""       | "foo"  | "foo"   | ""
  • Triage failures into groups, e.g. if a test fails in the majority of browsers, consider that the de facto behavior and add the difference to the spec. For the minority cases, consult the developers or search the issue trackers to find the motivation behind the behavior.
  • Modify the IETF test suite along the way to align it with the proposed behavior. Make it the default test suite for the spec.
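The triage step could be sketched roughly as follows. The row shape (`test`, `expected`, `actual`) and the `triage` function are hypothetical, as are the group labels. The sample row only mirrors the table format shown in the list and is not a measured result.

```javascript
// Hypothetical result shape: one row per test case, holding the expected
// value and the value each browser actually produced.
function triage(row) {
  const browsers = Object.keys(row.actual);
  const failing = browsers.filter((b) => row.actual[b] !== row.expected);
  let group;
  if (failing.length === 0) {
    group = "pass";
  } else if (failing.length > browsers.length / 2) {
    group = "de facto"; // most browsers agree on deviating from the spec
  } else {
    group = "follow-up"; // minority deviation: ask implementors, check trackers
  }
  return { test: row.test, failing, group };
}

// Illustrative row copied from the table format, not real test data.
const example = {
  test: "foo=",
  expected: "",
  actual: { Chrome: "foo", Firefox: "foo", Safari: "" },
};

console.log(triage(example));
```

A strict majority threshold is just one possible cut-off. The real grouping would presumably also weigh whether the deviating browsers agree with each other.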

I will try to provide you with test results sometime next week.

Cool! Another thing here worth checking is <meta http-equiv=set-cookie>. If these invalid values still result in HTTP headers, it's likely the RFC will need to be updated somehow.

@annevk AFAIK browsers use the same code for all cookie parsing scenarios. Spec violations in the document.cookie setter also show up when you set a cookie via an HTTP header. I'm pretty sure we will have the same results with <meta>.

I see, in that case it seems like something @mikewest and @mnot should be solving in the RFC. Your testing will still be useful, obviously, but given the scope of the problem it does not seem like something that needs to be addressed in the HTML Standard. Although I can understand if we need to make adjustments for a revised RFC that does handle this properly.

Although I can understand if we need to make adjustments for a revised RFC that does handle this properly.

So we will continue the discussion here for now, and once we have some data and analysis we will ping the IETF guys, I guess?

Very good timing. We're about to start opening up the cookie RFC, so yes do ping us when you have some results. Any idea how long that will be?

@inikulin, yeah, we'll keep this open until the issue is resolved. @mnot, @inikulin mentioned earlier he was hoping to have something this week.

OMG, this is amazing!!

@inikulin this is really sobering. Thank you! What was the effective document charset for the test page?

FYI test runner sources are here: https://github.com/inikulin/cookie-compat

Thank you guys for all the kind words, I hope you will find it useful.

Further steps:

  • Add expires= date parsing tests. They are in a separate test suite and require conversion. (I just realized that there is no way to access the parsed expiration date.)
  • Currently we don't have a reference implementation, which bothers me. I will try to create one based on tough-cookie. Actually, tough-cookie is implemented nearly per spec with just some minor relaxations (e.g. the character restrictions for the token are ignored).
  • Report issues for the obvious bugs to implementors and reference them in the table.

Wow indeed, really great stuff!

It seems to me that the first 17 tests could be brought into (at least rough) interop with a fairly simple spec change to Section 5.2. The remaining tests demonstrate enough interop that they look more like browser bugs to me.

That's assuming that all of the browsers don't want to fix the underlying bugs in the first 17 tests, of course. It'd be very useful to know how much content on the Web currently relies upon this behaviour, but gathering that data is likely to be problematic...

If we do want to change the spec, someone will need to write up an Internet-Draft describing the proposed changes. I can help with that.

@inikulin would you mind pinging the HTTP-WG about this on its mailing list https://lists.w3.org/Archives/Public/ietf-http-wg/? If you don't want to subscribe, I can forward a message for you, or you could even just open up a bug at https://github.com/httpwg/http-extensions/issues. I just want to make sure that you get credit for this awesome work.

@inikulin what was the system codepage for Edge and IE? Have you tried changing it? If https://stackoverflow.com/questions/1969232/allowed-characters-in-cookies is to be believed, non-ASCII characters may "work" in IE when they are present in the system codepage, where "work" means they will be wire-encoded in that codepage (never UTF-8, since Windows system codepage can't be set to 65001) but exposed to JavaScript using the corresponding Unicode characters. I'd be especially interested to see the results for systems with larger-coverage (CJK?) or non-1252 system codepages.

Likewise, have you tried server-generated cookies with encodings other than UTF-8, e.g. latin-1?

Nope, I haven't adjusted the Windows code page for the tests. I'll try to run with codepages that have a bigger character set tomorrow at work, because I don't currently have access to a Windows machine.

Likewise, have you tried server-generated cookies with encodings other than UTF-8, e.g. latin-1?

Nope

One more thought: it may be worth checking both reading and writing behavior of the backslash \u005c \ and yen sign \u00a5 ¥ in cookies on the server side, from HTML (meta http-equiv=set-cookie) and from document.cookie across Latin 1, UTF-8 and Shift JIS/CP 932 document encodings and with both US English and Japanese system codepages in effect. It's a large matrix, but it may uncover some useful information about how browsers currently interoperate (or don't) in the presence of incompatible character encodings. In particular it would be good to know whether backslash is reliably round-tripped under all these circumstances and whether or not it is ever remapped to a non-ASCII character.

Same question for tilde \u007e ~ and wave dash \u301c actually.

(I'm asking these oddly specific questions because I'm wondering whether all of printable ASCII other than semicolon is actually safe in cookie values across browsers) Edit: names too (barring equal sign of course)
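For reference, the cookie-octet production in RFC 6265 Section 4.1.1 is stricter than "everything except semicolon". A small sketch listing which printable ASCII characters that grammar excludes from cookie values (the helper name `isCookieOctet` is made up for this example, and this says nothing about what browsers actually accept):

```javascript
// cookie-octet per RFC 6265 Section 4.1.1:
//   %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
function isCookieOctet(code) {
  return (
    code === 0x21 ||
    (code >= 0x23 && code <= 0x2b) ||
    (code >= 0x2d && code <= 0x3a) ||
    (code >= 0x3c && code <= 0x5b) ||
    (code >= 0x5d && code <= 0x7e)
  );
}

// Collect the printable ASCII characters (0x20-0x7E) the grammar excludes.
const excluded = [];
for (let code = 0x20; code <= 0x7e; code++) {
  if (!isCookieOctet(code)) excluded.push(String.fromCharCode(code));
}

console.log(JSON.stringify(excluded));
```

The excluded set is space, double quote, comma, semicolon and backslash, so backslash is already outside the grammar even before any codepage remapping comes into play.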

Edit: Also, in the meta http-equiv case, are the results the same for raw document-charset characters vs. HTML-entified versions?

more edit: Yet another IE-specific question: does document.cookie in IE (and Edge?) round-trip Unicode when the characters are first converted to bytes? e.g. document.cookie = unescape(encodeURIComponent('test=三猿🙈🙉🙊')) and decodeURIComponent(escape(document.cookie)) [or the (better) TextDecoder/TextEncoder equivalents except there's no TextDecoder/TextEncoder in IE]
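For engines that do have TextEncoder/TextDecoder (so not IE), the byte round-trip mentioned above could look roughly like this. The helper names `toByteString` and `fromByteString` are hypothetical, and `fromByteString` assumes every code unit in its input is at most 0xFF.

```javascript
// Encode a string as UTF-8, then expose each byte as one UTF-16 code unit
// (a "binary string"), mirroring unescape(encodeURIComponent(s)).
function toByteString(s) {
  const bytes = new TextEncoder().encode(s);
  let out = "";
  for (const b of bytes) out += String.fromCharCode(b);
  return out;
}

// Reverse: treat each code unit as a byte and decode the bytes as UTF-8,
// mirroring decodeURIComponent(escape(s)).
// Assumes every code unit in s is <= 0xFF.
function fromByteString(s) {
  const bytes = Uint8Array.from(s, (ch) => ch.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

const original = "test=三猿🙈🙉🙊";
const wire = toByteString(original);
console.log(fromByteString(wire) === original);
```

In a browser, the test would then be `document.cookie = toByteString(original)` followed by `fromByteString(document.cookie)` after a reload.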

@bsittler

I'd be especially interested to see the results for systems with larger-coverage (CJK?) or non-1252 system codepages.

I've added results for IE and Edge with system codepage 950 (big5) and 932 (shift_jis): http://inikulin.github.io/cookie-compat/ (spoiler: it didn't work out)

Regarding #804 (comment): if you don't mind, I will work on it later, because I'm really running out of spare time at the moment. I've created an issue in cookie-compat for this task so I don't forget about it: inikulin/cookie-compat#3

Thank you very much

On Wed, Jun 22, 2016, 05:21 Ivan Nikulin notifications@github.com wrote:

@bsittler https://github.com/bsittler

I'd be especially interested to see the results for systems with
larger-coverage (CJK?) or non-1252 system codepages.

I've added results for IE and Edge with system codepage 950 (big5) and 932
(shift_jis): http://inikulin.github.io/cookie-compat/ (spoiler: it didn't
work out)

Regarding #804 (comment) if you
wouldn't mind, I will work on it later, because I'm really running out of
spare time currently. I've created issue in cookie-compat for this task to
not forget about it: inikulin/cookie-compat#3



On Windows 7 with a US English system locale running IE 9, JavaScript-written cookies subsequently read from JavaScript seem to reliably round-trip characters whose ISO 8859-1 encodings fall in the ISO 2022 GR range (0xA0 ... 0xFF) in addition to most of printable ASCII. This seems to be the case regardless of the document character encoding. Additionally, I tried a few characters whose Windows-1252 encodings fall in the ISO 2022 C1 range (0x80 ... 0x9F) and they appear to round-trip successfully, too. Characters not representable in Windows-1252 are apparently converted to question mark (other printable characters) or dropped (ASCII control characters.)

I have not yet tested with a different system locale.

I suspect that cookies are simply serialized in the IE cookie jar using the default codepage of the system locale.

Indeed, after switching the system locale to Japanese (with "ANSI" and "OEM" codepages both switched to 932) and rebooting, cookies behave exactly as if they are being stored in CP932 (approximately Shift JIS), with characters like Euro sign \u20ac converted to question mark and Japanese text preserved. This is independent of document charset, so the same Japanese text written by script running in a Shift JIS document is readable by script running in a UTF-8 document without mangling, and vice versa.

Wow, that is not something we want to standardize upon. How would that even work with code points that cannot be represented by the encoding?

It doesn't. They are converted to question marks (in other words, data is
lost.) Because it's based on the system "ANSI" code page it is however
somewhat likely that text entered by the user in the system locale's
primary language will round-trip successfully from script to script across
page loads. Compatibility with other modern browsers however seems to be
zero for non-ASCII text.

On Tue, Jun 28, 2016, 00:02 Anne van Kesteren notifications@github.com
wrote:

Wow, that is not something we want to standardize upon. How would that
even work with code points that cannot be represented by the encoding?



Just did a little further testing, and verified that even with explicit UTF-8 or UTF-16 (little-endian) byte-order marks in the cookie name and/or cookie value, IE and Edge still always interpret the cookie according to the system "ANSI" codepage. Non-ASCII cookie names and values set by the server are sent back to the server without mangling, so there's nothing to prevent a server from storing UTF-8 in a cookie (e.g. UTF-8 cookie names/values containing Ő [\xc5\x90] round-trip server-to-server via US English-locale Edge even though \x90 is nominally unmapped in Windows code page 1252), however scripts running in IE always misinterpret such cookies according to the system ANSI codepage (in this case the nominally unmapped byte is in fact exposed as-is to script, as '\x90'.)
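A small sketch of the byte-level view of the Ő example. Mapping each byte to the code point with the same number is ISO 8859-1, used here only as a stand-in for the single-byte interpretation. It is an assumption that this matches the system codepage for other bytes, although for 0xC5 and 0x90 it agrees with what is described above.

```javascript
// UTF-8 bytes for U+0150 (Ő): 0xC5 0x90, as cited above.
const utf8Bytes = new TextEncoder().encode("\u0150");

// Stand-in for a single-byte reading: one code point per byte, same number.
const misread = Array.from(utf8Bytes, (b) => String.fromCharCode(b)).join("");

// A script that knows the bytes were really UTF-8 can undo the misreading.
const recovered = new TextDecoder("utf-8").decode(
  Uint8Array.from(misread, (ch) => ch.charCodeAt(0))
);

console.log(
  Array.from(utf8Bytes, (b) => b.toString(16)),
  JSON.stringify(misread),
  recovered === "\u0150"
);
```

Recovery in the last step only works while the single-byte reading is lossless for every byte involved.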

Also, attempts to set cookies from scripts with "ANSI" code page-unrepresentable characters in their names and/or values do not always convert those to question marks - sometimes a different fallback is used. For instance, with a US English system locale document.cookie = 'Ő=Ő' results in O=O instead. I suspect it's using the default substitutions from WideCharToMultiByte.

I'm doubtful that further testing of IE/Edge's quirks is going to be helpful. We know they do weird stuff they would never put into a web spec.

Right, I was merely attempting to assess the compatibility risk of having the new API only support UTF-8 (and possibly also "raw byte array") interpretation for cookie data, which would be incompatible (in Edge) with the system "ANSI" codepage interpretation in document.cookie and <meta http-equiv="set-cookie" ...> but consistent with other browsers.

One "fun" thing I noticed today: document.cookie = 'foo' will add a trailing = in macOS WebKit, but not GTK+ WebKit.