binast / binjs-ref

Reference implementation for the JavaScript Binary AST format

Home Page: https://binast.github.io/binjs-ref/binjs/index.html

Requirements for user-provided dictionaries

Yoric opened this issue · comments

With the current context 0.1 format (reference implementation), we already achieve a compression level ~5% better than Brotli on Facebook, assuming a site-specific dictionary.

On the assumption that we are heading towards user-provided dictionaries, I'm opening this issue to discuss the requirements for such dictionaries.


Stuff that I believe everybody agrees on

(will be updated)

  • We want site-specific dictionaries, rather than e.g. framework-specific dictionaries.
  • The compressed files need to reference their dictionary.
  • The reference is probably not a URL.
  • We want the site to send an HTTP header signaling to the browser that it needs to start fetching the dictionary, without waiting for the compressed files to be received (see the sketch after this list).
  • A single file should be able to use several dictionaries.
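
As a sketch of the signaling idea from the list above: today's `Link` preload mechanism could already let the HTML response hint at the dictionary. The dictionary path below is hypothetical, and whether preload is the actual mechanism is an open question.

```ts
import { createServer } from "http";

// Sketch: the HTML response carries a preload hint so the browser starts
// fetching the dictionary before any BinAST-compressed response arrives.
// The dictionary path "/dict/site-v42.binjs-dict" is hypothetical.
createServer((req, res) => {
  res.setHeader("Link", "</dict/site-v42.binjs-dict>; rel=preload; as=fetch");
  res.setHeader("Content-Type", "text/html");
  res.end('<!doctype html><script src="app.binjs"></script>');
}).listen(8080);
```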

Open questions

  • Can we confirm that compression works as well for other websites?
  • What does "site-specific dictionary" mean, exactly? Which part of the browser is supposed to enforce this policy?
  • Do we want to require same-origin?

Requirement: avoiding side-channel attacks

We do not want http://evil.gov to be able to find out whether you have been visiting http://amnesty.org by detecting how long it takes your browser to download http://amnesty.org/dictionary.

Mechanism

Most web browsers are moving towards a mechanism to isolate caches across origins to avoid such side-channel attacks. I believe that this means any attempt to use a dictionary for e.g. the same framework across all origins would be counter-productive.

This isolation mechanism should be necessary and sufficient for our needs.

Conclusion

No blocker, as long as we concentrate on site-specific dictionaries.

Open questions

  • Do we need to further require same origin for the dictionary and the HTML document?
  • How does this work for CDNs?

On security, there's an attack specific to compression where an attacker observes/infers the compressed length of a transfer and uses that to infer something about the content being transferred; see HEIST. With a dictionary the disclosure could be of content compressed with the dictionary, or the content of the dictionary itself.

One simple mitigation would be to use a credential-less fetch for the dictionary, which should discourage sites from putting sensitive strings in the dictionary.
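
For illustration, a credential-less fetch of the dictionary from a browser could look like this; the URL is hypothetical:

```ts
// Fetch the dictionary without cookies or other credentials, so a dictionary
// cannot usefully embed per-user secrets. The URL is hypothetical.
const response = await fetch("https://static.example.com/dict/site-v42.binjs-dict", {
  credentials: "omit",
});
const dictionary = new Uint8Array(await response.arrayBuffer());
```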

I think there are a few practical issues to consider:

The dictionary to use needs to be specified with the file. I think the best choice is to have an HTTP header indicating the URL to retrieve the dictionary from. This makes more sense to me than baking URLs into files, because it decouples producing files and dictionaries from deploying them. That lines up with the status quo on the web: I can get a minified threejs from GitHub and deploy it on my server without changing the content of the file. However dictionary loading ends up working, we should work out advice for developers about how to use push/prefetch/etc. correctly.
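
To make this concrete, here is a minimal sketch of such a header on the compressed response; the name `BinAST-Dictionary` is invented for illustration, and the media type is an assumption:

```ts
// Hypothetical response headers for a BinAST-compressed file: the dictionary
// URL travels in a header rather than being baked into the file itself, so
// files and dictionaries can be produced and deployed independently.
const headers = new Headers({
  "Content-Type": "application/javascript-binast", // assumed media type
  "BinAST-Dictionary": "https://static.example.com/dict/site-v42.binjs-dict", // invented header name
});
```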

Should files have one or multiple external dictionaries? I expect string change frequencies to follow some exponential distribution, so it might make sense to have a couple of dictionaries: one for very slowly changing strings, another for shared but recently changed strings, etc. From a format perspective this is straightforward. It would complicate how external dictionaries are indicated, though.

The developer needs an easy way to debug files being paired with the wrong dictionary. The format doesn't have any mechanism to check that it's being decompressed with the correct dictionary. I think this is separate from the security-sensitive problem of out-of-bounds dictionary indexing; rather, it's a developer ergonomics issue. If a BinAST file fails to decode as expected because it was paired with the wrong dictionary, it would be nice to give the developer a message on the devtools console. I think having a few bytes in the dictionary and a few bytes in the format which the browser matches should be sufficient. It is tempting to hash the dictionary, but this prevents scenarios like localisation and sending debug dictionaries with verbose error messages, and it complicates content-addressable stores which want to hash static resources in a batch.
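
A minimal sketch of the matching idea, assuming (purely for illustration) an 8-byte identifier at the start of the dictionary and a copy of it in the file's header:

```ts
const DICT_ID_LENGTH = 8; // illustrative width, not specified by the format

// The browser refuses the pairing (and can print a devtools message) when the
// identifier embedded in the file differs from the one leading the dictionary.
function dictionaryMatches(fileDictId: Uint8Array, dictionary: Uint8Array): boolean {
  if (fileDictId.length !== DICT_ID_LENGTH || dictionary.length < DICT_ID_LENGTH) {
    return false;
  }
  const dictId = dictionary.subarray(0, DICT_ID_LENGTH);
  return fileDictId.every((byte, i) => byte === dictId[i]);
}
```

Matching a short identifier rather than a full hash would leave room for swapping in a localised dictionary, or a verbose debug dictionary, that deliberately carries the same identifier.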

  • Do we need to further require same origin for the dictionary and the HTML document?

I don't think this would work well, because JavaScript is usually a static resource, and static resources are often served from CDNs or cookieless domains, which makes them cross-origin resources. Requiring the dictionary to be served from the same origin would complicate that kind of serving.

With the current context 0.1 format (reference implementation), we already achieve a compression level ~5% better than Brotli on Facebook, assuming a site-specific dictionary.

There are a lot of ways to measure performance when talking about shared dictionaries:

  • The size of a set of related resources, including the shared dictionary. This is where the 5% win number comes from for us. This is very relevant to developers because it's the practical cost of a set of related requests with an empty cache.
  • The size of a set of related resources, not including the shared dictionary. This is probably relevant to developers; it's the cost of fetching new resources while reusing the dictionary. For the resources I mentioned earlier, which are a 5% win for the first fetch, BinAST is a 10% win for the second fetch. (From this perspective, a 5% improvement is underselling the format a bit.)
  • The size of a file with its dictionary. This may be relevant to developers because it's an upper bound on the cost of a clean-cache request with no sharing. With the reference implementation, this is typically a regression.

Getting compression data from more sites would be good. Those measurements should use samples of JavaScript collected over time. Having data from a browser vendor about cache performance is a critical input to a model.

Getting compression data from more sites would be good. Those measurements should use samples of JavaScript collected over time.

Yes, that's one of the reasons I started https://github.com/Yoric/real-js-samples .

Having data from a browser vendor about cache performance is a critical input to a model.

What data do you need?

On security, there's an attack specific to compression where an attacker observes/infers the compressed length of a transfer and uses that to infer something about the content being transferred; see HEIST. With a dictionary the disclosure could be of content compressed with the dictionary, or the content of the dictionary itself.

(ok, I need to do some more reading about HEIST)

One simple mitigation would be to use a credential-less fetch for the dictionary, which should discourage sites from putting sensitive strings in the dictionary.

That strikes me as hard/impossible to enforce.

Edit: I had initially written "not hard". This was a typo; I meant the opposite.

I think there are a few practical issues to consider:

The dictionary to use needs to be specified with the file. I think the best choice is to have an HTTP header indicating the URL to retrieve the dictionary from.

Agreed.

However dictionary loading ends up working, we should work out advice for developers about how to use push/prefetch/etc. correctly.

Agreed.

Should files have one or multiple external dictionaries? I expect string change frequencies to follow some exponential distribution, so it might make sense to have a couple of dictionaries: one for very slowly changing strings, another for shared but recently changed strings, etc. From a format perspective this is straightforward. It would complicate how external dictionaries are indicated, though.

Agreed on both counts. Waiting for feedback from others.

The developer needs an easy way to debug files being paired with the wrong dictionary. The format doesn't have any mechanism to check that it's being decompressed with the correct dictionary. I think this is separate from the security-sensitive problem of out-of-bounds dictionary indexing; rather, it's a developer ergonomics issue. If a BinAST file fails to decode as expected because it was paired with the wrong dictionary, it would be nice to give the developer a message on the devtools console. I think having a few bytes in the dictionary and a few bytes in the format which the browser matches should be sufficient. It is tempting to hash the dictionary, but this prevents scenarios like localisation and sending debug dictionaries with verbose error messages, and it complicates content-addressable stores which want to hash static resources in a batch.

  • Agreed that we somehow need to pair dictionaries with files and display a correct error message.
  • Good point about localization. I assume that we don't want to force webdevs to use the same URL for all locales, so we can't use a URL for identification, either.
  • Do we need to further require same origin for the dictionary and the HTML document?

I don't think this would work well, because JavaScript is usually a static resource, and static resources are often served from CDNs or cookieless domains, which makes them cross-origin resources. Requiring the dictionary to be served from the same origin would complicate that kind of serving.

Agreed. Waiting for @RReverser's input on this point.

With the current context 0.1 format (reference implementation), we already achieve a compression level ~5% better than Brotli on Facebook, assuming a site-specific dictionary.

There are a lot of ways to measure performance when talking about shared dictionaries:

Ok, let's discuss this in another issue :)

On security, there's an attack specific to compression where an attacker observes/infers the compressed length of a transfer and uses that to infer something about the content being transferred; see HEIST. With a dictionary the disclosure could be of content compressed with the dictionary, or the content of the dictionary itself.

Ok, did some reading on HEIST. In general, I agree that we want to avoid storing confidential data in the dictionary, but I don't really see how to enforce this. This is pretty much equivalent to disallowing storing confidential data in JS code, right?

In the specific case of HEIST, wouldn't it be better protection if we added (or allowed adding) a random number of random bytes at the end of the dictionary?
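
A sketch of that padding idea, using Node's crypto module; the bound on the pad length is arbitrary:

```ts
import { randomBytes, randomInt } from "crypto";

const MAX_PAD = 256; // arbitrary illustrative bound

// Append a random number of random bytes so the dictionary's exact length
// (and hence its transfer size) is not a stable signal to an observer.
function padDictionary(dictionary: Buffer): Buffer {
  const padLength = randomInt(0, MAX_PAD + 1); // 0..=MAX_PAD; max is exclusive
  return Buffer.concat([dictionary, randomBytes(padLength)]);
}
```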

Ok, I just had a conversation with @martinthomson (Mozilla security) about HEIST. As far as I can tell, since we're not compressing substrings, BinAST itself should not create HEIST issues that do not already exist. What can create issues, on the other hand, is the brotli post-compression, if it is applied to a file that contains both user-controlled data and confidential data, whether that file is a/ a dictionary or b/ a compressed file.

a/ I have difficulty imagining webdevs creating a dictionary that contains either user-controlled data or confidential data, much less both.

b/ Similarly, I have difficulty imagining webdevs compressing a JS file that contains either user-controlled data or confidential data, much less both. On the other hand, once we have a fast encoder, it is quite possible that webdevs could use BinAST to compress a JSON file that contains both. While we could side-step the issue by refusing to compress JSON, that would probably just cause webdevs to hide the JSON as JS, which would be even worse.

A suggestion by @martinthomson would be to add encoder command-line flags to let the webdev specify whether the file contains user-controlled data and whether it contains confidential data. If both are specified, we may still encode with BinAST, but not with brotli. To discourage the webdev from applying brotli regardless, we may wish to move brotli compression inside the file.
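
A sketch of that decision rule; the option names below are hypothetical and only mirror the suggestion:

```ts
// Hypothetical encoder options mirroring the suggestion above: the webdev
// declares what the input contains, and brotli post-compression is skipped
// when user-controlled and confidential data are mixed in the same file.
interface EncodeOptions {
  userControlledData: boolean; // e.g. a hypothetical --user-controlled-data flag
  confidentialData: boolean;   // e.g. a hypothetical --confidential-data flag
}

function allowBrotliPostCompression(options: EncodeOptions): boolean {
  return !(options.userControlledData && options.confidentialData);
}
```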

Regardless, as you mention, @dominiccooney, we should make clear to webdevs that mixing user-controlled and confidential data is a bad idea.

Let's talk about seeding the dictionary (i.e. the first fetch).

Depending on network performance/contention, I believe that there may be two cases.

  1. In an ideal network, the server may want to send the dictionary and use it immediately.
  2. In an awful network, the server may want to send non-BinAST data and (later) send the dictionary for use in the next load.

Case 1 is fairly easy to specify. Case 2 is more complicated, as it may require additional push-style HTTP headers and/or something like an extension to <script async>, cache information, etc.

I believe that we should concentrate on case 1 for the moment.