google / sxg-rs

A set of tools for generating signed exchanges at serve time.

Add User-Agent sniffing

twifkak opened this issue · comments

Chromium M73 enabled SXG, but until this commit landed in M79, it sent an Accept header that matches the PrefersSxg logic, which is meant to differentiate SXG crawlers from SXG browsers, per this recommendation.

Without User-Agent sniffing to rule out Chromium M73-78, sxg-rs may produce broken experiences in those browsers on sites that use a service worker caching strategy, because Chromium does not interpret an SXG served from a service worker. Those browsers represent ~0.19% of users.

(This doesn't happen on Cloudflare Automatic Signed Exchanges because they've implemented this UA sniffing downstream.)

This should be a "denylist" approach -- allowing all user agents except those that match Chromium 73-78.

I found a few UA parser crates on crates.io, in descending popularity:

We should rule out the huge ones, since we have serverless bundle size constraints.

Perhaps a regex will suffice for our needs? Famous last words.

I think a small heuristic without regex should be the way to go. For example, find `Chrome/` or `Chromium/`, then parse the next two digits and check whether the value falls in the [73, 78] range.

Perhaps I'm a bit too forward-looking but there may eventually be a Chrome 730. 😉

Are you concerned around new dependencies because of bundle size, security, performance, or something else?

In general, I prefer accurate parsers because inaccurate ones lead to weird ecosystem effects:

  1. sites begin to target bugs in the parsers, either deliberately (e.g. the IE star hack) or accidentally, and then
  2. changes to the standard need to keep in mind backwards-compatibility with these hacks, and then
  3. the semantics get progressively more complex (e.g. HTML adoption agency algorithm and others leading to cases where you can construct invalid DOMs)

That said, I'm also a pragmatist. In particular, sxg-rs is a drop in the bucket -- it probably won't materially affect the destiny of User-Agent. 😁 So if a small heuristic is preferable, that's fine.

user-agent-parser, uap-rust, and fast_uaparser all internally use the regular expressions defined in uap-core.

Picking just the Chrome-related regex from uap-core suffices to parse the Chrome major version.

However, adding regex crate as dependency increases the gzipped WASM size from 737 kB to 954 kB.

Oh, thanks for the investigation @antiphoton. I overestimated the degree to which UA strings can be accurately parsed. I'll withdraw my earlier comment (except to be mindful of more than two digits).

Maybe something like " Chrome/" or " Chromium/" (note the leading space) followed by any number of digits? No need for regex then, I suppose? std::str or nom will probably suffice.
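The substring-based denylist check described above could be sketched roughly as follows, using only std. The function name is hypothetical, and parsing all leading digits (rather than exactly two) sidesteps the "Chrome 730" concern raised earlier:

```rust
/// Hypothetical helper: returns true if the User-Agent appears to be
/// Chromium major version 73-78, which advertised SXG support in its
/// Accept header but could not load an SXG served from a service worker.
fn is_chromium_73_to_78(user_agent: &str) -> bool {
    // Leading space rules out tokens like "HeadlessChrome/" only if they
    // embed "Chrome/" without a preceding space; adjust as needed.
    for token in [" Chrome/", " Chromium/"] {
        if let Some(pos) = user_agent.find(token) {
            let rest = &user_agent[pos + token.len()..];
            // Take every leading digit, so "Chrome/730" parses as 730, not 73.
            let digits: String =
                rest.chars().take_while(|c| c.is_ascii_digit()).collect();
            if let Ok(major) = digits.parse::<u32>() {
                return (73..=78).contains(&major);
            }
        }
    }
    false
}

fn main() {
    let chrome75 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 \
                    (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
    let chrome96 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 \
                    (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36";
    assert!(is_chromium_73_to_78(chrome75));
    assert!(!is_chromium_73_to_78(chrome96));
    // A hypothetical future "Chrome/730" must not match the 73-78 range.
    assert!(!is_chromium_73_to_78("Mozilla/5.0 Chrome/730.0.0.0 Safari/537.36"));
}
```

This keeps the denylist approach from the issue description: every user agent is allowed through except the ones that match Chromium 73-78, and any parse failure falls through to `false` (allowed).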