google / sxg-rs

A set of tools for generating signed exchanges at serve time.

Add User-Agent sniffing

twifkak opened this issue · comments

Chromium M73 enabled SXG, but until this commit landed in M79, it sent an Accept header that matches the PrefersSxg logic, which is meant to differentiate SXG crawlers from SXG browsers, per this recommendation.

Without User-Agent sniffing to rule out Chromium M73-78, sxg-rs may produce broken experiences in those browsers on sites that use a service worker caching strategy, because Chromium does not interpret an SXG served from a service worker. Those browsers represent ~0.19% of users.

(This doesn't happen on Cloudflare Automatic Signed Exchanges because they've implemented this UA sniffing downstream.)

This should be a "denylist" approach -- allowing all user agents except those that match Chromium 73-78.

I found a few UA parser crates on crates.io, in descending popularity:

We should rule out the huge ones, since we have serverless bundle size constraints.

Perhaps a regex will suffice for our needs? Famous last words.

I think a small heuristic without regex should be the way to go. For example, find `Chrome/` or `Chromium/`, then parse the next two digits and check whether the value falls in the [73, 78] range.

Perhaps I'm a bit too forward-looking but there may eventually be a Chrome 730. 😉

Are you concerned around new dependencies because of bundle size, security, performance, or something else?

In general, I prefer accurate parsers because inaccurate ones lead to weird ecosystem effects:

  1. sites begin to target bugs in the parsers, either deliberately (e.g. the IE star hack) or accidentally, and then
  2. changes to the standard need to keep in mind backwards-compatibility with these hacks, and then
  3. the semantics get progressively more complex (e.g. HTML adoption agency algorithm and others leading to cases where you can construct invalid DOMs)

That said, I'm also a pragmatist. In particular, sxg-rs is a drop in the bucket -- it probably won't materially affect the destiny of User-Agent. 😁 So if a small heuristic is preferable, that's fine.

user-agent-parser, uap-rust, and fast_uaparser all internally use the regular expressions defined in uap-core.

Picking just the Chrome-related regex from uap-core suffices to parse the Chrome major version.

However, adding regex crate as dependency increases the gzipped WASM size from 737 kB to 954 kB.

Oh, thanks for the investigation @antiphoton. I overestimated the degree to which UA strings can be accurately parsed. I'll withdraw my earlier comment (except to be mindful of more than two digits).

Maybe something like " Chrome/" or " Chromium/" (note the leading space) followed by any number of digits? No need for regex then, I suppose? std::str or nom will probably suffice.
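The substring-based denylist check described above could be sketched roughly as follows, using only std. The function name is hypothetical, and parsing all leading digits (rather than exactly two) sidesteps the "Chrome 730" concern raised earlier:

```rust
/// Hypothetical helper: returns true if the User-Agent appears to be
/// Chromium major version 73-78, which advertised SXG support in its
/// Accept header but could not load an SXG served from a service worker.
fn is_chromium_73_to_78(user_agent: &str) -> bool {
    // Leading space rules out tokens like "HeadlessChrome/" only if they
    // embed "Chrome/" without a preceding space; adjust as needed.
    for token in [" Chrome/", " Chromium/"] {
        if let Some(pos) = user_agent.find(token) {
            let rest = &user_agent[pos + token.len()..];
            // Take every leading digit, so "Chrome/730" parses as 730, not 73.
            let digits: String =
                rest.chars().take_while(|c| c.is_ascii_digit()).collect();
            if let Ok(major) = digits.parse::<u32>() {
                return (73..=78).contains(&major);
            }
        }
    }
    false
}

fn main() {
    let chrome75 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 \
                    (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36";
    let chrome96 = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 \
                    (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36";
    assert!(is_chromium_73_to_78(chrome75));
    assert!(!is_chromium_73_to_78(chrome96));
    // A hypothetical future "Chrome/730" must not match the 73-78 range.
    assert!(!is_chromium_73_to_78("Mozilla/5.0 Chrome/730.0.0.0 Safari/537.36"));
}
```

This keeps the denylist approach from the issue description: every user agent is allowed through except the ones that match Chromium 73-78, and any parse failure falls through to `false` (allowed).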