immersive-web / webxr-ar-module

Repository for the WebXR Augmented Reality Module

Home Page: https://immersive-web.github.io/webxr-ar-module

Add an `inline-ar` session for handheld AR

tangobravo opened this issue

I have argued in #77 that whilst the current full-screen immersive-ar implementations on mobile platforms are a useful way of providing device-agnostic content, it is not sufficient for many real-world handheld AR applications.

An alternative API (as proposed in #78) that more closely maps to the native ARCore / ARKit implementations and provides a stream of camera frames with pose metadata should be able to cover all the requirements for any handheld AR experience.

However, one benefit of the immersive-ar session type is that the browser takes responsibility for compositing the camera frame, which means the site never needs access to the raw camera data and users can grant more limited permissions.

For the subset of handheld AR experiences that don't need full camera access but still want more control over the presentation of the content, I propose introducing an inline-ar session type to the spec.

This would wrap an in-DOM WebGL canvas element in much the same way as the current inline session. Compositing of the camera frame would be handled by the browser and the camera frame data would not be accessible to the site, so the same permission treatment as immersive-ar sessions could be used.
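
As a rough illustration, setting one up could look almost identical to the existing inline path. This is a minimal sketch, assuming the proposed `inline-ar` enum value (which doesn't exist in any spec or implementation today) and the existing `XRWebGLLayer` plumbing:

```js
// Sketch only: 'inline-ar' is the session type proposed in this issue,
// not part of the current spec or any implementation.
async function startInlineAR() {
  const canvas = document.querySelector('#ar-canvas');
  const gl = canvas.getContext('webgl', { xrCompatible: true });

  if (!(await navigator.xr.isSessionSupported('inline-ar'))) return;

  // Same permission treatment as immersive-ar: the browser composites the
  // camera frame beneath the canvas; the page never sees the pixels.
  const session = await navigator.xr.requestSession('inline-ar', {
    requiredFeatures: ['local'],
  });
  session.updateRenderState({ baseLayer: new XRWebGLLayer(session, gl) });
  const refSpace = await session.requestReferenceSpace('local');

  session.requestAnimationFrame(function onXRFrame(time, frame) {
    const pose = frame.getViewerPose(refSpace);
    // ...draw content into the canvas's default framebuffer using pose...
    session.requestAnimationFrame(onXRFrame);
  });
}
```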

Implementation-wise, I imagine the browser's page compositor rendering the camera frame to a layer directly underneath the canvas layer containing the content, but conceptually "attached" to it (i.e. sharing CSS transforms, and potentially even effects like CSS blur).

The XRSession's rAF loop would run at the camera rate, and its callbacks would be specified to be delivered to the page ahead of window.requestAnimationFrame callbacks. That allows content to decide whether it wants to render updates at the screen rate or the camera rate, while ensuring the pose data is always synchronized with the camera frame that will be rendered.

An example sequence of callbacks that might occur with a 30 FPS camera feed on a 60 FPS screen would be:

```
Screen Refresh 1: xrSession_rAFcb
Screen Refresh 1: window_rAFcb
Screen Refresh 2: window_rAFcb
Screen Refresh 3: xrSession_rAFcb
Screen Refresh 3: window_rAFcb
Screen Refresh 4: window_rAFcb
```

Any WebGL commands in any of those callbacks that affect the default framebuffer would trigger the browser compositor to update the canvas in the page, layering the results of that WebGL command stream on top of the camera frame passed to the most recent xrSession_rAFcb.
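
To make that concrete, here's a sketch of how a page might choose between the two rates under this proposal. `drawScene` is a hypothetical render helper, and `session` / `refSpace` come from session setup as above:

```js
let latestPose = null;

// Fires at the camera rate (e.g. 30 FPS), before the window rAF callback
// of the same screen refresh.
function onXRFrame(time, frame) {
  latestPose = frame.getViewerPose(refSpace);
  // Option A: draw here to update content only when the camera updates.
  drawScene(latestPose);
  session.requestAnimationFrame(onXRFrame);
}

// Fires at the screen rate (e.g. 60 FPS).
function onWindowFrame(time) {
  // Option B: draw here instead to animate at the screen rate; latestPose
  // is guaranteed to match the camera frame the browser will composite
  // beneath the canvas.
  window.requestAnimationFrame(onWindowFrame);
}

session.requestAnimationFrame(onXRFrame);
window.requestAnimationFrame(onWindowFrame);
```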

Conceptually this feels like a natural extension to the existing WebXR spec; immersive-ar gives something that works everywhere and is presented outside of the DOM, but it's a relatively small change if a site wants instead to leverage inline-ar on handheld devices and retain control over presentation (for example, using CSS to limit the aspect ratio of the canvas). I'd expect everything that works in an immersive-ar session to also work in inline-ar, including transient inputs and hit tests.
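
For example, under that assumption the hit-test flow from the WebXR Hit Test module would be exactly the same code as in an immersive-ar session, provided 'hit-test' was requested as a session feature:

```js
// Sketch: unchanged from the immersive-ar case, assuming the session was
// created with requiredFeatures: ['hit-test'].
async function trackSurfaces(session, refSpace) {
  const viewerSpace = await session.requestReferenceSpace('viewer');
  const hitTestSource = await session.requestHitTestSource({ space: viewerSpace });

  session.requestAnimationFrame(function onXRFrame(time, frame) {
    const results = frame.getHitTestResults(hitTestSource);
    if (results.length > 0) {
      const hitPose = results[0].getPose(refSpace);
      // ...place or update content at hitPose.transform...
    }
    session.requestAnimationFrame(onXRFrame);
  });
}
```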

My guess is that implementing this would be more complex than the camera-ar proposal in #78, which would also allow presenting content inside the DOM while covering the full set of use cases considered for the raw-camera-access specification.

This inline-ar approach has two main benefits: more complete support for the XRInput system (handy for device-agnostic content that may also be displayed in immersive-ar on headsets) and a reduction in the required permissions.

On the permissions, it's important to note that limiting what is granted to inline-ar (or immersive-ar) sessions isn't really about protecting users from sites with evil intentions: those sites can still obtain full camera access via `getUserMedia` or whatever form raw-camera-access eventually takes. It's also likely that any sort of 6-DoF tracking will still involve a permission prompt, as WebXR sessions necessarily expose some data that could be used for fingerprinting.

The reduction in permissions allows a well-intentioned site to ask for no more access than it requires, and therefore to guarantee to the user that it cannot do evil things with the camera data. A suitably informed user who understands the subtle distinction between the permission requests may make different choices for an untrusted site requesting only tracking data versus full camera access. In practice I wonder how easy it really is to communicate that distinction to users, and how many would make a different choice.

Of course I still very much support a privacy-focussed model where sites don't get access to more data than they need. When it comes to handheld AR though I suspect very many applications will at some point require camera access, even if only to allow easy capture and sharing via the Web Share API.

@toji, at the F2F you shared some historical background; please correct me if I'm wrong here:

  • The Chromium team initially implemented an inline-ar session type that rendered into a canvas in the DOM without a full-screen mode switch
  • The UX team felt a camera feed appearing within a page was jarring for users and would lead to privacy questions, and preferred the full-screen mode switch

Assuming my memory on that is right, the thing I don't really understand is why a full-screen mode switch makes this any less jarring. Any logic around permission prompts or requiring user gestures to start the session could be the same between inline and full-screen session types, so surely it's strictly more jarring to also switch to full-screen mode? As well as the camera appearing and raising privacy questions (a concern shared by inline and full-screen), all the browser UI (URL bar etc.) and system UI (both nav and status bars) also disappears.

I had seen historical references to Chromium's inline-ar session, and I think in that initial implementation it was possible to access the camera with `gl.readPixels` (as the camera frame was simply rendered automatically into the WebGL framebuffer). So before the F2F I had imagined it was easier to implement privacy-preserving browser-side composition in a full-screen view than within the DOM, and that this was the main reason for the change.