immersive-web / proposals

Initial proposals for future Immersive Web work (see README)

Eye Tracking

msub2 opened this issue

With the release of the Quest Pro and eye tracking slowly becoming available to more and more users, perhaps it'd be worth looking into implementing support for eye tracking in WebXR. There's been some discussion in the past, such as in #25 and the latter comments of #70, but eye tracking was still restricted to a very small subset of headsets back then. There's an existing OpenXR Extension that appears to already expose what I would expect from an eye tracking API (namely, where the user's eyes are looking), so I imagine this could be integrated into an existing browser's OpenXR implementation. As mentioned in the past it would also be prudent to have this be a separate permission, similar to how hand tracking is treated currently.
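For illustration, a permission-gated feature request could mirror today's hand tracking pattern. This is only a sketch; the 'eye-tracking' feature string below is an assumption, not a standardized feature descriptor:

async function startSession() {
  // Hypothetical: gate eye tracking behind its own optional feature, so the UA
  // can prompt for it separately, the way 'hand-tracking' is handled today.
  const session = await navigator.xr.requestSession('immersive-vr', {
    requiredFeatures: ['local-floor'],
    optionalFeatures: ['hand-tracking', 'eye-tracking'], // 'eye-tracking' is assumed
  });
  // If the user declines the eye tracking prompt, the session still starts; the
  // app should feature-detect and fall back to head gaze or controller rays.
  return session;
}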

What are the use cases you have in mind for eye tracking?

Off the top of my head:

  • Social - Perhaps a bit of a minor use case, but as immersive social/multiplayer platforms become more prevalent on the web (Hubs, FRAME, VrLand, Webaverse, Hyperfy, etc) there is certain to be a subset of users who wish to enhance the embodiment of their avatars, and eye tracking helps add to that (along with facial tracking, but that's a separate proposal entirely)
  • Research - I can imagine having access to eye gaze data would be useful for carrying out research, perhaps in the context of UI/UX design to help gauge where a user might be looking the most, whether they're able to find something immediately or have to search for it, etc.
  • Interaction - Having access to eye gaze data would allow for some interesting forms of interaction, such as focusing on an immersive UI element and activating it simply by looking (or better yet, a HUD element overlaid on your field of vision); see the sketch after this list. Technically these things could be done with head gaze as well, but eye gaze would be quicker and require less effort on the part of the user, especially when interacting with multiple such objects in close proximity.
  • Rendering - I imagine having access to eye gaze data would also make it possible to perform dynamic foveated rendering in WebXR experiences, though I admit I'm not sure whether that responsibility would fall to the headset itself or the experience.
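To make the interaction case concrete: if the app did get a per-frame gaze pose (a big if, given the privacy discussion below), picking the focused object is an ordinary raycast, e.g. with three.js. The world-space gaze origin and direction here are assumed inputs, not an existing WebXR API:

import * as THREE from 'three';

const raycaster = new THREE.Raycaster();

// Assumed inputs: a world-space gaze origin (Vector3) and direction (Vector3),
// plus the list of interactable objects in the scene.
function pickFocusedObject(gazeOrigin, gazeDirection, interactables) {
  raycaster.set(gazeOrigin, gazeDirection.normalize());
  const hits = raycaster.intersectObjects(interactables, true);
  return hits.length > 0 ? hits[0].object : null;
}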

We've had previous discussions about eye tracking in the group, for example at the Feb 2020 f2f, and I think there were serious concerns about the privacy impact of exposing eye tracking data through a web API.

Wherever possible, I think it would be good to investigate alternatives where only the user agent gets access to eye tracking data. For example, in the interaction example, the UA could synthesize an XR (or DOM) input event once the user has activated an element, without exposing all the remaining eye movement data to the JS application. You've already mentioned that foveated rendering might be better handled by the UA, though even that may expose some information through a timing channel.
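As a sketch under assumptions: the UA owns the continuous eye-gaze stream and only synthesizes a select event, with a one-off target ray pose, when the user activates something (today's 'gaze' input sources are head-based and continuously pollable, so this exact behaviour is hypothetical). referenceSpace and handleGazeActivation are assumed app-side helpers:

session.addEventListener('select', (event) => {
  // The UA decides when activation happened; the page only sees the pose at
  // that instant rather than a raw per-frame gaze stream.
  const pose = event.frame.getPose(event.inputSource.targetRaySpace, referenceSpace);
  if (pose) {
    handleGazeActivation(pose.transform); // e.g. hit-test the scene once here
  }
});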

For the social aspect, it would be neat if the UA could apply a rotation to an eyeball object based on eye tracking data without exposing that data to the app, but that doesn't seem feasible for a multiuser application where such data needs to be shared between instances. (I guess we could pass around encrypted pose matrices that can only be decoded by the UA via a WebGL/WebGPU "encrypted uniform" extension, but that seems rather complicated.)

I can definitely understand the concerns about security and the desire to separate the raw movement data from actual input to the application. In your hypothetical though, I'm not sure how one would go about determining when to fire an input event if the framework (let's say something like three.js) doesn't know which object you're looking at to activate in the first place. It seems to me as though meaningful interaction with an arbitrary scene would require you to give it your actual gaze direction.

There are two types of eye tracking.
The first one returns the pose of the location that the user is focusing on. I'm unsure how private that information is since it doesn't reveal more than what you already get with controllers or hands. This falls into @msub2's use cases for research and interaction.

The second one returns the position and orientation of the eyes. This one seems much more sensitive (e.g., it could reveal the user's unique IPD), so it requires more mitigations. For instance, do we really need to know the exact orientation of the eyes or can we apply generous rounding? Do we need the position of the eyes at all? This would be for the social use case.
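As a rough illustration of the "generous rounding" idea: the UA could quantize eye yaw/pitch to a coarse angular grid before handing it to the page, so avatars can still be animated while the raw high-precision signal is blunted. The 5-degree step is an arbitrary assumption, not a proposed value:

const STEP_DEG = 5; // assumed quantization step

function quantizeAngle(degrees) {
  return Math.round(degrees / STEP_DEG) * STEP_DEG;
}

// Per-eye rotation reduced to a coarse grid before it ever reaches script.
function quantizeEyeRotation({ yawDeg, pitchDeg }) {
  return { yawDeg: quantizeAngle(yawDeg), pitchDeg: quantizeAngle(pitchDeg) };
}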

I agree that eye-tracked foveated rendering must be handled by the UA. I'm unsure how much information can be learned by timing the rendering.

If this is possible to use in native apps, where it is already a privacy concern and users need to accept a permission, what's different about supporting it on the Web?

The web has a higher bar than apps because apps have to comply with a more rigorous process before they are deployed to stores. You can reach any website through your browser but you can't install just any app.
Meta's Quest Pro supports eye and face tracking and we want to find a privacy-preserving way to expose this on the web. We need consensus in the group that this is a desired feature and then start drafting an explainer. We likely need separate repos for eye and face tracking.

You can reach any website through your browser but you can't install just any app.

I see, thanks for the explanation. As PWAs are now allowed on the store, would this be a possible compromise for Web-based applications to access these sensors?

As far as I know, PWAs have to declare their permissions in the manifest, but they still have to request them from the user. I suspect the requirements will be the same as for regular websites.

Yup, I agree they should definitely request permission. I'm looking for ways for PWAs to use these APIs on a level with native apps, and if the limiting factor is being an app accepted via a store, then maybe PWAs available on the store could provide access to the APIs?

Sorry if I'm misunderstanding something. I think eye and facial tracking are very interesting social features and as a Web developer, I would love to make use of them if possible

I wrote down a very basic spec on how eye and face tracking can be implemented: https://cabanier.github.io/webxr-face-tracking-1/
Here's the README. Comments welcome :-)

@cabanier From the readme:

This technology will NOT:

  • [...]
  • give precise information where the user is looking

[...] This API will define an extensive set of expressions and will, on a per-frame basis, report which ones were detected and how strong they are [...]

enum XRExpression {
  [...]
  "eyes_closed_left",
  "eyes_closed_right",
  "eyes_look_down_left",
  "eyes_look_down_right",
  "eyes_look_left_left",
  "eyes_look_left_right",
  "eyes_look_right_left",
  "eyes_look_right_right",
  "eyes_look_up_left",
  "eyes_look_up_right",
  [...]
}

[...] the user agent must ask the user for their permission when a session is asking for this feature (much like WebXR and WebXR hand tracking). In addition, sensitive values such as eye position must be rounded.
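For illustration, consuming those per-frame expression strengths might look roughly like the sketch below. The access point (frame.expressions) and the 0..1 strength range are assumptions made for this example, not the draft's actual API surface:

function onXRFrame(time, frame) {
  const expressions = frame.expressions; // assumed: Map of XRExpression -> strength
  if (expressions) {
    const blinkLeft = expressions.get('eyes_closed_left') ?? 0;
    const blinkRight = expressions.get('eyes_closed_right') ?? 0;
    avatar.setEyeBlink(blinkLeft, blinkRight); // assumed avatar helper
  }
  session.requestAnimationFrame(onXRFrame);
}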


Some questions that I think might be relevant to the design here:

  • How common would it be for an app to ask for low-precision face tracking, but not ask for body tracking?
    • How much personal info/entropy does body tracking expose relative to the low-precision face tracking? Seems like it'd be a lot, given that we all stand in slightly different positions, have different idiosyncratic movements, etc. -- and most importantly, trackers can accumulate more information over time (versus something like IPD, which is just a single one-off value).
  • Similar question for voice/microphone and high-precision hand tracking. I'd guess that voice will almost immediately ~uniquely identify you, but even with a voice changer, my guess is that a few minutes of talking would give away a huge amount of information based on the words you say and the particular way you say them.

Multiplayer WebXR experiences are, I think, going to be increasingly "high-bandwidth-between-users" applications (kind of like a full-body video call with a mask), so I think it'll be inherently quite hard to make the 'average' multiplayer WebXR experience 'private' - at least if you're up against an organisation that professionally tracks users.

So in general, what I wonder is how useful, and how widely used by devs, a low-precision API will be, given that it still requires a permission request, and given all the other info (voice, body, etc.) that the user is likely already giving (which would pretty easily ~uniquely identify them, and so reduces reluctance to grant further permissions).

It seems like the aim here is to have a commonly-used low-precision API, and then (I'm guessing) a higher-precision API for certain situations, but it seems like most applications will want the higher-precision API. I'm trying to think of a common use case where the dev would request the low-precision API.

All of that said, it is obviously a good idea to give devs/users the ability to only request/give exactly the amount of information they need, and no more, so I think a low-precision API is a good idea in that sense. I'm just wondering how things will actually play out here if there's a high-precision API, and if high precision is often needed, or if there is other info (like voice) which already makes privacy preservation futile in almost all cases where face tracking is requested by the dev.

How common would it be for an app to ask for low-precision face tracking, but not ask for body tracking?

I don't have data on that. Body tracking does give a lot more information away because it returns the positions of the user's body. This makes it a lot more sensitive, which is why I have not proposed it.

Similar question for voice/microphone and high-precision hand tracking. I'd guess that voice will almost immediately ~uniquely identify you, but even with a voice changer, my guess is that a few minutes of talking would give away a huge amount of information based on the words you say and the particular way you say them.

I agree that microphone already gives up a lot of privacy. I guess the browser implementors felt that they had to add it because it was such a strong use case. (Same for camera access)

All of that said, it is obviously a good idea to give devs/users the ability to only request/give exactly the amount of information they need, and no more, so I think a low-precision API is a good idea in that sense. I'm just wondering how things will actually play out here if there's a high-precision API, and if high precision is often needed, or if there is other info (like voice) which already makes privacy preservation futile in almost all cases where face tracking is requested by the dev.

I think you answered your own question: a web API should only report the minimum needed for what it is designed for.
This API is designed to animate a person's avatar, so things like eye tracking don't need to be super precise in space and time.
Making those optionally high precision and giving the user a choice would be very confusing.

It might be a good time to revisit this. Today Safari for Apple Vision Pro is shipping the transient-pointer API, which allows retrieving gaze information on a pinch gesture. WebXR / WebGL applications can implement selection by gaze but not hover effects on UI elements, since gaze info is only available on a pinch.
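For reference, handling that transient-pointer input looks roughly like the following; referenceSpace, pickFocusedObject, and beginInteraction are assumed app-side helpers:

session.addEventListener('selectstart', (event) => {
  const source = event.inputSource;
  if (source.targetRayMode !== 'transient-pointer') return;

  // The target ray points where the user was looking at the moment of the
  // pinch; no continuous gaze data is available for hover effects.
  const pose = event.frame.getPose(source.targetRaySpace, referenceSpace);
  if (pose) {
    const hit = pickFocusedObject(pose.transform);
    if (hit) beginInteraction(hit);
  }
});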

There's an interactive-regions proposal to let applications define regions that the browser / OS is in charge of highlighting when the user's gaze hovers over them. This way, the gaze information is not exposed to the page. One could implement 2D UIs, but at the expense of how flexibly they can be integrated visually into the application, since the OS, not the application, is in charge of rendering them in a separate layer. An additional challenge would be interacting with objects / geometries, since highlighting those requires custom shaders that are application specific.
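To make the trade-off concrete, a purely hypothetical shape for such an API might look like the following; none of these names exist in any spec, they are only assumptions to illustrate the idea:

// Hypothetical: the page declares a region once, the browser / OS draws the
// hover highlight itself, and raw gaze data never reaches script.
const panelRegion = {
  space: uiPanelSpace, // assumed XRSpace anchored to a UI panel
  shape: { type: 'quad', width: 0.4, height: 0.25 }, // metres, assumed
};
session.addInteractiveRegion?.(panelRegion); // assumed registration call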

A different route, described in this issue, would be exposing eye tracking information to the page. This would be the simplest and most flexible API but poses additional privacy concerns. I wonder if those can be mitigated.

It's likely that in the next 1-2 years eye tracking + gaze will be the common input on all consumer headsets. We will need to converge on a solution that enables cross-platform UIs for WebXR applications.

@dmarcos do you want to discuss this at the face to face next week? (March 25-26)

@cabanier thanks. maybe. where is it?

Meta offices in Bellevue. You can also call in if you don't want to fly

Tagging for /facetoface: Incorporating gaze into WebXR experiences

/facetoface This was missed last week; we can discuss it in the unconference time at the end of the day.