immersive-web / depth-sensing

Specification: https://immersive-web.github.io/depth-sensing/
Explainer: https://github.com/immersive-web/depth-sensing/blob/main/explainer.md


Potentially incorrect wording in the specification

bialpio opened this issue

When going over the spec for issue #43, I realized that we may have a mismatch between what the specification says and what we do in our ARCore-backed implementation in Chrome.

Namely, the spec says that in the buffer that we return, "each entry corresponding to distance from the view's near plane to the users' environment".

ARCore's documentation seems to have a conflicting phrasing:

  1. In ArFrame_acquireDepthImage(), we have "Each pixel contains the distance in millimeters to the camera plane".
  2. In the Developer Guide, we have "Given point A on the observed real-world geometry and a 2D point a representing the same point in the depth image, the value given by the Depth API at a is equal to the length of CA projected onto the principal axis" (where C is the camera; see the sketch below for what that projection amounts to).
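
To make 2) concrete: "the length of CA projected onto the principal axis" is the eye-space depth of the point, i.e. the component of the camera-to-point vector along the camera's viewing direction, not the Euclidean distance |CA|. A minimal sketch (all names here are illustrative, not taken from any API in this thread):

```ts
// Sketch only: eye-space depth of point A as seen from camera C.
type Vec3 = [number, number, number];

const sub = (a: Vec3, b: Vec3): Vec3 => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const dot = (a: Vec3, b: Vec3): number => a[0] * b[0] + a[1] * b[1] + a[2] * b[2];

// principalAxis is assumed to be a unit vector along the camera's viewing direction.
function depthAlongPrincipalAxis(cameraPos: Vec3, principalAxis: Vec3, pointA: Vec3): number {
  // Component of (A - C) along the principal axis: eye-space depth, not the length of CA itself.
  return dot(sub(pointA, cameraPos), principalAxis);
}
```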

If ARCore returns data according to 1), then I think it'd be acceptable to leave the spec text as-is, but then our implementation may not be correct (namely, I think we may be running into the same issue that forces @cabanier to expose at the very least the near plane distance that ARCore uses internally?).

If ARCore returns data according to 2), then the values in the buffer we return are not going to depend on the near plane. In this case, we are not compliant with the spec (we don't return the distance from the near plane to the user's environment), and the only way to become compliant would be to adjust each entry in the buffer - this may be expensive given that it'll happen on the CPU. IMO the best way to fix this would be to change the spec prose here, but I think this may be considered a breaking change, so we'll need to discuss how to move forward.
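
For a sense of what that per-entry adjustment would look like, here is a hypothetical sketch. It assumes the incoming buffer holds eye-space depth in millimeters (the format ARCore's depth image uses) and that "distance from the view's near plane" simply means the eye-space depth minus the near-plane distance, clamped at zero; the function name and signature are made up for illustration:

```ts
// Hypothetical sketch, not the Chrome implementation: shift every entry from
// "eye-space depth from the camera" to "distance from the view's near plane".
function shiftDepthToNearPlane(depthMm: Uint16Array, nearPlaneMeters: number): Uint16Array {
  const nearMm = Math.round(nearPlaneMeters * 1000);
  const adjusted = new Uint16Array(depthMm.length);
  for (let i = 0; i < depthMm.length; i++) {
    // One pass over the whole buffer, on the CPU, every time fresh depth data arrives.
    adjusted[i] = Math.max(0, depthMm[i] - nearMm);
  }
  return adjusted;
}
```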

I'm going to try to confirm with the ARCore team what their actual behavior is; I'm not sure this issue is actionable until that happens.

I was told that the OpenXR API returned near and far plane as well as fov to make sure that the code that calculates the scene will use the same matrices as the system code that calculates the depth texture.
The values in the buffer can be used directly by the shader. I think that means that this matches with point 2.

Are you sure that you need to do the adjustment in that case? Are you adding the near plane distance in your shaders?

I was told that the OpenXR API returned near and far plane as well as fov to make sure that the code that calculates the scene will use the same matrices as the system code that calculates the depth texture.

Speaking of OpenXR, can you point me to the API or extension in OpenXR that you use for this?

The values in the buffer can be used directly by the shader. I think that means that this matches with point 2.

If the values in the buffer can be used directly by the shader for occlusion, then I'm 99% sure that they match point 1. It'd mean that they already went through some projection matrix (and yes, if you don't know the near, far, & FOV of that matrix, there's not much you can do with the data), which means they are going to be normalized to the range [0, 1], where 0 means that the user's environment is at the camera's near plane (or closer?) and 1 means that it is at the camera's far plane (or further?) - i.e. this is equivalent to "entries are the distance from the camera's near plane to the environment, in unspecified units".
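
For reference, under the assumption of a standard perspective projection with the depth range mapped to [0, 1], recovering a metric eye-space distance from such a normalized value requires both near and far, e.g.:

```ts
// Assumes a standard perspective projection with depth mapped to [0, 1]:
// z = 0 at the near plane, z = 1 at the far plane.
function eyeDepthFromNormalized(z: number, near: number, far: number): number {
  // Gives `near` for z = 0 and `far` for z = 1; unusable without knowing near & far.
  return (far * near) / (far - z * (far - near));
}
```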

If the data were returned according to point 2, then the system's near & far are not needed - you have data in some physical units, in eye space ("distance from the camera to the user's environment"), and you can use it in the shader for occlusion if you transform it by your own projection matrix first (pick near & far in whatever way works for you, just make sure the FOV matches).
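
The opposite direction, sketched under the same assumption of a standard perspective projection with depth mapped to [0, 1]: given a metric eye-space distance, run it through a projection of your own choosing to get a value comparable against your depth buffer (in practice this happens per-fragment in a shader, with near/far/FOV matching your rendering matrices; the function below is just the scalar math):

```ts
// Inverse of the conversion above, under the same projection assumptions:
// turn a metric eye-space distance into a [0, 1] depth value for comparison.
function normalizedDepthFromEyeDepth(dEye: number, near: number, far: number): number {
  // Gives 0 for dEye = near and 1 for dEye = far.
  return (far / (far - near)) * (1 - near / dEye);
}
```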

Are you sure that you need to do the adjustment in that case? Are you adding the near plane distance in your shaders?

I'm quite certain that if it's "distance from the camera to the environment" (option 2) and the spec says it should be "distance from view's near plane to the environment" & we decide not to change the spec, then an adjustment over the entire buffer is going to be needed, simply because those 2 things aren't the same.

I was told that the OpenXR API returned near and far plane as well as fov to make sure that the code that calculates the scene will use the same matrices as the system code that calculates the depth texture.

Speaking of OpenXR, can you point me to the API or extension in OpenXR that you use for this?

I didn't find it on the Khronos site but it is listed on ours: https://developer.oculus.com/documentation/native/android/mobile-depth/

I have confirmed that ARCore returns the depth data according to pt.2.

Which means that we need to decide how we want to make progress here. It seems that we have 2 systems returning data in 2 different ways, and we'd like to not mandate anything that'd incur large costs on the implementers (e.g. performance impact of mandating adjusting the data in some manner). We also need to thread the needle carefully if we do not want to make a breaking change.

At a minimum, I think the description of the data contained by XRCPUDepthInformation should be changed to match the reality of what is currently returned (we'd also need an additional subsection on interpreting the results). This'd mean that XRWebGLDepthInformation will return different data compared to it - a potential trap for users, I think. I could maybe explain it away by saying that if you care about depth on the CPU, you probably want it for physics, and if you care about it on the GPU, it is probably for occlusion - this way, a difference in the data becomes more acceptable, and we may not need further changes to the spec (except we'd still need to solve #43).
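
For context, a rough sketch of the two consumption paths that framing implies (CPU-side queries for physics, a texture for occlusion on the GPU), based on the API shape at the time of this discussion; types are left as `any` and the surrounding setup is omitted:

```ts
// Rough sketch; `frame`, `view`, and `glBinding` are assumed to come from the usual
// WebXR frame loop and are typed as `any` to keep the snippet self-contained.
function sampleBothPaths(frame: any, view: any, glBinding: any) {
  // CPU path (XRCPUDepthInformation): per-point queries in meters, e.g. for physics.
  const cpuDepth = frame.getDepthInformation(view);
  if (cpuDepth) {
    const metersAtCenter = cpuDepth.getDepthInMeters(0.5, 0.5);
    console.log('depth at view center:', metersAtCenter);
  }

  // GPU path (XRWebGLDepthInformation): a texture sampled in the occlusion shader,
  // so the data never needs to be read back or reinterpreted on the CPU.
  const gpuDepth = glBinding.getDepthInformation(view);
  if (gpuDepth) {
    const depthTexture = gpuDepth.texture;
    // ...bind depthTexture and compare against fragment depth in the shader.
  }
}
```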

@cabanier, @toji - do you have any early thoughts here?

Hm... this is tricky. I certainly don't want to break anyone, but I also question how many apps are already making use of these values. I think it's likely to increase with the Quest adding this functionality, but it's being exposed there in a fairly different manner, so I think we have the opportunity to make some changes to how the data is interpreted now, as anyone who's interested in expanding their existing app's compatibility will have to update their usage regardless.

I'm also reluctant to enforce data transformation to a specific space. As @bialpio points out, there are probably different spaces that make sense for different use cases, and if we're pushing for the system to normalize we could end up just forcing devs to undo a spec-mandated transformation because we chose the "wrong" space for their use case.

In other words, if there are going to be transformations anyway, let's leave them in the hands of the person who knows best what's needed: the developer.

I feel like adding the data from #43 is the ultimate solution, because then there's no ambiguity about what the range is and different systems can conform to any requirements imposed on them by their hardware/platform. Measuring from the camera? depthNear = 0. Measuring from the projection near plane? depthNear = nearPlane. Having a far plane in place as well will be helpful for devs in terms of doing the math to shift the values as needed. (Maybe we want to say that the far plane can't be infinity? Not sure if that would make the math harder for devs or not.)
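
One way to read that proposal, as a sketch (assuming a simple additive model; the exact formula would be whatever the spec ends up mandating):

```ts
// Sketch: with depthNear exposed, "distance from the camera" can be recovered
// uniformly regardless of where the platform measures from.
function distanceFromCameraMeters(rawValue: number, rawValueToMeters: number, depthNear: number): number {
  // depthNear === 0         -> values were already measured from the camera
  // depthNear === nearPlane -> values were measured from the projection near plane
  return depthNear + rawValue * rawValueToMeters;
}
```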

I'm also reluctant to enforce data transformation to a specific space. As @bialpio points out, there are probably different spaces that make sense for different use cases, and if we're pushing for the system to normalize we could end up just forcing devs to undo a spec-mandated transformation because we chose the "wrong" space for their use case.

I agree that we don't want to tie this to a space. Quest is returning depth far/near so the author can feed them back into WebXR, not so they can interpret the values of the depth buffer differently.
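
Under that reading, "feed them back into WebXR" could be as simple as applying the reported values to the session's render state so the app's projection matrices match the ones used to produce the depth texture; a sketch, with reportedNear/reportedFar standing in for however the platform surfaces those values:

```ts
// Sketch: keep the session's projection matrices in sync with the depth data's range.
function syncDepthRange(session: any, reportedNear: number, reportedFar: number) {
  session.updateRenderState({
    depthNear: reportedNear,
    depthFar: reportedFar,
  });
}
```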

Having a far plane in place as well will be helpful for devs in terms of doing the math to shift the values as needed. (Maybe we want to say that the far plane can't be infinity? Not sure if that would make the math harder for devs or not.)

That won't work for Quest because afaik it will always report infinity for the far plane.

Sorry, I didn't mean to suggest "space" in the proper WebXR sense here. I meant "depth range".

That won't work for Quest because afaik it will always report infinity for the far plane.

Noted. So I guess we'd have to at least enable the possibility of a far plane at infinity, unless we're really confident that's what all implementations are going to do.

/facetoface to chat about the best way to move this issue forward.

I think it might be too late to add it to the agenda at this point but we will take a look Monday morning to see how we can fit it in. Please remind me on Monday.

Discussed during the F2F. Conclusions:

  • we need to expose the depthNear - this will be a breaking change (not in the API shape itself, but in the way it is used)
  • we don't need to take into account non-linearities when exposing normalized data - if we ever get a system that exposes data that had a non-linear function applied to it, we'd need a V2 version of the API
  • we probably don't need depthFar even if the data is normalized to the [near, far] range, because this can be handled by the rawValueToMeters factor. @bialpio to show the receipts aka math equations to convince others (and himself) - see the sketch after this list
  • exposing depthFar is easy, but it may break the mental model of app developers (because depthFar can be +Inf).
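
A sketch of the promised receipts, assuming the linear (no non-linearity) case from the conclusions above: if the raw buffer is normalized over [near, far] (raw = 0 at near, raw = 1 at far), then choosing rawValueToMeters = far - near and depthNear = near recovers the full range without exposing depthFar separately:

```ts
// Sketch: for a buffer linearly normalized over [near, far], far folds into the
// scale factor, so only depthNear needs to be exposed alongside rawValueToMeters.
function distanceForNormalizedBuffer(raw: number, near: number, far: number): number {
  const rawValueToMeters = far - near; // far disappears into the scale factor
  const depthNear = near;
  return depthNear + raw * rawValueToMeters; // = near + raw * (far - near)
}
```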