immersive-web / depth-sensing

Specification: https://immersive-web.github.io/depth-sensing/
Explainer: https://github.com/immersive-web/depth-sensing/blob/main/explainer.md

Early feedback request for the API shape

bialpio opened this issue

Hey all,

I'd like to ask people to take a look at the initial version of the explainer and let me know if there are any major problems with the current approach (either as a comment under this issue, or by filing a new issue). I'm looking mostly for feedback around API ergonomics / general usage, and possible challenges related to implementation of the API on various different kinds of hardware / in different browsers / etc.

+@toji, @thetuvix, @grorg, @cabanier, @mrdoob, @elalish

A couple of first thoughts:

  • Is there a way to directly push this data to the GPU? Having an intermediate JS buffer will create overhead (and more GC). If the data is going to be used by the GPU, it seems more appropriate to surface it there.
  • How is a single depth buffer going to work with stereo? Will each eye have its own depth buffer?

Thanks for taking a look! Responses below.

  • Is there a way to directly push this data to the GPU? Having an intermediate JS buffer will create overhead (and more GC). If the data is going to be used by the GPU, it seems more appropriate to surface it there.

We can extend the API to expose this capability via a method on XRWebGLBinding that hands out a WebGLTexture directly - would that work here? This way, implementations would be able to provide the data with less overhead if the underlying APIs allow it.
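
For illustration, a minimal sketch of what that could look like from the app's side - the method name, the returned texture attribute, and the variables xrSession, gl, and xrRefSpace are assumptions here, not a settled shape:

```js
// Sketch only: assumes an XRWebGLBinding method that hands back an object
// holding an opaque WebGLTexture; names are placeholders.
const binding = new XRWebGLBinding(xrSession, gl);

function onXRFrame(time, frame) {
  xrSession.requestAnimationFrame(onXRFrame);
  const pose = frame.getViewerPose(xrRefSpace);
  if (!pose) return;
  for (const view of pose.views) {
    const depthInfo = binding.getDepthInformation(view);
    if (depthInfo) {
      // The texture would be owned by the UA and only valid for sampling
      // during this rAF callback.
      gl.bindTexture(gl.TEXTURE_2D, depthInfo.texture);
      // ... bind it to a sampler in the occlusion shader and draw ...
    }
  }
}
```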

Two things worth calling out about the WebGLTexture approach:

  • What would be the expected lifetime of the texture? In our initial prototypes with the raw camera access API, we ran into issues related to this. We wanted to fully control the texture's lifetime, which is not really possible when exposing a WebGLTexture, and implementations relying on a swap chain / reusing buffers might run into this as well.
  • In our implementation, we'll have to push the data to the GPU ourselves, since ARCore does not provide a GL texture for us to use. This is fine - we can call glTexImage2D() in native code - but it does mean there's still going to be some overhead.

  • How is a single depth buffer going to work with stereo? Will each eye have its own depth buffer?

I'd say each eye can have its own depth buffer - the API already accepts an XRView, which is associated with a specific eye, so the implementation can use that to decide what needs to be returned. If it is able to provide stereo depth buffers, it can do so.
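
A rough sketch of how that looks per eye inside the rAF callback, assuming the XRFrame.getDepthInformation(view) entry point from the explainer (useDepthForView and xrRefSpace are placeholders):

```js
function onXRFrame(time, frame) {
  frame.session.requestAnimationFrame(onXRFrame);
  const pose = frame.getViewerPose(xrRefSpace);
  if (!pose) return;
  for (const view of pose.views) {
    // view.eye is 'left', 'right', or 'none'; on devices with stereo depth
    // sensing, each eye's view can yield its own depth buffer.
    const depthInfo = frame.getDepthInformation(view);
    if (depthInfo) {
      useDepthForView(view, depthInfo);
    }
  }
}
```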

Let me know if that makes sense!

What would be the expected lifetime of the texture

The layers spec defines a texture that is only valid during the rAF callback where it was created: https://immersive-web.github.io/layers/#xropaquetextures
Maybe you could use something similar? I'd love to see other specs use this texture type so we can all work on the proper definition.

In our implementation, we'll have to push the data to the GPU ourselves, since ARCore does not provide a GL texture for us to use. This is fine - we can call glTexImage2D() in native code - but it does mean there's still going to be some overhead.

You could also do it all in the browser process so you can avoid having to send the data to the render process and back.

I'd say each eye can have its own depth buffer

Does that mean that devices that have a single physical depth sensor won't be able to support this spec? Or would they have to reproject the depth map for each eye?

The layers spec defines a texture that is only valid during the rAF callback where it was created: https://immersive-web.github.io/layers/#xropaquetextures
Maybe you could use something similar? I'd love to see other specs use this texture type so we can all work on the proper definition.

Ooh, thanks for the pointer! I think this is effectively what we currently do in our prototype implementation for raw camera access API, but I was concerned about exposing the data as a WebGLTexture since that seemed to me like a potential footgun - for example, users could still try and call gl.deleteTexture() on them. Maybe we can try to work around this by providing very detailed diagnostics for the cases where the textures are not used correctly.

You could also do it all in the browser process so you can avoid having to send the data to the render process and back.

Good point - if we do not care about exposing the data on the CPU at all, this is definitely something we should do. The question then is: should we focus only on the GPU use case? From what I've seen, the data we get from ARCore is detailed enough that leveraging it on the CPU for physics is probably going to be beneficial, but for GPU use cases (occlusion), it won't be good enough.
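
For example, a CPU-side consumer might query individual depth values directly. A minimal sketch, assuming a per-pixel accessor along the lines of getDepthInMeters(x, y) taking normalized view coordinates (the name follows later spec text and is an assumption here):

```js
const depthInfo = frame.getDepthInformation(view);
if (depthInfo) {
  // Depth at the center of the view, e.g. to keep a virtual object from
  // being placed inside real-world geometry.
  const distance = depthInfo.getDepthInMeters(0.5, 0.5);
  if (distance < 0.25) {
    // Too close to a real surface - push the object back or skip placement.
  }
}
```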

Does that mean that devices that have a single physical depth sensor won't be able to support this spec? Or would they have to reproject the depth map for each eye?

Either reproject the depth map, or expose an additional XRView that will only be used to obtain that single depth map. In that case, it may be better to expose the more sensor-oriented features via something other than XRView (the raw camera access API also happens to need such a mechanism, so we can think about the best way to expose those kinds of features).

users could still try and call gl.deleteTexture() on them. Maybe we can try to work around this by providing very detailed diagnostics for the cases where the textures are not used correctly.

deleteTexture should indeed not be allowed. I will update the spec.
Can you think of other calls that should be disallowed?

Does that mean that devices that have a single physical depth sensor won't be able to support this spec? Or would they have to reproject the depth map for each eye?

Either reproject the depth map, or expose an additional XRView that will only be used to obtain that single depth map. In that case, it may be better to expose the more sensor-oriented features via something other than XRView (the raw camera access API also happens to need such a mechanism, so we can think about the best way to expose those kinds of features).

I suspect the system would have to do the reprojection because it would be a big burden on the author to do this correctly.
Do you know of any stereoscopic systems that expose a depth map?

users could still try and call gl.deleteTexture() on them. Maybe we can try to work around this by providing very detailed diagnostics for the cases where the textures are not used correctly.

deleteTexture should indeed not be allowed. I will update the spec.
Can you think of other calls that should be disallowed?

No, gl.deleteTexture() is what I was mostly concerned about.

Does that mean that devices that have a single physical depth sensor won't be able to support this spec? Or would they have to reproject the depth map for each eye?

Either reproject the depth map, or expose an additional XRView that will only be used to obtain that single depth map. In that case, it may be better to expose the more sensor-oriented features via something other than XRView (the raw camera access API also happens to need such a mechanism, so we can think about the best way to expose those kinds of features).

I suspect the system would have to do the reprojection because it would be a big burden on the author to do this correctly.

Won't this depend on the intended usage of the depth map? If the goal is to perform some simple environment reconstruction based on depth map data, I imagine the authors will already need to know what they are doing when trying to leverage the data we return.

Do you know of any stereoscopic systems that expose a depth map?

Unfortunately, I do not - I've seen articles about HoloLens Research Mode that referred to depth cameras, but I did not see a way to access the data through the API. FWIW, based on the documentation available for ARCore, it seems that an RGB camera is sufficient to provide a (maybe less precise) depth map, so maybe it's something that could be made available on stereoscopic devices with multiple cameras as well.

Call to action for potential implementers of the depth-sensing API! Please take a look at the issues linked above, the explainer, and PR #8 with an early draft of the spec (it should hopefully contain most of the relevant details), taking extra care to ensure that the API can be implemented in other browsers!

Other things that will be adjusted that I am tracking:

  • #6 - meters vs millimeters - need to expose the data in a format-agnostic way to allow for efficient implementation on different platforms
  • #10 - lock down XRDepthInformation to be usable only when XRFrame is active

/agenda for visibility & to ensure that the initially proposed API shape is implementable for other form factors

For texture lifetime, consider a flow where the user allocates a texture, and receives data in it through texImage2D.

In the explainer, it looks like the proposal is to use texImage2D to upload a buffer that's in memory. The GPU->CPU->GPU round trip could be minimized by creating a new source for texImage2D, similar to gl.texImage2D(target, level, internalformat, format, type, HTMLVideoElement? pixels) or gl.texImage2D(target, level, internalformat, format, type, HTMLCanvasElement? pixels).

We really should have a texImage2D(XRDepthInformation), and make the CPU-data path optional (maybe Promise<Uint16Array> getData()?).
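
A sketch of the difference, with attribute names following the explainer; the texImage2D(..., XRDepthInformation) overload is hypothetical and does not exist in WebGL today:

```js
// Today: the depth data crosses through a JS-visible buffer before upload.
const depthInfo = frame.getDepthInformation(view);
gl.bindTexture(gl.TEXTURE_2D, depthTexture);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.LUMINANCE_ALPHA,
              depthInfo.width, depthInfo.height, 0,
              gl.LUMINANCE_ALPHA, gl.UNSIGNED_BYTE,
              new Uint8Array(depthInfo.data));

// Hypothetical overload discussed above: let the UA hand its own copy of the
// data straight to WebGL, skipping the CPU round trip.
// gl.texImage2D(gl.TEXTURE_2D, 0, gl.LUMINANCE_ALPHA,
//               gl.LUMINANCE_ALPHA, gl.UNSIGNED_BYTE, depthInfo);
```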

In particular, having to go through the CPU data path means a couple of extra copies:

  • 1: getDepthInformation: 1-2 copies
    • A: (maybe) readback from GPU to CPU
    • B: copy from privileged process to content process
  • 2: texImage2D(ArrayBufferView): 2 copies
    • A: copy from content process to privileged process
    • B: upload to GPU resource

There are some concerns about how to upload this to WebGL, but I think DEPTH_COMPONENT16 would always work and would be great for this, allowing both direct sampling of values (through the R channel) as well as depth-comparison sampling operations, which I can imagine might be desirable here. ;)
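
A sketch of that upload path, assuming a WebGL 2 context and the raw 16-bit depth values in a Uint16Array named depthData:

```js
// DEPTH_COMPONENT16 keeps the values directly samplable (R channel) and also
// works with depth-comparison sampling if a comparison mode is set later.
gl.bindTexture(gl.TEXTURE_2D, depthTexture);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.DEPTH_COMPONENT16,
              width, height, 0,
              gl.DEPTH_COMPONENT, gl.UNSIGNED_SHORT,
              depthData);
```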

and make the CPU-data path optional (maybe Promise<Uint16Array> getData()?)

An async API here definitely would not be good, as real-time applications need this data synchronously. There is time in background I/O to pass the data across before it is needed by the JS thread, so synchronous blocking should not be an issue.

On HoloLens 2, serving up depth images to apps will involve rendering the world mesh on the GPU, and so it would be wasteful for an app that intends to use the data on the GPU (e.g. for occlusion) to move the images from GPU to CPU and back to GPU again.

Excited to find the right GPU-first path here for apps that end up using GPU-derived images on the GPU!

Please take a look at PR #11 - hopefully it addresses the issues we chatted about during the call. Most importantly, it allows the user to specify the desired usage of the data, thus allowing user agents to do the right thing and minimize the number of data round trips. Additionally, it allows UAs to express their preferred usage and data format, with the intent that if the user picks the preferred settings, it will incur the lowest possible cost.
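
Roughly, the shape proposed there lets the page request the feature with ordered preferences and then read back what the UA actually granted; the member names below follow the PR and may still change:

```js
const session = await navigator.xr.requestSession('immersive-ar', {
  requiredFeatures: ['depth-sensing'],
  depthSensing: {
    usagePreference: ['gpu-optimized', 'cpu-optimized'],
    dataFormatPreference: ['luminance-alpha', 'float32'],
  },
});

// The UA reports which usage / format it picked, so the app can choose the
// matching code path (GPU texture vs. CPU buffer).
console.log(session.depthUsage, session.depthDataFormat);
```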

I took a look at the current API and couldn't find anything about "raw depth" information - I'm not sure if this is the right place to ask/give feedback.

Currently the reported depth values seem to be smoothed, which might be good in a lot of cases, but in situations where the device (like a smartphone) is used to 3D scan an environment (just collect data and post-process later), it would be great if there were an option to report raw depth (and confidence, if available).

I haven't thought about what a good API for that would look like, but IMO it would certainly be good to have that option, because currently the only way to do this is to use native ARCore (or ARKit).
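
Purely as an illustration of what such an opt-in could look like (this is invented here and is not part of the spec or the PRs above):

```js
// Hypothetical, NOT in the spec: an extra preference asking the UA for
// unsmoothed ("raw") depth plus a confidence buffer where the platform
// (e.g. ARCore's Raw Depth API) can provide them.
const session = await navigator.xr.requestSession('immersive-ar', {
  requiredFeatures: ['depth-sensing'],
  depthSensing: {
    usagePreference: ['cpu-optimized'],
    dataFormatPreference: ['luminance-alpha'],
    // Invented member name - for illustration only:
    depthTypePreference: ['raw', 'smoothed'],
  },
});
```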