immersive-web / proposals

Initial proposals for future Immersive Web work (see README)


Requirements for subtitles and text in WebXR

pthopesch opened this issue · comments

Introduction

Subtitle implementations exist for 360° web players, but their functionality is often limited (typically, all styling and positioning is hard-coded in the player). Also, there is not yet an established way of representing a subtitle service (or text in general) in 360° or VR/XR environments. Existing subtitle formats and standards do not consider XR.

Target: Find a standardized way to support subtitles and text in VR and XR environments.

In the following description we collected:

  • Use case descriptions for subtitles in 360° (documented in a separate issue: #39 )
  • Requirements for subtitles in 360° videos
  • Links to existing standards
  • Thoughts on the direction for a solution

The use cases described were developed and implemented in the ImAc project (Immersive Accessibility, www.imac-project.eu) and only consider subtitles. However, we suggest discussing the topic with a more general scope in mind: text in Mixed Reality environments.

Use cases

The use cases are documented in a separate issue: #39

Requirements

No. 1 - Transmission

Requirement description:

A means is required to transmit spatial information that relates a subtitle to the 360°/VR space, in a suitable distribution format and in a standardized way. Spatial information includes:
a) Spatial information describing a direction
b) Spatial information describing depth or disparity

Further explanations:

Subtitle – In this context, a subtitle comprises all text that is active at a given point in time and that can be assigned to one speaker or sound source. That means that, for example, a two-line teletext subtitle with each line assigned to a different speaker can be understood as two subtitles. Background: each line would be enriched with its own spatial metadata.

Affected formats – The requirement relates to all formats and transmission paths/channels from the point where spatial metadata is created up to the end user device/player. This includes the distribution format and probably (depending on the workflow) a production format as well.

Relation between subtitle and VR space – The spatial information needs to be defined in such a way that a subtitle can be described as an object in 3D space during the rendering process (refer to requirements no. 2 and no. 4).

Examples & notes:

  • Spatial information could be described with two angles (azimuth, elevation) plus depth information, or as a 3D vector (x, y, z).
  • The spatial information may not be used for rendering the subtitle itself, but for adding graphical elements as in use cases 1 and 2. In that case it may be sufficient to include only some properties of the spatial information.
  • The spatial information may only describe a point in 3D space. Further information about what the subtitle object looks like (a plane in 3D space with a specific size, the surface of a sphere, a 3D text object) may be defined as part of the rendering context, e.g. by the distribution format used. A sketch of a possible data structure follows this list.
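
As a non-normative illustration of the first and third bullets, the two representations could be captured in a small data structure like the following TypeScript sketch. All names are hypothetical and not taken from any existing format:

```typescript
// Hypothetical data structures for the spatial metadata described above:
// either two angles plus optional depth, or a Cartesian 3D vector.
interface AngularPosition {
  azimuthDeg: number;   // horizontal angle, 0 = reference direction
  elevationDeg: number; // vertical angle, 0 = horizon
  depth?: number;       // optional distance/disparity hint
}

interface Vector3 {
  x: number;
  y: number;
  z: number;
}

// Convert the angular form into a 3D vector, assuming a right-handed
// coordinate system with -z as the reference direction
// (azimuth = elevation = 0) and y pointing up.
function toVector(pos: AngularPosition): Vector3 {
  const az = (pos.azimuthDeg * Math.PI) / 180;
  const el = (pos.elevationDeg * Math.PI) / 180;
  const r = pos.depth ?? 1; // place on the unit sphere if no depth is given
  return {
    x: r * Math.cos(el) * Math.sin(az),
    y: r * Math.sin(el),
    z: -r * Math.cos(el) * Math.cos(az),
  };
}
```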

No. 2 - Coordinate System

Requirement description:

A 3D coordinate system is required that defines the relation between the projected 360° video or VR scene and the subtitles. The properties of this coordinate system need to be available during the authoring process.

Examples & notes:

  • Some kind of reference/zero point is needed that describes, e.g., how a video is mapped in 3D space, or which metadata contain that information. For instance, the direction described with two angles (azimuth = 0, elevation = 0) could point to the center of the equirectangular video texture (see the sketch after this list).
  • The coordinate system may be described by the distribution format specification.
  • The coordinate system must be available during the authoring process such that the spatial information (refer to requirement no. 1) can be created correctly.
  • If important metadata (e.g. the reference point for the center of a 360° video) are set after the authoring process (e.g. during playout), then the spatial information of the subtitles must be updated accordingly in some way.
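
To illustrate the reference-point idea from the first bullet, here is a minimal sketch that assumes an equirectangular texture whose center corresponds to azimuth = 0, elevation = 0, with azimuth spanning ±180° and elevation ±90°. These conventions are illustrative only, not taken from any specification:

```typescript
// Map a direction (relative to the reference/zero point) to texture
// coordinates on an equirectangular video texture; (0.5, 0.5) is the center.
function directionToTextureUV(azimuthDeg: number, elevationDeg: number) {
  const u = 0.5 + azimuthDeg / 360;   // u in [0, 1] for azimuth in [-180, 180]
  const v = 0.5 - elevationDeg / 180; // v in [0, 1] for elevation in [-90, 90]
  return { u, v };
}
```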

No. 3 - Authoring

Requirement description:

A means is required to author spatial information as described under no. 1 in a suitable format.

Examples & notes:

  • Existing subtitle production formats, like TTML-based formats, could be used and enhanced with additional annotation data to support VR (a sketch of such an extension follows this list).
  • Existing editor tools need to be extended to support this data. Initially, the VR annotation could be created in a separate step.
  • Annotation could also be done in a (semi-)automated process in the future, e.g. when speakers can automatically be detected in the video picture.
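
As a rough illustration of the first bullet, the sketch below reads a TTML document carrying hypothetical extension attributes in a player. The namespace URI and attribute names are invented for this example; they are not part of IMSC or any published extension:

```typescript
// Parse a TTML snippet with invented "imac" extension attributes and hand the
// text plus its direction to the rendering layer.
const ttml = `
<tt xmlns="http://www.w3.org/ns/ttml"
    xmlns:imac="urn:example:imac:extension">
  <body><div>
    <p begin="00:00:05.000" end="00:00:08.000"
       imac:azimuth="-35" imac:elevation="10">
      Hello from the left speaker
    </p>
  </div></body>
</tt>`;

const EXT_NS = "urn:example:imac:extension";
const doc = new DOMParser().parseFromString(ttml, "application/xml");

for (const p of Array.from(doc.getElementsByTagName("p"))) {
  const azimuth = parseFloat(p.getAttributeNS(EXT_NS, "azimuth") ?? "0");
  const elevation = parseFloat(p.getAttributeNS(EXT_NS, "elevation") ?? "0");
  console.log(p.textContent?.trim(), { azimuth, elevation });
}
```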

No. 4 - Decoder behavior

Requirement description:

A standardized way to display subtitles in 3D space is required.

Examples & notes:

  • This requirement is listed here separately, but it refers to requirements no. 1 and no. 2. A distribution format specification may describe a decoding & presentation behavior.

Existing standards and possible directions for a solution

MPEG is currently working on MPEG-I, which includes the Omnidirectional Media Format (OMAF). The current draft describes the possibility of including subtitles (supported formats: IMSC, WebVTT). The standard describes how a 2D plane with subtitles is added to the OMAF VR scene.

Link: https://mpeg.chiariglione.org/standards/mpeg-i/omnidirectional-media-format

The IMSC format only covers 2D media, but can easily be extended with user-defined attributes. IMSC was used in the ImAc project, where we extended it with additional attributes describing the information we needed for the use cases described above.

Link: https://www.w3.org/TR/ttml-imsc1.0.1/

It seems like forward motion on this Issue has stalled so I'm going to close it. If you have a plan for how to make progress then ping me and we can re-open it.

@TrevorFSmith I would be very grateful if you could reopen this and #39

Let me explain the history of this issue:

The conversation on this topic dates back to the end of 2017. We were advised by the W3C immersive web colleagues to have real examples and prototypes first and then come back with our requirements. That made sense, so we came back with results in 2018.

In 2018 @cwilso pointed us to this repository, where our findings should be discussed as a proposal. We prepared #40 and #39; they were submitted four months ago. We have not received any comments from the XR community so far (which of course is not guaranteed).

In February 2019 this issue was discussed in a very well-attended call of the W3C Media & Entertainment group (see https://www.w3.org/2019/02/05-me-minutes.html). @AdaRoseCannon attended and reported back to the Immersive Web CG. As far as I can see from the Immersive Web CG minutes (https://lists.w3.org/Archives/Public/public-immersive-web/2019Feb/0017.html), there was no further discussion of this topic or a proposal for how to make progress.

From this experience, my question is how to handle this and other accessibility requirements for WebXR. Obviously accessibility requirements do not have a strong lobby in XR, and I understand there are a lot of other things to solve. As I understand your procedure, if no one else is interested in a matter (and comments seem to be how this is measured), the issue gets closed.

Clearly there is interest in the industry to solve this, there are implementations, and the discussion in general has not stalled (there is also an activity starting in the W3C Timed Text Working Group on this topic). But I am not sure about two things: Is this repository the right place to discuss such accessibility requirements for XR? If not, it may be more efficient to use other channels. If it is: how do we get the W3C XR community interested in such requirements? A comment from the W3C XR WG/CG experts would be good, even if they say it is out of scope of their work.

@pthopesch Thanks for this detailed information. Do you think there is enough information to propose a draft specification?

+1 @TairT and reopening this (and related) issues.

Thank you for the context, @TairT.

It sounds like a good agenda item for one of our every-other-week CG calls. I'd like to give people time to read and prepare so tomorrow's call is too soon. Are you free to attend the next CG call on April 23rd at 10am Pacific?

In the meantime, I'll re-open this Issue.

Thanks @TrevorFSmith for reopening and @LJWatson for the comment.

It seems like a good idea to have a first discussion on this in one of the next XR CG calls. Unfortunately I will be traveling on the 23rd and will not be able to join the call on that date. I should be able to join a later CG call.

In this issue there are different requirements which need to be worked on in different specifications and working/community groups. For example, the W3C Timed Text Working Group has decided to take on board a requirement dealing with positioning in 360° (see w3c/tt-reqs#8). For other requirements we still need to discuss where the best place to deal with them is.

I am also including the co-chair of the Timed Text Working Group, @nigelmegitt, on this thread.

@TairT Great! I'll put this Issue on the agenda for April 23rd.

One of the aspects of this work that I'd like to understand is how to think about the relationship between subtitles in videos and annotations, live translations, and other sorts of "text in the world."

While videos (spherical or volumetric) are one piece of the immersive web, there are a variety of other uses where it would be good to show the user text in a similar manner to how we will show subtitles.

Two quick examples:

If someone is in a VR experience with animated characters then the experience creators will want to increase accessibility by showing subtitles for the words spoken by the characters.

If you're wearing AR glasses that place a 2D video on your living room wall, then we'll need to be able to show subtitles for that video, perhaps with an option for the subtitles to stay in your field of view so that you can follow along even when not looking at the video.

So, I guess I wonder if showing subtitles is one instance of a more general category of showing text that could be used (and probably abused) in a variety of XR situations.

@TrevorFSmith

Great! I'll put this Issue on the agenda for April 23rd.

Actually, at least for me it would be better to have the call after Apr 23; I cannot attend on that day. I would be very interested in attending the call.

So, I guess I wonder if showing subtitles is one instance of a more general category of showing text that could be used (and probably abused) in a variety of XR situations.

Correct. This is also one reason why we are bringing it up in this community.

The concrete requirement from the ImAc EU project is accessibility in 360°. The scope of the TTWG work is subtitles/timed text. But beyond these stakeholders it is clear that it ticks the box for text in XR.

From the recent conversations I got the feedback that although the general question is text in XR, a solution for text in 360° may be closer at hand. For VR and AR there are many more open questions to be answered.

So one of the questions is whether we should start by solving the concrete requirement for subtitles in 360°, or work on the more general question right away.

Thanks @TairT and @TrevorFSmith for picking up on this issue. I'm very happy to see interest in this topic from different sides.

I'd also like to join the CG call when this is on the agenda. The 23rd of April is unfortunately not possible for me either (I will be on vacation the whole week).

@TairT @pthopesch The CG call after that will be on May 7th and I'll move the agenda item.

That is the week of Google IO so I doubt we'll have many Googlers on the call. It's probably worth chatting with a couple of them beforehand.

@TrevorFSmith Thanks! You are right, it would be good to have them on the call. Let's consider moving the issue another two weeks later (May 21?). I think it would be better to have them all in one call to sync on this issue. Meanwhile we can continue to scope the topic in this issue. What do you think?

Ok, I'll add this to the May 21st CG call agenda.

Great! Thanks!

@TrevorFSmith Thanks for putting this topic on the agenda for the May 21st call.

My interest for the call would be to clarify the following (after we have briefly discussed the requirements):

  • Given the current standard landscape: where are these requirements "in scope"?
  • Which W3C working/community group is the right community to address which requirements?
  • How can we limit the problem scenario so that a midterm standard solution is achievable?

This issue has been discussed in a call on May 21st with interesting input from different participants of the immersive web community group. @TrevorFSmith: How can I link to the minutes of the call from this issue?

In the absence of minutes for last week's CG meeting (please let me know if I missed the URI for the minutes), I would like to summarize a few topics that were mentioned (from my personal recollection):

  • There was general agreement that this falls within the scope of the Immersive Web CG/WG, but it is not clear in which regard. Therefore it seemed difficult to agree on a concrete next step.

  • Although subtitles in XR also cover AR/VR, it seemed OK to start with 360° videos.

  • For the use case of subtitles in 360°, the W3C TTWG seems to be the right group to handle part of the requirements.

  • 360° video is not a primitive in the related immersive standards work, so the question is at which point the Immersive Web CG/WG could start on this topic.

  • Obviously, subtitles in XR bring into focus the more general question of how labels/text are rendered.

  • One proposal was to implement a user library that demonstrates how subtitles in 360° could be rendered, and to use this as the basis for defining requirements for the XR CG/WG.

@johnpallett You wondered what the role of web standards should be for 360° video subtitles, especially as this topic seems to be more transmission-related from your perspective, and you mentioned that prior work has been done in broadcast standards that one can build on. You offered to clarify this in this issue. Thanks for that 👍 Could you explain your position/thoughts a bit and also which broadcast standards you are referring to?

My suggestion was that there may be elements of the WebXR Device API that might be useful in transmission standards, possibly even existing elements. There are other examples of web APIs being used as part of such standards (e.g. ATSC). Without a more detailed analysis of the requirements and existing broadcast standards, though, I don't have a specific suggestion.

@johnpallett Thanks!

My suggestion was that there may be elements of the WebXR Device API that might be useful in transmission standards, possibly even existing elements

@johnpallett True. It may be useful to know what is already ongoing. I am aware of the OMAF standard and the Guidelines of the VR Industry Forum.

Without a more detailed analysis of the requirements and existing broadcast standards, though, I don't have a specific suggestion.

It may indeed be interesting to see if there is any overlap between OMAF and WebVR. An exchange between the two groups may be useful.

@TrevorFSmith You mentioned that the standardisation of the subtitle positioning with TTML may learn from the WebXR positioning scheme. Could you point me to the relevant sections in the XR spec?

@TairT Here is the current spec for XRSpace and XRReferenceSpace which are used to hold information about a coordinate system and poses within them. There is also a section on geometric primitives that describes how various aspects like matrices are represented.

One way to indicate the pose of the source of a particular piece of timed text (e.g. the person who spoke) would be to use an XRRigidTransform.
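
For illustration only, a minimal sketch of that idea follows. XRRigidTransform and its matrix getter are part of the WebXR Device API; the hard-coded speaker position and the surrounding subtitle placement logic are hypothetical:

```typescript
// Pose of the sound source (e.g. the speaker), expressed in the same
// XRReferenceSpace as the viewer. Position values are made up for this sketch.
const speakerPose = new XRRigidTransform(
  { x: -1.5, y: 1.6, z: -3.0, w: 1 }, // position of the sound source
  { x: 0, y: 0, z: 0, w: 1 }          // orientation (identity quaternion)
);

// The 4x4 column-major matrix can be used directly to place a text quad
// in the WebGL scene that renders each XR frame.
function subtitleModelMatrix(): Float32Array {
  return speakerPose.matrix;
}
```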

Thanks @TrevorFSmith! That is really helpful : ) I will have a look at it!

The requirements of this issue have been discussed at a general assembly of the EU-funded project Immersive Accessibility (http://www.imac-project.eu/). See below for the outcome.

Four different requirements were discussed:

A - Means to position subtitle regions in the coordinate system of a 360 video.

B - Means to position subtitle regions on a 2D plane in the field of view of the user.

C - Means to locate the audio source of the timed text horizontally (e.g. with a longitude coordinate).

D - Means to locate the audio source of the timed text horizontally AND vertically.

There was agreement that, based on user tests, requirements B and C are clearly MUST HAVE requirements. The past user tests in ImAc have not shown evidence that requirement A has a MUST HAVE priority. It was commented that research done by the Ludwig Maximilian University Munich (LMU) has shown that subtitles "fixed" at the speaker location were considered helpful. It was also said that there may be other ways of combining a speaker-oriented subtitle position with the constraint that a subtitle is always in the field of view of the user. It was agreed that for the moment no priority can be given for requirement A, and further investigation is necessary. Although requirement D is not a MUST HAVE requirement now, it becomes one if requirement A is also implemented.

It was agreed to present all requirements to the standards working group to see if working group members or other stakeholders would set the priorities differently.

IMSC was identified as the format to which immersive-specific data needs to be added to support authoring, exchange, distribution and presentation of 360° subtitles.

(Note: This comment was edited because in the original version options A and B had been mixed up by mistake. Option B has MUST HAVE priority and option A needs to be further investigated.)

@TrevorFSmith After carefully reading through the WebXR specification draft, I now have a clearer view of the overlap with the requirements of this issue. As mentioned before by members of this group, the specification treats the content sent to an XR device as a kind of WebGL-rendered black box. In fact, the ImAc project uses WebVR/XR to pre-render subtitles in WebGL and then send them to the device.

As written above, the most pressing requirement now is to render text that is always visible in the field of view of the user ("fixed to screen"). The position would be static and on a 2D plane.

I think that is not currently in scope of the WebXR API. There may be an overlap with the DOM overlay work (https://github.com/immersive-web/dom-overlays).

Subtitles defined in the IMSC format need to be pre-rendered by the browser as DOM elements and then sent to the device.
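
As a sketch of this pre-render-then-composite idea (not the actual ImAc implementation), the subtitle text could be rasterized to a 2D canvas and attached to a quad in the WebGL scene, here using three.js:

```typescript
import * as THREE from "three";

// Rasterize a subtitle string to a canvas and wrap it in a textured quad
// that can be added to the 3D scene (e.g. fixed in front of the camera).
function makeSubtitleQuad(text: string): THREE.Mesh {
  const canvas = document.createElement("canvas");
  canvas.width = 1024;
  canvas.height = 128;
  const ctx = canvas.getContext("2d")!;
  ctx.fillStyle = "rgba(0, 0, 0, 0.6)"; // semi-transparent background box
  ctx.fillRect(0, 0, canvas.width, canvas.height);
  ctx.fillStyle = "#ffffff";
  ctx.font = "48px sans-serif";
  ctx.textAlign = "center";
  ctx.textBaseline = "middle";
  ctx.fillText(text, canvas.width / 2, canvas.height / 2);

  const texture = new THREE.CanvasTexture(canvas);
  const material = new THREE.MeshBasicMaterial({ map: texture, transparent: true });
  const geometry = new THREE.PlaneGeometry(2, 0.25); // size in scene units (metres)
  return new THREE.Mesh(geometry, material);
}
```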

Thoughts?

@TrevorFSmith It would be helpful to find the minutes from the community call of the XR group on May 21. Can you post a link to them?

@TairT Can I ask a clarification question: in #40 (comment) when you talk about "the coordinate system of a 360 video", are you talking about 3-axis angular positioning only, or also about translation positioning, e.g. position within a 2D plane, and altitude for the 3rd dimension?

Please note that #40 (comment) was edited. The options and implementation priorities had been mixed up by mistake.

@nigelmegitt Regarding requirement A in Andreas' post, it is hard to provide more details from our research, since the user tests we have done so far didn't confirm that positioning subtitles fixed to the video is a strong requirement.
However, I think that besides specifying a direction in the 3D coordinate system, the depth information might be a part of this requirement, for instance to cope with stereoscopic videos as well.
I'm not sure if I understand what you mean by "translation positioning". Do you mean shifting/tilting of the 2D plane, or do you refer to what's rendered onto the 2D plane?

@pthopesch I mean that in some VR/360º environments the user can walk around (translation positioning) as well as turning around (rotational positioning). I wasn't referring to positioning on a 2D rendering plane at all: I meant that the 3 dimensions of "walking around" movement are made up of 2D left/right + forward/backward as well as the 3rd dimension of altitude.

@TairT Here are the minutes from the IW call on May 21st: https://www.w3.org/2019/05/21-immersive-web-minutes.html

Regarding rendering "fixed to screen" text: In immersive displays, fixing content to the user's POV generally leads to nausea in a large percentage of people. I think the question we should be asking is how to provide timed text positions relative to a coordinate in a 360° video (e.g. words spoken by a character on screen) or associated with the video as a whole (e.g. an off-screen narrator's words). This way the text can be moved along with the content or follow the user around with a bit of lag, as if it's a physical object in the scene.

Since each WebXR experience is rendered by a custom WebGL engine, the best path might be to standardize the timed text information stream and then work with the engine teams to write code that correctly renders the timed text.
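
Purely as a hypothetical sketch of what such a standardized timed-text information stream might carry (this is not a proposed specification), each cue could combine timing, text, and an optional source pose, leaving the actual rendering to the engine:

```typescript
// Hypothetical cue structure an engine could consume to render immersive subtitles.
interface ImmersiveTextCue {
  begin: number; // seconds on the media timeline
  end: number;
  text: string;
  // Optional direction/depth of the sound source in the video's coordinate system.
  source?: { azimuthDeg: number; elevationDeg: number; depth?: number };
  // Whether the text is anchored to the scene or follows the viewer
  // (e.g. an off-screen narrator's words).
  anchoring: "scene" | "viewer";
}
```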

@nigelmegitt sorry I'm answering so late.

We have not looked into environments with 6 degrees of freedom (6DoF), where the user's head movements are tracked and translated into the scene. We only worked with 3DoF, where the user can just look around (through a virtual camera). Nor did we look into, let's say, game-like experiences where the user can navigate through a scene, for example by using a controller.

Actually, all tests we did were based on 360° video playback in VR glasses where the virtual camera is always positioned in one place (mostly the origin of the coordinate system and the center of the video sphere).

Of course, that doesn't mean that there is no such use case. Do you have something specific in mind or is it a general thought on this issue?

@pthopesch I would expect us to try to tackle the general case of 6DoF (nice terminology!) because AR and VR systems do use it, and understand how the solution collapses to something simpler if only 3DoF is needed.

@nigelmegitt @pthopesch The conclusion I got from different discussions in different constellations has been to first work out solutions for the three-degrees-of-freedom use case for 360° videos. I understand the thinking, @nigelmegitt, to get this right for the big picture (AR, VR, 360° and games). But at the moment the more pragmatic strategy for me would be to concentrate on the operational requirements as they have been tested with users. From my understanding, a solution for the 6DoF use case is also far more complex and will take more time. It may be good to document it as a long-term requirement.

@TrevorFSmith Thanks for the minutes : )

Regarding rendering "fixed to screen" text: In immersive displays, fixing content to the user's POV generally leads to nausea in a large percentage of people.

Thanks. Do you think this also applies to text? Can you point to accessible research on this topic?

I'd like to provide two links here that might be useful for this discussion:

  1. A demo of the player from the ImAc project, from which the subtitle requirements of this issue (#40) were derived, can be seen here: https://imac.gpac-licensing.com/player/
    The "Opera" content might be good to start playing around with since most access services have been created for this clip. It should run in most browsers that support WebVR, although I have problems with Firefox on my laptop. A short demo video of the player is also available, showing the main features including subtitles: http://www.imac-project.eu/immersive-corner/tutorials/

  2. The player developed in the project is available on github here: https://github.com/ua-i2cat/ImAc

@TairT Regarding "simulator sickness" I'll leave it to @blairmacintyre to weigh in on academic references.
Oculus does have a somewhat thorough explanation of tactics to use to avoid the sickness in the vision section of their design best practices manual.
One approach is to render multiple copies of the timed text around the user so that no matter their direction of view they see one of the copies.
Another approach would be to have the timed text render in front of the user in a fixed position in the environment, but when they look in a different direction, the text (after a short pause) moves to a new position within the user's view. This way the text isn't "locked" onto their view (so it's not a "HUD") but is generally available.
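
A minimal sketch of that second approach follows; the thresholds, timing, and the simple angle-based visibility test are arbitrary choices for illustration (azimuth wrap-around is not handled):

```typescript
const REPOSITION_DELAY_MS = 700; // "short pause" before the text follows the view
const MAX_OFFSET_RAD = 0.6;      // ~35° before the text counts as out of view

let outOfViewSince: number | null = null;
let textDirection = { azimuth: 0, elevation: 0 }; // current world-anchored direction

// Call once per frame with the viewer's current view direction (radians).
function updateSubtitlePlacement(viewAzimuth: number, viewElevation: number, nowMs: number) {
  const offset = Math.hypot(viewAzimuth - textDirection.azimuth,
                            viewElevation - textDirection.elevation);
  if (offset < MAX_OFFSET_RAD) {
    outOfViewSince = null; // still comfortably visible, keep it where it is
    return;
  }
  if (outOfViewSince === null) {
    outOfViewSince = nowMs; // start the grace period
  } else if (nowMs - outOfViewSince > REPOSITION_DELAY_MS) {
    // Move (or animate) the text to the new view direction, then reset.
    textDirection = { azimuth: viewAzimuth, elevation: viewElevation };
    outOfViewSince = null;
  }
}
```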

@TrevorFSmith Thanks a lot for your comments and references. Very helpful!

@TrevorFSmith @cwilso This may be a good time to move the discussion in this thread to a different environment. I propose to open a repo in the WICG.

For background info, see also the minutes of the discussion in the Timed Text Working Group on 2019-08-01 (https://www.w3.org/2019/08/01-tt-minutes.html#x06).

Although the kickoff of the specification activity for a TTML extension has started in the Timed Text Working Group (see w3c/tt-reqs#8), we need experts from different areas. @pthopesch and I would be happy to write an explainer and draft some first spec text.

I added a proposal to the WICG and drafted an explainer with @pthopesch.

Note also that there is a new W3C community group proposal for immersive captions (https://www.w3.org/community/groups/proposed/).

There was a very good discussion of this issue. See below for the minutes of the three meetings where it was discussed:

Hi, is there any web player that supports subtitles which can follow and stay fixed in front of the face?

@hungnm10 Can you give more details on the desired rendering? Do you imagine something similar to the YouTube player, where the subtitles always stay in the same position regardless of the viewer's position in the 360° space?

Yes, exactly like the YouTube player, but we need a player that can stream video from our server.