Reader for DFXP flavors

Question

fonic opened this issue 2 years ago

This error appears when trying to convert a TTML subtitle to SRT:

# tt convert -i subtitle.ttml -o subtitle.srt
Input file is subtitle.ttml
Output file is subtitle.srt
A tt element is not the root element
Traceback (most recent call last):
  File "Python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "Python\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "Python\Scripts\tt.exe\__main__.py", line 7, in <module>
  File "Python\lib\site-packages\ttconv\tt.py", line 423, in main
    args.func(args)
  File "Python\lib\site-packages\ttconv\tt.py", line 374, in convert
    srt_document = srt_writer.from_model(model, writer_config, progress_callback_write)
  File "Python\lib\site-packages\ttconv\srt\writer.py", line 199, in from_model
    ISD.generate_isd_sequence(doc, _isd_progress, is_multithreaded=isd_config.multi_thread if isd_config is not None else True)
  File "Python\lib\site-packages\ttconv\isd.py", line 321, in generate_isd_sequence
    sig_times = ISD.significant_times(doc)
  File "Python\lib\site-packages\ttconv\isd.py", line 237, in significant_times
    doc_regions = list(doc.iter_regions())
AttributeError: 'NoneType' object has no attribute 'iter_regions'

Subtitle:
subtitle.zip

Andreas Tai · Answer 1 · Wed Aug 17 2022 23:54:41 GMT+0800 (China Standard Time)

@fonic Your document uses namespaces from an early, pre-Recommendation version of TTML, which is often referred to as DFXP. The namespaces changed before TTML became a W3C Recommendation. You must use the correct namespaces, which are documented, for example, in IMSC 1.2 (https://www.w3.org/TR/ttml-imsc1.2/#namespaces).
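
For anyone hitting the same "A tt element is not the root element" message, a quick way to confirm the diagnosis is to look at the namespace of the root element. The following is only a minimal sketch using Python's standard library; the helper names `root_namespace` and `is_ttml_root` are made up for illustration and are not part of ttconv.

```python
import xml.etree.ElementTree as ET

# Namespace of the TTML Recommendation, as documented in IMSC 1.2.
TTML_NS = "http://www.w3.org/ns/ttml"


def root_namespace(xml_text):
    """Return the namespace URI of the root element ('' if it has none)."""
    root = ET.fromstring(xml_text)
    # ElementTree encodes qualified names as "{namespace}localname".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def is_ttml_root(xml_text):
    """True if the root element is <tt> in the TTML Recommendation namespace."""
    root = ET.fromstring(xml_text)
    return root.tag == "{" + TTML_NS + "}tt"
```

If `is_ttml_root` returns False, printing `root_namespace(...)` shows which namespace the file actually declares.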

Fonic · Answer 2 · Thu Aug 18 2022 00:05:23 GMT+0800 (China Standard Time)

So it's basically a legacy version of TTML. Could you add support for that? That would be great since there seem to be a lot of TTMLs in this format (this one is from 2019).

Pierre-Anthony Lemieux · Answer 3 · Thu Aug 18 2022 00:10:32 GMT+0800 (China Standard Time)

So it's basically a legacy version of TTML. Could you add support for that? That would be great since there seem to be a lot of TTMLs in this format (this one is from 2019).

A couple of questions to help understand if/how that deprecated flavor can be supported:

Does the document conform to any particular delivery specification?
What tool was used to generate it?
Is the document intended for a specific user/platform?

Fonic · Answer 4 · Thu Aug 18 2022 00:18:14 GMT+0800 (China Standard Time)

It was downloaded from here using yt-dlp with option --write-subs.

I can't answer the other questions as I did not create the subtitles. What I can say is that TTMLs of newer videos from the same source are in a different format (and work with ttconv out of the box), so it would seem they switched the format at some point.

Andreas Tai · Answer 5 · Thu Aug 18 2022 00:26:48 GMT+0800 (China Standard Time)

Regarding @palemieux's first question:

The spec version that the document conforms to could be this one:

https://www.w3.org/TR/2009/CR-ttaf1-dfxp-20090924/

You can try to validate your file by using the schema provided here:

https://www.w3.org/TR/2009/CR-ttaf1-dfxp-20090924/#dfxp-schema-xsd

Fonic · Answer 6 · Thu Aug 18 2022 00:32:58 GMT+0800 (China Standard Time)

For testing, I replaced the namespace with http://www.w3.org/ns/ttml and conversion with ttconv now almost works, except for a duplication of all entries due to overlapping timestamps.
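
The namespace replacement described above could be sketched roughly as follows. This is not taken from the thread: the legacy base URI `http://www.w3.org/2006/10/ttaf1` and the `#style` to `#styling` rename are assumptions about what such legacy files typically declare, so check them against the `xmlns` attributes in your own file. The function name `rewrite_namespaces` is hypothetical.

```python
import re

# ASSUMPTION: namespace base used by the legacy document; verify against
# the xmlns declarations in the actual file.
LEGACY_BASE = "http://www.w3.org/2006/10/ttaf1"
TTML_BASE = "http://www.w3.org/ns/ttml"

# ASSUMPTION: the legacy styling namespace ends in "#style", whereas the
# TTML Recommendation uses "#styling". Other fragments are kept unchanged.
FRAGMENT_RENAMES = {"#style": "#styling"}


def rewrite_namespaces(xml_text, legacy_base=LEGACY_BASE):
    """Replace legacy namespace URIs with the TTML ones, text-wise."""
    pattern = re.compile(re.escape(legacy_base) + r"(#[A-Za-z]+)?")

    def _swap(match):
        fragment = match.group(1) or ""
        return TTML_BASE + FRAGMENT_RENAMES.get(fragment, fragment)

    return pattern.sub(_swap, xml_text)
```

Note that this is a plain text substitution, so it also touches the URI if it happens to appear in caption text, and it does not cover the further tweaks mentioned later in the thread.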

Pierre-Anthony Lemieux · Answer 7 · Thu Aug 18 2022 01:12:00 GMT+0800 (China Standard Time)

except for a duplication of all entries due to overlapping timestamps.

This is by design since it looks like the caption file is intended to mimic CEA 608 roll-up.

@andreastai Does WebVTT support regions for that style of captions?

What I can say is that TTMLs of newer videos from the same source are in a different format (and work with ttconv out of the box), so it would seem they switched the format at some point.

@nigelmegitt does BBC offer a tool to convert legacy DFXP documents?

For testing, I replaced the namespace with http://www.w3.org/ns/ttml and conversion with ttconv now almost works,

A few more tweaks are needed -- see subtitle.zip.

Fonic · Answer 8 · Thu Aug 18 2022 03:16:33 GMT+0800 (China Standard Time)

A few more tweaks are needed -- see subtitle.zip.

Thanks, that converts much better.

except for a duplication of all entries due to overlapping timestamps.

This is by design since it looks like the caption file is intended to mimic CEA 608 roll-up.

Just out of curiosity: what does that mean / why exactly are the entries overlapping?

Could an option be added to ttconv to correct this? From the look of it, the fix would simply be to use the previous entry's end timestamp as the start timestamp of the next entry whenever overlapping between two consecutive entries is detected.
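
The fix proposed above can be sketched in a few lines. This is an illustration only, not an existing ttconv option: the `Cue` type and the `clamp_overlaps` function are hypothetical, times are held as integer milliseconds, and the cues are assumed to be sorted by begin time.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cue:
    """Hypothetical stand-in for one <p> entry; times in milliseconds."""
    begin_ms: int
    end_ms: int
    text: str


def clamp_overlaps(cues):
    """Move each cue's begin to the previous cue's end when they overlap.

    Assumes the cues are sorted by begin time.
    """
    fixed = []
    for cue in cues:
        if fixed and cue.begin_ms < fixed[-1].end_ms:
            cue = replace(cue, begin_ms=fixed[-1].end_ms)
        # A cue that lies entirely inside the previous one would end up
        # empty after clamping; it is skipped here, which drops its text.
        if cue.begin_ms < cue.end_ms:
            fixed.append(cue)
    return fixed
```

Applied to the two entries quoted later in the thread (02.00 to 04.96 and 04.92 to 06.64), the second entry's begin would move from 04.92 to 04.96.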

Pierre-Anthony Lemieux · Answer 9 · Thu Aug 18 2022 05:10:33 GMT+0800 (China Standard Time)

Just out of curiosity: what does that mean / why exactly are the entries overlapping?

TL;DR: there is probably a bug in the tool that generated the content.

I looked closer at the document:

      <p begin="00:00:02.00" id="p0" end="00:00:04.96">This programme contains scenes which<br />some viewers may find upsetting,</p>
      <p begin="00:00:04.92" id="p1" end="00:00:06.64">and some strong language<br />from the start.</p>

It looks like the overlap (40ms) is exactly one frame at 25 fps, so here's what I think happened: the document was generated from a source file that contained in and out points and the converter incorrectly assumed that both in and out points were inclusive, whereas the out point is exclusive in TTML.
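
The arithmetic behind the one-frame observation can be checked with a short sketch. The `to_seconds` helper is hypothetical; it only handles the `hh:mm:ss.fraction` clock form used in the two entries above, and it uses exact fractions to avoid floating-point rounding.

```python
from fractions import Fraction


def to_seconds(clock):
    """Convert 'hh:mm:ss.fraction' to an exact number of seconds."""
    hours, minutes, seconds = clock.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + Fraction(seconds)


# End of p0 minus begin of p1, taken from the two entries quoted above.
overlap = to_seconds("00:00:04.96") - to_seconds("00:00:04.92")

frame_duration = Fraction(1, 25)  # one frame at 25 fps, i.e. 40 ms
overlap_in_frames = overlap / frame_duration
```

Here `overlap` comes out as 1/25 of a second, so `overlap_in_frames` is exactly 1.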

ttconv had a somewhat similar bug at some point: #258

Just a guess of course. Impossible to tell without looking at the source and the converter.

Nigel Megitt · Answer 10 · Thu Aug 18 2022 17:59:55 GMT+0800 (China Standard Time)

What I can say is that TTMLs of newer videos from the same source are in a different format (and work with ttconv out of the box), so it would seem they switched the format at some point.

@nigelmegitt does BBC offer a tool to convert legacy DFXP documents?

No. Our pragmatic workaround, while we have to support older documents, is to make adjustments in the player so that it makes a best effort at processing them. For example, in the BBC fork of imscJS we added namespace "fixing" code in bbc/imscJS@cf3d3e5, as well as some other fixes such as a more useful default region. I think I'm right in saying that the default region definition came into the TTML specification at a later stage than the pre-Recommendation DFXP era, so such adjustments are beyond what's in any specification, and should probably remain that way.

Pierre-Anthony Lemieux · Answer 11 · Thu Aug 18 2022 21:16:15 GMT+0800 (China Standard Time)

@nigelmegitt Thanks for the info!

Andreas Tai · Answer 12 · Thu Aug 18 2022 21:17:20 GMT+0800 (China Standard Time)

@andreastai Does WebVTT support regions for that style of captions?

WebVTT regions have a feature for that kind of subtitle (see § 1.4. Other caption and subtitling features, example 8). However, I am not sure to what extent this is implemented.

Pierre-Anthony Lemieux · Answer 13 · Thu Aug 18 2022 23:05:14 GMT+0800 (China Standard Time)

@fonic I changed the title of the issue to make it easier to find for folks that run into the same kind of documents.

Fonic · Answer 14 · Thu Aug 18 2022 23:38:33 GMT+0800 (China Standard Time)

@fonic I changed the title of the issue to make it easier to find for folks that run into the same kind of documents.

Sure, fine with me. Thanks for the background info, quite interesting actually.

It might be worth noting that the duplication caused by overlapping entries is not a big issue in practice (i.e. when using the converted SRT). You'll just see a quick flash, e.g.

1
00:00:02,000 --> 00:00:04,920
This programme contains scenes which
some viewers may find upsetting,

2
00:00:04,920 --> 00:00:04,960
This programme contains scenes which
some viewers may find upsetting,
and some strong language
from the start.

3
00:00:04,960 --> 00:00:06,600
and some strong language
from the start.

will result in entry 2 being visible for a single frame before being replaced by entry 3. Not perfect, but far from being unusable. Filtering out those duplicates would of course still be desirable.

Pierre-Anthony Lemieux · Answer 15 · Thu Aug 18 2022 23:40:44 GMT+0800 (China Standard Time)

Filtering out those duplicates would of course still be desirable.

It would be easier to simply drop all events that are less than 1/2 second long.

Fonic · Answer 16 · Thu Aug 18 2022 23:55:29 GMT+0800 (China Standard Time)

Filtering out those duplicates would of course still be desirable.

It would be easier to simply drop all events that are less than 1/2 second long.

But then you'll probably have someone open an issue due to a file that has longer overlaps - there will always be other malformed files out there. If you go that route, you should make this configurable, e.g. -d 0.5 to drop entries up to 0.5s in duration.
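
To make the trade-off concrete, a configurable duration filter could look something like the sketch below. It is hypothetical: `-d` is only a suggestion in this thread, not an existing ttconv flag, and the `drop_short_events` function and its `min_duration_ms` parameter are invented for illustration. Events are plain `(begin_ms, end_ms, text)` tuples, and events strictly shorter than the threshold are dropped.

```python
def drop_short_events(events, min_duration_ms=500):
    """Keep only events that last at least min_duration_ms.

    events: iterable of (begin_ms, end_ms, text) tuples.
    """
    return [
        (begin, end, text)
        for begin, end, text in events
        if end - begin >= min_duration_ms
    ]


# Timings modelled on a 40 ms overlap between two captions.
events = [
    (2000, 4920, "first caption"),
    (4920, 4960, "first caption + second caption"),
    (4960, 6600, "second caption"),
]
kept = drop_short_events(events)
```

With the default threshold only the 40 ms event in the middle is removed, which leaves a 40 ms gap between the two remaining events. A file with longer overlaps would need a larger threshold, which is the concern raised above.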

Pierre-Anthony Lemieux · Answer 17 · Fri Aug 19 2022 00:10:00 GMT+0800 (China Standard Time)

Any significant overlap is probably to convey roll-up. See the "roll-up" demo below (which uses IMSC):

https://www.sandflow.com/public/caption-styles-demo/index.html

Fonic · Answer 18 · Fri Aug 19 2022 01:51:06 GMT+0800 (China Standard Time)

Any significant overlap is probably to convey roll-up. See the "roll-up" demo below (which uses IMSC):

https://www.sandflow.com/public/caption-styles-demo/index.html

Ok, got you. So that's what 'roll-up' means.

Since I have several of these legacy TTML subtitles to deal with, I created a Bash script to preprocess them for ttconv. Eliminating duplicates by adjusting the timestamps works fine; there are no noticeable issues during playback.