asticode / go-astisub

Manipulate subtitles in GO (.srt, .ssa/.ass, .stl, .ttml, .vtt (webvtt), teletext, etc.)

What to expect to be supported by ttml parser?

shlompy opened this issue · comments

Hi.
I see there are different ttml structures which the parser doesn't seem to support.
For example, the parser expects the subtitle "items" in the body to be in p tags, with the styles and region attributes.
I came across this TTML, where the body contains a div that carries the region, and within this div all the related subtitles of that region, without any region attribute of their own:

<tt xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns="http://www.w3.org/ns/ttml"
    xmlns:tts="http://www.w3.org/ns/ttml#styling" xmlns:ttm="http://www.w3.org/ns/ttml#metadata"
    xmlns:ebuttm="urn:ebu:metadata" xmlns:ebutts="urn:ebu:style"
    xml:lang="eng" xml:space="default"
    ttp:timeBase="media"
    ttp:cellResolution="32 15">
  <head>
    <metadata>
      <ttm:title>DASH-IF Live Simulator</ttm:title>
      <ebuttm:documentMetadata>
        <ebuttm:conformsToStandard>urn:ebu:distribution:2014-01</ebuttm:conformsToStandard>
        <ebuttm:authoredFrameRate>30</ebuttm:authoredFrameRate>
      </ebuttm:documentMetadata>
    </metadata>
    <styling>
      <style xml:id="s0" tts:fontStyle="normal" tts:fontFamily="sansSerif" tts:fontSize="100%" tts:lineHeight="normal"
      tts:color="#FFFFFF" tts:wrapOption="noWrap" tts:textAlign="center"/>
      <style xml:id="s1" tts:color="#00FF00" tts:backgroundColor="#000000" ebutts:linePadding="0.5c"/>
      <style xml:id="s2" tts:color="#ff0000" tts:backgroundColor="#000000" ebutts:linePadding="0.5c"/>
    </styling>
    <layout>
      <region xml:id="r0" tts:origin="15% 80%" tts:extent="70% 20%" tts:overflow="visible" tts:displayAlign="before"/>
      <region xml:id="r1" tts:origin="15% 20%" tts:extent="70% 20%" tts:overflow="visible" tts:displayAlign="before"/>
    </layout>
  </head>
  <body style="s0">
    <div region="r0">
      
      <p xml:id="sub16000" begin="00:00:16.000" end="00:00:17.000" >
        <span style="s1">eng : 00:00:16.000</span>
      </p>
      
      <p xml:id="sub17000" begin="00:00:17.000" end="00:00:18.000" >
        <span style="s1">eng : 00:00:17.000</span>
      </p>
      
    </div>
  </body>
</tt>

In some other cases, the <p> elements might have style attributes, but these elements are also children of a div element which has its own styles that should be inherited. This package doesn't seem to look at any div inside the body:

   <body ttm:role="caption">
      <div style="autogenFontStyle_n_150_120 S1 StyleFillLineGapTrue fontFamilyStyle">
         <p begin="00:00:01.000" end="00:00:02.000" region="R6" style="S4" ttm:role="sound" xml:id="C1">
            <span style="S3">FIRST SUBTITLE, WHA!!!!!! C1</span>
         </p>
         <p begin="00:00:03.000" end="00:00:04.000" region="R6" style="S4" ttm:role="sound" xml:id="C2">
            <span style="S3">PHONE RINGS C2</span>
         </p>
         <p begin="00:00:05.000" end="00:00:06.000" region="R6" style="S4" ttm:role="sound" xml:id="C3">
            <span style="S3">PHONE RINGS C3</span>
         </p>

In another TTML file I have, the structure is as follows, and it seems to be parsed fine (except for the <br/>, for which I raised another issue).
The style attribute is directly on the <p> element, and the region as well:

<body>
  <div>
  <p style="style.center.outline" begin="00:22:31.000" region="r0" xml:id="p264" end="00:22:33.720" ><span tts:direction="ltr">Got you!<br/>Steady on.</span></p>
  </div></body>

Are there multiple known types of TTML formats or versions so I can know which are supported by this package?
Parsing such structures seems to be a big problem, or at least not something which can be achieved by mapping the tags into Go structs...

These TTML subs are so frustrating compared to other subtitle formats, and there is no clear documentation about the format and all the different structures it may have...

To be honest, you shouldn't have to worry about which format can be parsed by this lib, as it should be able to parse all formats. Unfortunately it doesn't right now. The only common pattern between your 3 examples is the structure body > div > p.

Right now here's what is missing:

  • parse body attributes
  • parse first div attributes
  • for each p, really parse the exact xml + attributes inside it
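
A minimal sketch of what those three items could look like with the standard `encoding/xml` package. The type names (`ttmlDoc`, `ttmlBody`, `ttmlDiv`, `ttmlP`) and the `parseBody` helper are hypothetical and are not go-astisub's own types; the `,innerxml` field is one way to keep the exact markup inside each p for a later pass.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Hypothetical types for illustration only; not go-astisub's API.
type ttmlDoc struct {
	Body ttmlBody `xml:"body"`
}

// ttmlBody keeps the attributes found on <body>.
type ttmlBody struct {
	Style  string    `xml:"style,attr"`
	Region string    `xml:"region,attr"`
	Divs   []ttmlDiv `xml:"div"`
}

// ttmlDiv keeps the attributes found on a <div> directly under <body>.
type ttmlDiv struct {
	Style  string  `xml:"style,attr"`
	Region string  `xml:"region,attr"`
	Ps     []ttmlP `xml:"p"`
}

// ttmlP keeps the attributes of a <p> plus its raw inner markup.
type ttmlP struct {
	Begin  string `xml:"begin,attr"`
	End    string `xml:"end,attr"`
	Style  string `xml:"style,attr"`
	Region string `xml:"region,attr"`
	Inner  string `xml:",innerxml"` // spans, <br/> and text, verbatim
}

// parseBody decodes a TTML document and returns its body.
func parseBody(data []byte) (ttmlBody, error) {
	var doc ttmlDoc
	if err := xml.Unmarshal(data, &doc); err != nil {
		return ttmlBody{}, err
	}
	return doc.Body, nil
}

// Trimmed-down version of the first example in this thread.
const sample = `<tt xmlns="http://www.w3.org/ns/ttml">
  <body style="s0">
    <div region="r0">
      <p begin="00:00:16.000" end="00:00:17.000"><span style="s1">eng : 00:00:16.000</span></p>
    </div>
  </body>
</tt>`

func main() {
	body, err := parseBody([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("body style:", body.Style)
	for _, d := range body.Divs {
		for _, p := range d.Ps {
			fmt.Printf("div region=%q p begin=%q inner=%s\n", d.Region, p.Begin, p.Inner)
		}
	}
}
```

Struct tags without a namespace match on the local element name, so the default `xmlns` on `tt` does not get in the way. This only covers body > div > p; deeper nesting needs a recursive type or a different traversal.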

I'm welcoming PRs.

Cheers

Thanks..
I just wonder if the subtitles can also have different structures which I haven't faced yet (e.g. body->div->div->div->p)..
Also, correct me if I'm wrong, but it seems to me styles are not inherited currently: a subtitle item ends up holding only its inline style and the ID of the style it references, without traversing up the nodes to inherit all parent styles...

Not sure if I'll succeed with adding such changes in a PR. I also need to support TTML with subtitles which have a somewhat different body, and that probably doesn't suit this package, as it is used for conversion to other formats which do not support images...

I still wonder if it will be easier for me to enhance this package or do something from scratch, probably with XPath, which might make node traversal easier than mapping to structs with xml tags...

If I were you, I would enhance this package but I guess I'm biased 😄 Using xpath instead of standard go xml unmarshaling could be a good idea and, if implemented correctly, could handle all types of bodies.

If you decide to enhance this package, let me know, I'll try to point you in the right direction 👍

Thanks Renard
I think I'll pass on enhancing this package. It might be too expensive for me to refactor it, and probably too expensive for me to use it, as I only need the parsing part for converting to a proprietary format, which should run on the server side serving 100k+ clients... I need this to be slimmed down and more efficient... But I should definitely use it as a reference to understand how to parse the text properly...

no worries, good luck! ❤️