sandflow / ttconv

Subtitle conversion. Converts STL, SRT, TTML and SCC into TTML, WebVTT and SRT.

SCC > SRT error, Domesday LD Capture: AttributeError: 'NoneType' object has no attribute 'append_text'

rktcc opened this issue · comments

commented
ttconv 1.0.7 (pip install --pre ttconv)
python 3.11

head.scc.txt (rename from .txt back to .scc; GitHub didn't accept the .scc extension.)

This is an SCC file extracted from a LaserDisc film captured using a Domesday Duplicator. Additionally, this is a Japanese language film.

This issue has occurred in the past with other Domesday captures, but I had been using https://github.com/atsampson/ttconv until it stopped working, and I can't sort out what changes that fork made before its updates were merged into 1.0.7.
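For reference, here is roughly the call that fails, reduced from the CLI traceback below to the reader API it bottoms out in (just a sketch; I actually ran the `tt` command line, and I'm assuming `to_model()` can be called with the file contents alone):

```python
# Minimal sketch of the failing call, based on the traceback below.
# Assumes ttconv 1.0.7; "head.scc" is the attached sample renamed back from .txt.
from ttconv.scc.reader import to_model

with open("head.scc", encoding="utf-8") as scc_file:
    doc = to_model(scc_file.read())
```

The actual conversion stops partway through with: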

Unsupported SCC word: 0x7c                                                  
Unsupported SCC word: 0x7c                                                  
Unsupported SCC word: 0x107c                                                
Reading: |███████-------------------------------------------|  15% Complete
Traceback (most recent call last):
  File "/home/pip/.local/venv/ttconv/bin/tt", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/tt.py", line 439, in main
    args.func(args)
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/tt.py", line 320, in convert
    model = scc_reader.to_model(file_as_str, reader_config, progress_callback_read)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/scc/reader.py", line 621, in to_model
    context.process_line(scc_line)
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/scc/reader.py", line 556, in process_line
    self.process_text(word, line.time_code)
  File "/home/pip/.local/venv/ttconv/lib/python3.11/site-packages/ttconv/scc/reader.py", line 460, in process_text
    self.buffered_caption.append_text(word)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'append_text'

I wonder whether the capture has errors or is flawed and that is what's causing the "Unsupported SCC word" messages, or whether it's just that the Japanese character set is not supported?
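If I'm reading the traceback right, the words flagged as unsupported get skipped, so nothing ever opens a caption buffer, and the next printable word is then appended to None. Here is a purely hypothetical reconstruction of that situation, mirroring the names in the traceback (this is not ttconv's actual code):

```python
# Hypothetical reconstruction (NOT ttconv's code) of the failure mode:
# if every word that would open a caption is rejected as unsupported,
# buffered_caption is still None when a text word finally arrives.
from typing import List, Optional


class BufferedCaption:
    def __init__(self) -> None:
        self.texts: List[str] = []

    def append_text(self, text: str) -> None:
        self.texts.append(text)


class SccContextSketch:
    def __init__(self) -> None:
        # In the real reader this would be opened by control codes (PACs etc.)
        self.buffered_caption: Optional[BufferedCaption] = None

    def process_text(self, word: str) -> None:
        # ttconv 1.0.7 calls append_text() unconditionally here, which is
        # where the AttributeError comes from when no caption is open.
        self.buffered_caption.append_text(word)


SccContextSketch().process_text("text")  # AttributeError, as in the log above
```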

thank you

@valnoel Can you look at this issue in the context of your work improving the SCC reader?

commented

It was brought to my attention that the Japanese character sets are not present in the SCC codes file, so I imagine this might be a difficult task to achieve?

It was also noted that the content in the example is "two-byte Unicode"; not sure if that's helpful. Just passing on some information from the Domesday group conversation.

Thanks to the maintainers for assistance!

The SCC reader does not currently support Japanese characters, which do not appear in the CEA-608 specification.

It seems an extension was once submitted to the specification, but I don't have any more information about it...

Otherwise, it seems CEA-708 introduces Unicode character support, which allows the display of Japanese and other languages.

@palemieux What do you think?

Ok will look at this next week.

@rktcc Can you provide a link to the forum discussion thread? I could not find any specification for carrying arbitrary unicode characters in SCC.

Hi, I am sorry for the delay.

Here is the discussion on ttconv missing Japanese character sets:

https://discord.com/channels/665557267189334046/676084498097766451/1140876443719577650

I think it's not the encoding and decoding that's wrong; support for the extended EIA-608 character sets needs to be added to ttconv, along with a way to detect them.
https://github.com/sandflow/ttconv/tree/master/src/main/python/ttconv/scc/codes has no Japanese character support at all.
https://en.m.wikipedia.org/wiki/EIA-608
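For what it's worth, here's a rough sketch (mine, not ttconv code) of how a raw two-byte SCC word breaks down under basic CEA-608. Presumably anything that doesn't land in the tables under scc/codes is what gets reported as "Unsupported SCC word", and two-byte Japanese extensions would fall outside those tables entirely:

```python
# Rough classification of a two-byte SCC word under basic CEA-608
# (my own sketch; ranges follow CEA-608, not ttconv's internal tables).
def classify_scc_word(word: int) -> str:
    b1 = (word >> 8) & 0x7F  # strip the odd-parity bit from each byte
    b2 = word & 0x7F
    if b1 == 0x00 and b2 == 0x00:
        return "null padding"
    if 0x10 <= b1 <= 0x1F:
        return "control-code range (PACs, mid-row codes, special/extended characters)"
    if 0x20 <= b1 <= 0x7F:
        return "standard character pair"
    return "outside the basic CEA-608 tables"


# Words like the ones flagged in the log above (reading 0x7c as the word 0x007c):
for w in (0x007C, 0x107C):
    print(f"{w:#06x}: {classify_scc_word(w)}")
```

0x007c in particular doesn't fit any basic pattern, which would be consistent with the payload using a character-set extension ttconv doesn't know about.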

One thought is that the Norpak Non-Western addition may be what's needed...

https://discord.com/channels/665557267189334046/676084498097766451/1141486766579265576

Wikipedia says there's non-Western character support from Norpak (https://en.m.wikipedia.org/wiki/EIA-608, under "Non-Western Norpak Character Sets").

Someone mentions a reference to CEA-608 section 6.4, Table 4, for Asian languages; however, only the PRC and (South) Korea are covered.

https://discord.com/channels/665557267189334046/676084498097766451/1141484827619635240

Referencing 6.4 Character Sets (Normative), 6.4.1 Standard, CEA-608
https://media.discordapp.net/attachments/676084498097766451/1141486499452432464/image.png

There's also a thought that it could be CC/Teletext; however, since other subtitle content has been extracted from LaserDiscs using the Domesday and converted from SCC to plain-text SRT, I would guess the Japanese SCC data is the same, with just the character sets missing from ttconv.

Did Japan use CC? ISTR that they had a teletext-like system for magazine-type data, which may also have worked for subtitles/closed captions. (I know the Wikipedia article mentions that two-byte support was added to the spec, but could that be like 50 Hz being added to ATSC 1.0, an attempt to capture markets that never materialized?) https://en.wikipedia.org/wiki/JTES was the teletext system (CCIR System D?).

I hope this is helpful in some capacity, either in closing the ticket due to lack of project support or in adding some kind of additional processing.

If more info is needed I can look further. The Discord is free to join; sadly the discussion is not hosted on an actual forum. Alternatively, the general chat can be joined from IRC, on channel #domesday86 on the https://libera.chat network; you would not need to sign up for Discord in that case, as a bot relays messages both ways.

Discord Invite: https://github.com/happycube/ld-decode#documentation

Thank you again

I have joined the discord server.

In the meantime, I have spent some quality time staring at the sample file, and it does not look like CEA-608 at all, e.g.:

[image: excerpt from the sample file]

Is that noise/errors from the laserdisc capture? Could it be something totally different like bitmaps?
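For anyone who wants to stare at it the same way, a throwaway script along these lines (a standalone sketch, not part of ttconv) prints each caption line's words with the parity bit stripped, which makes it easier to judge whether the payload looks like CEA-608 text, control codes, or something else:

```python
# Standalone sketch (not part of ttconv) for eyeballing an SCC file.
# Assumes the usual Scenarist layout: "HH:MM:SS:FF<TAB>word word ...".
import re
import sys

SCC_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}[:;]\d{2})\t(.*)$")


def dump_scc(path: str) -> None:
    with open(path, encoding="utf-8") as scc_file:
        for line in scc_file:
            match = SCC_LINE.match(line.rstrip("\n"))
            if not match:
                continue  # "Scenarist_SCC V1.0" header or blank line
            time_code, payload = match.groups()
            cleaned = []
            for word in payload.split():
                if len(word) != 4:
                    continue  # skip anything that is not a clean two-byte word
                b1 = int(word[0:2], 16) & 0x7F  # strip the odd-parity bit
                b2 = int(word[2:4], 16) & 0x7F
                cleaned.append(f"{b1:02x}{b2:02x}")
            print(time_code, " ".join(cleaned))


if __name__ == "__main__":
    dump_scc(sys.argv[1])
```

Run it as `python dump_scc.py head.scc`; in genuine CEA-608 content, most cleaned words should either be printable pairs in the 0x20-0x7f range or start with a 0x10-0x1f control byte.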