Better captioning

veltman opened this issue 8 years ago

Have a mostly-working branch that allows for entering and positioning multiple captions, but the manual entry/interface is a real drag, especially for a long video. Worth exploring some improvements.

Forced aligners?

Using a forced aligner like Gentle to take a bulk transcript and automatically time it to the audio would help - then you could type in the whole thing (or paste from a transcript) and it could automatically break it into chunks.

Pros: Much faster if you have a full transcript already (paste the whole thing rather than pasting line-by-line and tweaking the timing).
Cons: Not much faster if you don't have a transcript. A lot more code complexity (all the OSS aligners seem to be Python). Would probably still need to tweak the captions into sensible breaks (e.g. avoid orphan words).
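A rough sketch of the "break it into chunks" step, in Python since that is what the aligners are written in. It assumes the aligner has already produced one entry per word with start and end times in milliseconds; the field names (word, start_ms, end_ms) and the chunk_words helper are made up for illustration and are not Gentle's actual output format. The last step folds a too-short final caption into the previous one, which is one crude way to avoid orphan words.

```python
from typing import Dict, List


def chunk_words(words: List[Dict], max_chars: int = 40, min_words: int = 2) -> List[Dict]:
    """Group timed words into caption chunks of roughly max_chars characters.

    Each input word is a dict with hypothetical keys: word, start_ms, end_ms.
    """
    chunks: List[List[Dict]] = []
    current: List[Dict] = []
    for w in words:
        candidate = " ".join(x["word"] for x in current + [w])
        if current and len(candidate) > max_chars:
            # Adding this word would overflow the caption, so start a new one.
            chunks.append(current)
            current = [w]
        else:
            current.append(w)
    if current:
        if chunks and len(current) < min_words:
            # Orphan handling: merge a very short final chunk into the
            # previous caption, even if that slightly exceeds max_chars.
            chunks[-1].extend(current)
        else:
            chunks.append(current)
    return [
        {
            "text": " ".join(x["word"] for x in c),
            "start_ms": c[0]["start_ms"],
            "end_ms": c[-1]["end_ms"],
        }
        for c in chunks
    ]


# Made-up timings: one word every 500 ms, each lasting 400 ms.
sentence = "the quick brown fox jumps over the lazy dog".split()
timed = [
    {"word": w, "start_ms": i * 500, "end_ms": i * 500 + 400}
    for i, w in enumerate(sentence)
]
captions = chunk_words(timed, max_chars=20)
```

A real version would probably also want to prefer breaks at punctuation, which this sketch ignores.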

Auto transcribe

Use some sort of speech-to-text to take a first pass at transcribing the audio. In-browser options include PocketSphinx and the Web Speech API in certain browsers. Server-side options include normal Sphinx or the Watson API.

Pros: Great when it works.
Cons: Doesn't always work, especially for non-English languages or clips with music, background noise, etc. Still doesn't work out timing. If it's server-side, would require a second round-trip before the form submission. Could take a long time for long pieces of audio.

Parse timestamped transcripts?

Could allow people to upload an SRT or some other timecoded transcript format in the editor. The parsing wouldn't be that hard, but it's unclear how often audio orgs use these.
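To give a sense of how small the parsing is, here is a minimal sketch in Python. It assumes well-formed SRT: blocks separated by blank lines, each with an optional index line, a "HH:MM:SS,mmm --> HH:MM:SS,mmm" timing line, and one or more text lines. The parse_srt name and the output keys are illustrative only, and it does nothing about styling tags or malformed files.

```python
import re
from typing import Dict, List

# Matches "00:00:01,000 --> 00:00:03,500" (a dot before the millis is tolerated).
CUE_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _to_ms(h: str, m: str, s: str, ms: str) -> int:
    """Convert timestamp parts to integer milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)


def parse_srt(text: str) -> List[Dict]:
    """Parse SRT text into a list of cues with start_ms, end_ms and text."""
    cues: List[Dict] = []
    # Normalise line endings, then split into blocks on blank lines.
    blocks = re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip())
    for block in blocks:
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = CUE_TIMING.search(line)
            if match:
                parts = match.groups()
                cues.append(
                    {
                        "start_ms": _to_ms(*parts[:4]),
                        "end_ms": _to_ms(*parts[4:]),
                        # Everything after the timing line is caption text.
                        "text": "\n".join(lines[i + 1:]).strip(),
                    }
                )
                break
    return cues


sample = (
    "1\n00:00:01,000 --> 00:00:03,500\nHello there.\n\n"
    "2\n00:00:04,000 --> 00:00:06,250\nSecond caption\nwith two lines.\n"
)
cues = parse_srt(sample)
```

Blocks without a recognisable timing line are silently skipped here; an editor would more likely want to report them to the user.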

Noah Veltman · Answer 1 · Sun Jul 31 2016 04:05:36 GMT+0800 (China Standard Time)

Looks like the Web Speech API doesn't provide any way to connect it to a non-mic source, but PocketSphinx does (with some fiddling).

Andrew Kuklewicz · Answer 2 · Mon Aug 01 2016 23:48:07 GMT+0800 (China Standard Time)

You could also use other APIs like Speechmatics (https://speechmatics.com/) or Google Cloud Speech (https://cloud.google.com/speech/)?

Noah Veltman · Answer 3 · Mon Aug 01 2016 23:51:31 GMT+0800 (China Standard Time)

Yup, true - though I'm a little reluctant to rely on an external API rather than something that can be bundled (ditto Watson).

Pietro · Answer 4 · Thu Dec 01 2016 03:00:28 GMT+0800 (China Standard Time)

Hey @veltman,
Gentle could be modified to generate a transcription when the text is not available. This already works in the REST API: as in the curl example, if you don't pass the text file, it returns a transcription. It doesn't work in the Python terminal command, though. The code would need to be modified accordingly, which is something I am looking into.

I also played around with PocketSphinx, packaging it as a Node module: https://github.com/OpenNewsLabs/offline_speech_to_text
I extracted it from the Videogrep Electron app.

Ian McDonald · Answer 5 · Mon Dec 19 2016 17:45:13 GMT+0800 (China Standard Time)

Considering that the effective maximum on social media is 30s, I think that expecting users to supply a transcript is absolutely fine.

It doesn't scale to generating complete videos from long-form shows, but I think that's acceptable - it's still a big benefit for most uses.

I'm a one-person band working on my own community/radio niche narrative history series, and I've used SRT, using a free online manual transcriber (called, originally enough, "Transcriber"). Though I'm about as unrepresentative as you can possibly get.

Pietro · Answer 6 · Wed Jan 11 2017 00:57:36 GMT+0800 (China Standard Time)

For the SRT option, I've written an SRT parser composer that is also on npm.

It can be used to parse the SRT into word-accurate JSON (the original code to make it word-accurate is from the Popcorn.js SRT parsing module, also on GitHub). With that, it is possible to make a "hyper transcript" where the user can make word-accurate selections. I've done something similar in quickQuote (now refactored in Node and in autoEdit), inspired by the Hyperaudio project.
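For anyone wondering what "word accurate" can mean when the SRT only has one time range per caption, here is one simple approximation, sketched in Python. It spreads a cue's duration across its words in proportion to their length. This is an assumption for illustration, not necessarily what the Popcorn.js code or the npm module does, and the cue keys (start_ms, end_ms, text) and the spread_words name are hypothetical.

```python
from typing import Dict, List


def spread_words(cue: Dict) -> List[Dict]:
    """Estimate per-word timings inside one cue by linear interpolation.

    Time is divided in proportion to each word's character count, so the
    result is a guess, not a true alignment against the audio.
    """
    words = cue["text"].split()
    if not words:
        return []
    total_chars = sum(len(w) for w in words)
    duration = cue["end_ms"] - cue["start_ms"]
    timed: List[Dict] = []
    chars_so_far = 0
    for w in words:
        start = cue["start_ms"] + duration * chars_so_far // total_chars
        chars_so_far += len(w)
        end = cue["start_ms"] + duration * chars_so_far // total_chars
        timed.append({"word": w, "start_ms": start, "end_ms": end})
    return timed


word_times = spread_words({"start_ms": 0, "end_ms": 1000, "text": "ab cd"})
```

The estimates are only as good as the assumption that speech is evenly paced within a caption, so a forced aligner would still be the better source for real word-level times.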

Alberto Pettarin · Answer 7 · Mon Mar 06 2017 18:44:23 GMT+0800 (China Standard Time)

Shameless plug, I hope you find it informative.

I maintain a Python/C forced aligner called aeneas ( http://www.readbeyond.it/aeneas/ and https://github.com/readbeyond/aeneas/ ). Unlike Gentle and basically all other forced aligners out there, its approach is not based on speech recognition but on an older technique known as Dynamic Time Warping. It works decently well (and much faster) if you align text at sentence/phrase level, but it is worse at word level. Its real time factor (ratio between processing time and real audio length) is between 0.005 and 0.02, depending on the parameters and the machine's CPU, since all the computational parts are written in C.
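For readers who have not met Dynamic Time Warping, here is a textbook version in Python on toy 1-D sequences. It is only meant to show the idea of warping one sequence onto another; it is not the aeneas implementation, which is written in C and works on audio features rather than small lists of numbers. The dtw_path name is made up for this sketch, and both inputs are assumed to be non-empty.

```python
from typing import List, Tuple


def dtw_path(a: List[float], b: List[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the minimal DTW cost and the warping path between a and b.

    The path is a list of (index_in_a, index_in_b) pairs from start to end.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = abs(a[i - 1] - b[j - 1])
            cost[i][j] = dist + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# The second sequence "lingers" on the value 2, so index 1 of the first
# sequence gets matched to two positions in the second.
total_cost, path = dtw_path([1, 2, 3], [1, 2, 2, 3])
```

As for the real time factor quoted above: processing time is the factor multiplied by the audio length, so an hour of audio (3600 s) at 0.005 to 0.02 works out to roughly 18 to 72 seconds of processing.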

(In theory, one can port the core of aeneas to C, and from there to JS, via emscripten. It is a huge task, but it would enable decently fast alignment in JS land. Unfortunately, I have not had time/resources to do it.)

BTW, I maintain a list of forced aligners here: https://github.com/pettarin/forced-alignment-tools

Pietro · Answer 8 · Fri Sep 28 2018 21:32:59 GMT+0800 (China Standard Time)

In case anyone is still looking into this: it turns out that @martymcguire has done a write-up describing how he modified the BBC News Labs fork of Audiogram to work with the output of the Gentle forced aligner. See his repo here.