nypublicradio / audiogram

Turn audio into a shareable video.

Better captioning

veltman opened this issue · comments

Have a mostly-working branch that allows for entering and positioning multiple captions, but the manual entry/interface is a real drag, especially for a long video. Worth exploring some improvements.

Forced aligners?

Using a forced aligner like Gentle to take a bulk transcript and automatically time it to the audio would help - then you could type in the whole thing (or paste from a transcript) and it could automatically break it into chunks.

Pros: Much faster if you have a full transcript already (paste the whole thing rather than pasting line-by-line and tweaking the timing).
Cons: Not much faster if you don't have a transcript. A lot more code complexity (all the OSS aligners seem to be Python). Would probably still need to tweak the captions into sensible breaks (e.g. avoid orphan words).
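As a rough illustration of the "break it into chunks" step, here is a minimal Python sketch. It assumes an aligner has already produced one start/end time per word; the `Word` and `Caption` types, the `chunk_words` name, and the `max_chars`/`min_words` defaults are all hypothetical and not part of Gentle or audiogram. It packs words greedily up to a character limit, then pulls words back from the previous caption so the last one is not a single orphan word.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Caption:
    text: str
    start: float
    end: float


def chunk_words(words: List[Word], max_chars: int = 40, min_words: int = 2) -> List[Caption]:
    """Greedily pack timed words into captions of roughly max_chars characters."""
    groups: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        candidate = " ".join(w.text for w in current + [word])
        if current and len(candidate) > max_chars:
            # Adding this word would overflow the caption, so start a new one.
            groups.append(current)
            current = [word]
        else:
            current.append(word)
    if current:
        groups.append(current)

    # Orphan fix: if the final caption is too short, borrow trailing words
    # from the caption before it (max_chars is treated as a soft limit here).
    if len(groups) > 1:
        while len(groups[-1]) < min_words and len(groups[-2]) > min_words:
            groups[-1].insert(0, groups[-2].pop())

    return [
        Caption(" ".join(w.text for w in g), g[0].start, g[-1].end)
        for g in groups
    ]
```

Each caption takes its start time from its first word and its end time from its last, so manual tweaking would only be needed where the automatic breaks read badly.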

Auto transcribe

Use some sort of speech-to-text to take a first pass at transcribing the audio. In-browser options include PocketSphinx and the Web Speech API in certain browsers. Server-side options include normal Sphinx or the Watson API.

Pros: Great when it works.
Cons: Doesn't always work, especially for non-English languages or clips with music, background noise, etc. Still doesn't work out timing. If it's server-side, would require a second round-trip before the form submission. Could take a long time for long pieces of audio.

Parse timestamped transcripts?

Could allow people to upload an SRT or some other timecoded transcript format in the editor. The parsing wouldn't be that hard, but it's unclear how often audio orgs use these.
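To give a sense of how small the parsing is, here is a minimal Python sketch of an SRT reader. It assumes the common SRT layout (an optional counter line, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then one or more text lines, with blank lines between cues). The `Cue` type and `parse_srt` name are hypothetical, and it ignores styling tags and other format variations.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    index: int
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches a timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(source: str) -> List[Cue]:
    """Parse SRT text into a list of cues, renumbered from 1."""
    # Normalize line endings and drop a leading byte-order mark if present.
    text = source.replace("\r\n", "\n").replace("\r", "\n").lstrip("\ufeff").strip()
    cues: List[Cue] = []
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.split("\n")]
        timing_at = next(
            (i for i, line in enumerate(lines) if TIMING.search(line)), None
        )
        if timing_at is None:
            continue  # skip blocks without a timing line
        groups = TIMING.search(lines[timing_at]).groups()
        start = _seconds(*groups[:4])
        end = _seconds(*groups[4:])
        # Join multi-line cue text into a single caption string.
        body = " ".join(lines[timing_at + 1:])
        cues.append(Cue(len(cues) + 1, start, end, body))
    return cues
```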

Looks like the Web Speech API doesn't provide any way to connect it to a non-mic source, but PocketSphinx does (with some fiddling).

You could also use other APIs like Speechmatics (https://speechmatics.com/) or Google Cloud Speech (https://cloud.google.com/speech/)?

Yup, true - though I'm a little reluctant to rely on an external API rather than something that can be bundled (ditto Watson).

Hey @veltman,
Gentle could be modified to generate a transcription when the text is not available. This already works in the REST API: in the curl example, if you don't pass the text file, it returns a transcription. But it doesn't work in the Python terminal command. The code would need to be modified accordingly, which is something I am looking into.

I also played around with PocketSphinx, packaging it as a Node module (https://github.com/OpenNewsLabs/offline_speech_to_text).
I extracted it from the Videogrep Electron app.

Considering that the effective maximum on social media is 30s, I think that expecting users to supply a transcript is absolutely fine.

It doesn't scale to generating complete videos from long-form shows, but I think that's acceptable - it's still a big benefit for most uses.

I'm a one-person band working on my own community/radio niche narrative history series, and I've used SRT, using a free online manual transcriber (called, originally enough, "Transcriber"). Though I'm about as unrepresentative as you can possibly get.

For the SRT option, I've written an SRT parser composer that is also on npm.

It can be used to parse the SRT into word-accurate JSON (the original code to make it word accurate is from the popcorn js SRT parser module, also on GitHub). With that, it is possible to make a "hyper transcript" where the user can make word-accurate selections. I've done something similar in quickQuote (now refactored in Node and in autoEdit), inspired by the Hyperaudio project.

Shameless plug, I hope you find it informative.

I maintain a Python/C forced aligner called aeneas (http://www.readbeyond.it/aeneas/ and https://github.com/readbeyond/aeneas/). Unlike Gentle and basically all other forced aligners out there, its approach is not based on speech recognition but on an older technique known as Dynamic Time Warping. It works decently well (and much faster) if you align text at sentence/phrase level, but it is worse at word level. Its real-time factor (the ratio between processing time and real audio length) is between 0.005 and 0.02, depending on the parameters and machine CPU, since all the computational parts are written in C.
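For readers unfamiliar with Dynamic Time Warping, here is a minimal textbook sketch in Python. It is not taken from aeneas: it aligns two toy one-dimensional sequences using absolute difference as the local cost, whereas a real aligner would compare audio feature frames and use a far more optimized implementation. The `dtw_path` name is hypothetical.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the DTW distance between a and b and one optimal warping path."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always stepping to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance in both sequences
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path
```

The returned path pairs each index in one sequence with the index or indices it matches in the other, which is the kind of mapping that lets text fragments be assigned time ranges in the audio.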

(In theory, one can port the core of aeneas to C, and from there to JS, via emscripten. It is a huge task, but it would enable decently fast alignment in JS land. Unfortunately, I have not had time/resources to do it.)

BTW, I maintain a list of forced aligners here: https://github.com/pettarin/forced-alignment-tools

In case anyone is still looking into this: it turns out that @martymcguire did a write-up describing how he modified the BBC News Labs fork of Audiogram to work with Gentle Speech To Text Forced Aligner output; see his repo here.