smacke / ffsubsync

Automagically synchronize subtitles with video.


Awesome idea, room for improvement

pannal opened this issue

This is extremely interesting.

I'm the author of Sub-Zero for Plex and a contributor to Bazarr, which take a different approach to "syncing" subtitles to media: matching the release and media metadata to user-generated content.

The subsync approach is one I wanted to do for ages.

Sub-Zero (and, at its core, subliminal) relies on proper offset alignment based on user-generated content - this works 95% of the time, as subtitles are usually created for a specific release of a media file.

Being able to sync a subtitle based on its start point is awesome, but it's at least as error-prone as matching metadata and relying on the subtitle author not to mess up.

There are a couple of factors that would need to be addressed before subsync could be integrated into one of the existing solutions (just off the top of my head):

  • hearing impaired tags mess up the detection
  • commercial breaks and cuts distort the timing of subtitles even after the initial sub entry has been synced correctly
  • multiline subtitle entries can mess up the entry point of subsync and subsequent entries

I massively applaud the effort and I hope this can be extended to do a lot more and be a lot more accurate.

Thank you very much, and I am 100% in agreement. There are a few limitations to the current approach that restrict its applicability. Of the three you listed, I believe the 2nd bullet is the most problematic -- different cuts, or breaks anywhere except for the beginning/end of the video are fundamentally impossible to reconcile with the current strategy. Another situation where the current approach will definitely barf is when the video and subtitles move at different FPS. Finally, based on personal experience, sometimes (but certainly not always) the final alignment is still off by ~1s -- it would be great to diagnose the issue and go the last mile to get the error down to tens of milliseconds.

Because of the way the implementation works, I actually disagree that the 1st and 3rd bullets (presence of hearing-impaired tags, multiline subtitle entries) will create significant issues, but the academic in me recognizes the need to back up that claim with empirical results.

I wasn't aware of Sub-Zero or Bazarr before (or Plex even). All tools look awesome. I really like the idea of circumventing subtitle issues by automatically grabbing the right file for the currently-playing video, and I bet this approach could be used in conjunction with subsync's alignment scoring algorithm.

I'll leave this open as a feature enhancement. I see at least 2 or 3 different directions presenting themselves, so hopefully this can be a good spot for others to weigh in with comments / insights.

Thank you all for working on this subject. It's been annoying hundreds of thousands of people (and it looks like quite a fun field, too).

I have an idea about the "commercial cuts" problem that you have surely already thought of and that has probably already been tried; it's probably a common pattern:

Do an iteration over the max value of the convolution from the beginning up to some point, each time adding e.g. 5% of the total time of the video, storing each result (the displacement which maximizes the match and the value of that max).
Once we have the whole discrete series, a kind of cumulative version of the original function, we could try to detect when the max begins to drop because of the appearance of a commercial cut.

That's the initial idea.

I imagine something could then be done like running the same scan again, but with a version of the subtitles where, starting from the point where we detected the beginning of the commercial cut (the drop in the cumulative max), we have displaced them forward by a certain amount.
Then we search those new results until we find which displacement of the subtitles at the commercial cut results in the least drop in the resulting cumulative data (maybe approximated by just maximizing its area).

The same for each other "extraneous" cut in the video.

All of this with optimizations (e.g. we know the cut won't last 20 minutes, so we don't take anything farther than that into account after each possible commercial cut detected).
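
A rough sketch of how this prefix scan might look, assuming the video and subtitles have already been reduced to binary speech/no-speech masks sampled at a common frame rate (similar in spirit to what ffsubsync computes internally); the function names, the FFT-based correlation, and the normalization by prefix length are my own assumptions, not part of any existing implementation:

```python
import numpy as np

def best_offset_and_score(video_speech, sub_speech):
    """Cross-correlate two binary speech masks via FFT; return the shift
    that maximizes the overlap and the value of that maximum.
    Shifts past len(video_speech) wrap around (circular correlation)."""
    n = len(video_speech) + len(sub_speech)
    fft_len = 1 << (n - 1).bit_length()  # next power of two, to limit wrap-around artifacts
    corr = np.fft.irfft(
        np.fft.rfft(video_speech, fft_len) * np.conj(np.fft.rfft(sub_speech, fft_len)),
        fft_len,
    )
    best = int(np.argmax(corr))
    return best, float(corr[best])

def prefix_scan(video_speech, sub_speech, step_frac=0.05):
    """Score the alignment on growing prefixes of the video (5% at a time),
    normalizing by prefix length. A drop between consecutive prefixes hints
    that a commercial break / extra cut starts inside the newly added chunk."""
    total = len(video_speech)
    results = []
    for frac in np.arange(step_frac, 1.0 + 1e-9, step_frac):
        end = int(total * frac)
        offset, score = best_offset_and_score(video_speech[:end], sub_speech[:end])
        results.append((end, offset, score / end))
    return results
```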

(Thanks to aabeshou from Hacker News, where I read his explanation of the FFT approach.)

@rmorenobello I've been thinking along similar lines. I think there might be a way to avoid trying all strides of e.g. 5%, but generally I'm optimistic that an approach in this spirit should succeed.

I think a prerequisite to starting on this is real data, or at least a way to manufacture videos with 1, 2, etc. breaks. The synthetic route is probably easier and shouldn't be too hard. When I have time, I'll generate train / test sets -- try to get the algorithm working on the training set, and once it works, determine whether the functionality carries over correctly to the test data.
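
One cheap way to manufacture such data without editing actual video files might be to work directly on the speech masks: take a mask extracted from a correctly synced pair and splice runs of silence into the "video" side at known positions. A minimal sketch, assuming 0/1 numpy masks; the frame size and array layout are assumptions, not ffsubsync internals:

```python
import numpy as np

def insert_breaks(speech_mask, n_breaks=1, break_len=1500, seed=0):
    """Simulate commercial breaks by splicing runs of silence into a binary
    speech mask (e.g. with 10 ms frames, 1500 frames is roughly 15 s).
    Returns the corrupted mask plus the ground-truth (position, length) pairs
    so an alignment algorithm can be scored against them.
    Positions are indices into the original, uncorrupted timeline."""
    rng = np.random.default_rng(seed)
    positions = sorted(rng.integers(0, len(speech_mask), size=n_breaks))
    out = speech_mask.copy()
    truth = []
    for pos in reversed(positions):  # insert from the end so earlier indices stay valid
        out = np.concatenate([out[:pos], np.zeros(break_len, dtype=out.dtype), out[pos:]])
        truth.append((int(pos), break_len))
    return out, list(reversed(truth))
```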

Anyway, the case of a movie having commercials must be quite rare, so maybe it's not worth it.
But it could be useful for inserted cuts in director's cut editions :)

Another situation where the current approach will definitely barf is when the video and subtitles move at different FPS.

This may be easier to solve in most cases than one might initially think, since the FPS mismatch rate is very probably one of only 6 particular values, namely 1001/1000, 25/24, 25/23.976 or one of their reciprocals (see https://github.com/oseiskar/autosubsync#speed-correction). You could try just finding the optimal offset for each of these 6 speeds and picking the speed & offset combination with the best score.
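
A minimal sketch of this speed search, assuming binary speech masks and some externally supplied scoring function (e.g. an FFT cross-correlation returning the best offset and its score); all names here are hypothetical:

```python
import numpy as np

# The six mismatch ratios mentioned above, plus 1.0 as the no-mismatch baseline.
CANDIDATE_SPEEDS = [1.0, 1001/1000, 1000/1001, 25/24, 24/25, 25/23.976, 23.976/25]

def stretch_mask(sub_speech, speed):
    """Resample a binary speech mask as if the subtitle timeline ran `speed` times faster."""
    new_len = int(round(len(sub_speech) / speed))
    idx = np.minimum((np.arange(new_len) * speed).astype(int), len(sub_speech) - 1)
    return sub_speech[idx]

def best_speed_and_offset(video_speech, sub_speech, score_fn):
    """Try each candidate speed, find the best offset for it with score_fn,
    and keep the (score, speed, offset) combination with the best score."""
    best = None
    for speed in CANDIDATE_SPEEDS:
        offset, score = score_fn(video_speech, stretch_mask(sub_speech, speed))
        if best is None or score > best[0]:
            best = (score, speed, offset)
    return best
```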

Great project by the way! Especially using the off-the-shelf VAD instead of trying to roll your own (like me and the others earlier) is quite elegant.

Concerning problem 2 with "commercial breaks": I wrote a tool, funnily enough also named subsync, a while back. It performs a recursive matching algorithm, splitting the audio/subtitle track into two halves each time and performing smaller and smaller adjustments. In my experiments it worked quite well for this problem. This however increases the complexity and therefore the running time. By only allowing a shift of 5-10% of the audio length (rather than every possible offset) I reduced the running time quite a bit.

Have there been any improvements to the codebase regarding commercial breaks and matching granularity, recently? Thank you!

No improvements to the codebase at the moment, but I have some thoughts on an algorithm. Before I test the idea out, I need to put together a benchmark -- see #31

@tympanix It seems there are a few projects (including yours) that had already laid claim to the name "subsync". I should have done a better job Googling before picking the name 😅

I'm curious to hear more about your approach. In your experience, did it typically achieve perfect synchronization when breaks / splits were present in the video / subtitles? I've been playing with some ideas but it would be good to hear more from folks with more experience before potentially going down a dead end.

@smacke no worries about the name from my side of things.

The approach worked very well for me in the tests I did. The matching was very reliable even with multiple commercial breaks. The trick that did it was adjusting the displacement sensitivity to the right value; otherwise a sentence from somewhere in the subtitle might be displaced to a whole different part of the video. I did this by only allowing a displacement of 5-10%: when you are at a large scale (i.e. adjusting a sequence of multiple sentences) you allow for coarse adjustments, while at a small scale (i.e. a single sentence) you only allow very finely tuned adjustments. I performed the approach recursively, starting with an adjustment of the subtitle as a whole and going down to a single sentence. Lastly I performed some cleanup for sentences that might be overlapping slightly.
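
To make the recursive idea concrete, here is a rough sketch on binary 0/1 speech masks; this is not @tympanix's actual code, all names are hypothetical, and the naive linear scan over shifts is for readability only (the coarse levels would use FFT cross-correlation in practice):

```python
import numpy as np

def shift_mask(mask, shift):
    """Shift a binary mask forward (positive) or backward (negative) in time,
    padding the exposed end with zeros; assumes abs(shift) < len(mask)."""
    out = np.zeros_like(mask)
    if shift > 0:
        out[shift:] = mask[:-shift]
    elif shift < 0:
        out[:shift] = mask[-shift:]
    else:
        out[:] = mask
    return out

def overlap_score(video_speech, sub_speech, shift):
    """Number of frames where both masks indicate speech after shifting the subtitles."""
    n = min(len(video_speech), len(sub_speech))
    return int(np.sum(video_speech[:n] & shift_mask(sub_speech[:n], shift)))

def align_recursive(video_speech, sub_speech, offset=0, max_shift_frac=0.1, min_len=500):
    """Recursively align the subtitle mask to the video mask. At each level,
    search shifts only within +/- max_shift_frac of the current segment length,
    apply the best one, then recurse on the two halves so that deeper levels
    make smaller and smaller corrections. Returns (segment_start, total_offset)
    pairs describing the piecewise adjustment."""
    n = min(len(video_speech), len(sub_speech))
    if n < min_len:
        return [(0, offset)]
    max_shift = max(1, int(n * max_shift_frac))
    best_shift, best_score = 0, -1
    for shift in range(-max_shift, max_shift + 1):  # naive scan; FFT in practice
        score = overlap_score(video_speech[:n], sub_speech[:n], shift)
        if score > best_score:
            best_shift, best_score = shift, score
    shifted = shift_mask(sub_speech[:n], best_shift)
    mid = n // 2
    left = align_recursive(video_speech[:mid], shifted[:mid],
                           offset + best_shift, max_shift_frac, min_len)
    right = align_recursive(video_speech[mid:n], shifted[mid:],
                            offset + best_shift, max_shift_frac, min_len)
    return left + [(mid + start, off) for start, off in right]
```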

With the benchmarking you have mentioned it will also be a lot easier to fine tune the parameters for optimal performance. It might also work even better with the VAD from WebRTC used in this project.


I've been developing https://github.com/kaegi/aligner on my local machine to include the same VAD module, so that subtitle-to-video is possible too (currently changes are not published). It works pretty well in my tests.

The alignment algorithm takes about 6-8s on 2h movies with 1300 subtitle lines. It finds the perfect alignment according to some (relatively easy to understand) metric, where each split lowers the rating for the given alignment. If you lower the split penalty it can even correct the framerate difference, because it automatically finds that splitting the movie into 3-4 (almost) equal parts with slightly different offsets optimizes the alignment rating.

The algorithm is invariant under start positions (it returns the same result if all subtitle lines are moved by 10 hours) and takes roughly the same time no matter how many splits have to be introduced.

A simpler version of the algorithm, which uses the same metric without splits, terminates in the new version within a second.
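
I don't know alass's actual internals, but the "each split lowers the rating" idea can be illustrated with a toy dynamic program over subtitle lines, where each line picks one offset from a discrete candidate set and changing offsets between consecutive lines costs a penalty; everything below (names, candidate-offset discretization, scoring interface) is an assumption for illustration:

```python
def align_with_split_penalty(n_lines, candidate_offsets, line_score, split_penalty):
    """Toy DP: each subtitle line i picks one offset from candidate_offsets;
    line_score(i, offset) rates how well line i matches detected speech when
    shifted by that offset; switching offsets between consecutive lines costs
    split_penalty. Runs in O(n_lines * len(candidate_offsets)^2).
    Returns (total_score, chosen offset per line)."""
    prev = [(line_score(0, off), None) for off in candidate_offsets]
    back = []
    for i in range(1, n_lines):
        cur = []
        for j, off in enumerate(candidate_offsets):
            # best predecessor: either keep the same offset or pay the split penalty
            best_k, best_val = max(
                ((k, prev[k][0] - (split_penalty if k != j else 0.0))
                 for k in range(len(candidate_offsets))),
                key=lambda kv: kv[1],
            )
            cur.append((best_val + line_score(i, off), best_k))
        back.append(cur)
        prev = cur
    # backtrack the chosen offset index for every line
    j = max(range(len(candidate_offsets)), key=lambda k: prev[k][0])
    total_score = prev[j][0]
    chosen = [candidate_offsets[j]]
    for i in range(n_lines - 2, -1, -1):
        j = back[i][j][1]
        chosen.append(candidate_offsets[j])
    return total_score, list(reversed(chosen))
```

With a large split penalty this degenerates to a single global offset; with a small one it happily introduces many splits, which matches the behaviour described above.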


For anyone interested: you can now find the new version of aligner (now called alass to be able to find it on search engines) here.

How about the following idea.

You let Google generate automatic subtitles in your language.
Then you sync the existing subtitle to the automatically generated one.

This could also help with splits, etc.


This project is similar to what you are suggesting: it generates and translates, then syncs the existing subtitle. It doesn't use Google, though, but a local engine for the available languages.
https://github.com/sc0ty/subsync
Another project uses Google for transcription and translation, but does not sync:
https://github.com/BingLingGroup/autosub

This script seems to be able to handle some of the features this project is missing. Maybe it's worth taking a look at it, or collaborating with its author to create the perfect sub sync solution: https://github.com/kaegi/alass

see also

  • whisper - Robust Speech Recognition via Large-Scale Weak Supervision
  • openai/whisper#1770
  • WhisperTimeSync - Synchronize Whisper's timestamps over an existing accurate transcription
  • aeneas - a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
    • also has the usual limitation: Audio should match the text: large portions of spurious text or audio might produce a wrong sync map
    • example use: aeneas_execute_task audiotrack.flac input.srt 'task_language=eng|is_text_type=subtitles|os_task_file_format=srt' output.srt (this fails because aeneas cannot read SRT files; it expects a TXT file without timestamps -- see the sketch after this list)
  • forced alignment
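
One possible workaround for the aeneas limitation noted above (untested against aeneas itself) is to strip the indices and timestamps from the SRT and feed aeneas the plain text instead; a minimal sketch using the third-party srt package:

```python
# pip install srt
import srt

def srt_to_plain_text(srt_path, txt_path):
    """Strip indices and timestamps from an SRT file, leaving one text
    fragment per line, which is the plain-text input aeneas expects."""
    with open(srt_path, encoding="utf-8") as f:
        subs = list(srt.parse(f.read()))
    with open(txt_path, "w", encoding="utf-8") as f:
        for sub in subs:
            # collapse multi-line cues to a single line per fragment
            f.write(" ".join(sub.content.splitlines()) + "\n")

srt_to_plain_text("input.srt", "input.txt")
```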