wa3dbk / ScribeSalad

A collection of YouTube video transcripts: podcasts (Joe Rogan Experience, Tim Ferris, Jocko podcast, ..), lectures (YaleCourses, MIT lectures, Jordan B. Peterson talks, ..). A big transcript salad spanning history, geography, science, politics, filmmaking and more.

Some yt_auto captions all appear within first few seconds of webVTT timestamps

richieM opened this issue

For some English yt_auto transcripts, the entire transcript incorrectly appears within the first few seconds of the WebVTT file.

Some examples:

https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/AndrewHuberman/yt_auto/DTCmprPCDqc.en.vtt
https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/8NewsNowLasVegas/yt_auto/-0IjUVDKY10.en.vtt
https://github.com/wa3dbk/ScribeSalad/blob/master/transcripts/en/GlobalNews/yt_auto/-2Yl-90jzi0.en.vtt

Seems to be a fairly widespread issue.
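To get a rough idea of how widespread it is, something like the following could flag suspect files. This is only a sketch: it assumes the affected files have every cue timing inside a short initial window, and the names `looks_collapsed`, `window_seconds` and `min_cues` (and their default values) are hypothetical choices, not anything defined in this repo.

```python
import re

# Matches a WebVTT cue timing line, e.g. "00:00:01.500 --> 00:00:03.000".
# The hours field is optional in WebVTT, so both forms are accepted.
CUE_TIMING = re.compile(
    r"^(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)


def _seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def cue_end_times(vtt_text):
    """Return the end time (in seconds) of every cue found in the text."""
    ends = []
    for line in vtt_text.splitlines():
        match = CUE_TIMING.match(line.strip())
        if match:
            # Groups 5-8 hold the end timestamp of the cue.
            ends.append(_seconds(*match.groups()[4:]))
    return ends


def looks_collapsed(vtt_text, window_seconds=10.0, min_cues=20):
    """Heuristic (assumed thresholds): many cues, all ending within the window."""
    ends = cue_end_times(vtt_text)
    return len(ends) >= min_cues and max(ends) <= window_seconds
```

Running `looks_collapsed(path.read_text(encoding="utf-8"))` over each `*.en.vtt` file under a `yt_auto` directory would give a candidate list to check by hand. The thresholds would probably need tuning, since very short videos can legitimately have all their cues in the first few seconds.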

BTW, thanks for creating this repo, it's very useful :)

Good observation!
Subtitles generated automatically by YouTube (the ones in "yt_auto") are often misaligned, empty or filled with useless tags and symbols (such as [music] and (♪♪)).
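As a rough illustration of the kind of clean-up meant here, the sketch below strips bracketed annotations, music symbols and inline markup from a single cue's text. It is not the repo's actual clean-up code: the function name `clean_cue_text` is hypothetical, and treating every `[...]` span and every `<...>` tag as noise is an assumption that may be too aggressive for some transcripts.

```python
import re

# Inline markup inside cue text, e.g. <c>, </c> or <00:00:01.200> (assumed to be noise).
INLINE_TAG = re.compile(r"<[^>]+>")
# Bracketed annotations such as [music] or [Applause] (assumed to be noise).
BRACKETED = re.compile(r"\[[^\]]*\]")
# Parentheses that contain only music symbols and whitespace, e.g. (♪♪).
MUSIC_PARENS = re.compile(r"\([\s♪♫]*\)")
# Any music symbols left over outside parentheses.
MUSIC_SYMBOL = re.compile(r"[♪♫]")


def clean_cue_text(text):
    """Remove tags and symbols from one cue's text and normalise whitespace."""
    text = INLINE_TAG.sub("", text)
    text = BRACKETED.sub(" ", text)
    text = MUSIC_PARENS.sub(" ", text)
    text = MUSIC_SYMBOL.sub(" ", text)
    return " ".join(text.split())
```

For example, `clean_cue_text("[Music] hello <c>world</c> (♪♪)")` returns `"hello world"`, and a cue containing only `[music]` becomes an empty string, which makes empty cues easy to drop afterwards.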

I plan on cleaning up these subtitles (as much as possible) and re-aligning the ones where the entire transcript appears within the first few seconds. This might take some time, given the amount of data that needs to be processed.

I'll probably start with videos in English and create a parallel "yt_auto_norm" or "yt_auto_realign" directory containing the new cleaned-up and re-aligned transcripts, then work on the remaining languages later.
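The parallel layout could be derived from the existing paths along these lines. This is a minimal sketch that assumes the current `transcripts/<lang>/<channel>/yt_auto/<id>.<lang>.vtt` structure seen in the links above; the helper name `parallel_path` is hypothetical, and the target directory name is still undecided.

```python
from pathlib import PurePosixPath


def parallel_path(path, source_dir="yt_auto", target_dir="yt_auto_norm"):
    """Map a transcript path to the same location under the parallel directory."""
    parts = PurePosixPath(path).parts
    if source_dir not in parts:
        raise ValueError(f"{path!r} is not inside a {source_dir!r} directory")
    # Swap only the directory component; the file name is left unchanged.
    new_parts = [target_dir if part == source_dir else part for part in parts]
    return str(PurePosixPath(*new_parts))
```

With this, `transcripts/en/AndrewHuberman/yt_auto/DTCmprPCDqc.en.vtt` maps to `transcripts/en/AndrewHuberman/yt_auto_norm/DTCmprPCDqc.en.vtt`, so the original files stay untouched.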

This process would make the entire repo usable for people interested in ASR (automatic speech recognition) or any kind of search/indexing.