jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!

expose functionality for formatting timestamps in textual formats

tomasohara opened this issue

Is your feature request related to a problem? Please describe.

Yes, I am trying to format transcripts similar to what YouTube shows under a video.

via https://www.youtube.com/watch?v=vGP4pQdCocw:

Intro
0:00	welcome to the langchang cookbook part 2
0:02	where we're going to cover the nine
0:04	major use cases of Lane chain in part

Describe the solution you'd like

Add options to the constructor(s) so that the timestamps can be added as above.

The functionality is there, but it is just not used in the current code.

via formatters.py:

class WebVTTFormatter(_TextBasedFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, mins, secs, ms)

Describe alternatives you've considered

Coding it myself. However, it seems easier to expose the existing functionality. Plus, I'm guessing someone knows how it is supposed to be put together.

Additional context

Thanks for making this utility available.

@tomasohara You can get what you're looking for right now, if you don't mind using an internal method, e.g.:

# For example, to convert 100.12 to 00:01:40.120 ...

>>> from youtube_transcript_api.formatters import WebVTTFormatter

>>> timestamp_formatter = WebVTTFormatter()
>>> timestamp_formatter._seconds_to_timestamp(100.12)
'00:01:40.120'

The usual caveats apply: The method's not intended for use outside of the class so you do so at your own risk, etc.
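If relying on a private method is a concern, the conversion is small enough to write by hand. The following is a minimal standalone sketch, not the library's own code; the function name `seconds_to_webvtt_timestamp` is hypothetical, and only the output format is taken from the `_format_timestamp` shown earlier in the thread.

```python
def seconds_to_webvtt_timestamp(seconds):
    """Convert a number of seconds to an HH:MM:SS.mmm string."""
    # Work in whole milliseconds to avoid float drift in the fractional part.
    total_ms = int(round(float(seconds) * 1000))
    total_secs, ms = divmod(total_ms, 1000)
    total_mins, secs = divmod(total_secs, 60)
    hours, mins = divmod(total_mins, 60)
    return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, mins, secs, ms)


print(seconds_to_webvtt_timestamp(100.12))  # 00:01:40.120
```

This trades the dependency on an internal API for a few lines of code that you maintain yourself.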

Hi @tomasohara, sorry for the late reply!

Since WebVTT is a well-defined format, I don't think the formatters API should allow changing that format. However, the formatter interface is meant to be extensible, so you can easily define your own formatters or adjust existing ones. The easiest thing you could do in your case is to extend the WebVTTFormatter and override _format_timestamp:

from youtube_transcript_api.formatters import WebVTTFormatter

class CustomDateFormatWebVTTFormatter(WebVTTFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "<your-custom-format>".format(hours, mins, secs, ms)

I will close this issue now, as this is the intended solution. Let me know if this doesn't cut it for some reason and I will reopen.
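To illustrate what a custom format string could look like, here is a small sketch that is independent of the library. The `CUSTOM_FORMAT` value and the free function `format_timestamp` are hypothetical; the only thing taken from the thread is that the method receives `hours`, `mins`, `secs`, and `ms` as integers.

```python
# Hypothetical format string: H:MM:SS, with milliseconds left out.
CUSTOM_FORMAT = "{0}:{1:02d}:{2:02d}"


def format_timestamp(hours, mins, secs, ms):
    # str.format ignores unused positional arguments,
    # so ms can be passed through without appearing in the output.
    return CUSTOM_FORMAT.format(hours, mins, secs, ms)


print(format_timestamp(0, 1, 40, 120))  # 0:01:40
```

The same format string could serve as the return expression of an overridden `_format_timestamp`.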

No need to re-open it. Following the tip, here is what I had in mind (in case someone else has a similar request):

import re  # needed for the substitutions below

class YouTubeLikeFormatter(_TextBasedFormatter):
    """Uses format similar to that under YouTube's Transcript pane:
         0:16 This is the city after a storm, ...
         0:23 "Once you learn to see as an artist, ...
    """

    def _format_timestamp(self, hours, mins, secs, _ms):
        # format as HH:MM:SS w/ 00 hour omitted and with leading zeros dropped
        timestamp = "{:02d}:{:02d}:{:02d}".format(hours, mins, secs)
        timestamp = re.sub("^00:", "", timestamp)
        timestamp = re.sub("^0", "", timestamp)
        return timestamp

    def _format_transcript_header(self, lines):
        return "Transcript\n\n" + "\n\n".join(lines) + "\n"

    def _format_transcript_helper(self, i, time_text, line):
        # drops second timestamp (e.g., "00:00:28.500 --> 00:00:30.060" => "00:00:28.500")
        time_text = re.sub(r" --> \S+", "", time_text)
        return "{} {}".format(time_text, line['text'])

Here's sample usage:

In [1]: import youtube_transcript_api as ytt_api, youtube_transcript_api.formatters as formatters
   ...: 
   ...: transcript = ytt_api.YouTubeTranscriptApi.get_transcript("3UWxmt7VAlU")
   ...: 
   ...: print(formatters.YouTubeLikeFormatter().format_transcript(transcript)[: 256])
   ...: 
Transcript

0:16 This is the city after a storm, a scene through the eyes of Wilmington’s own Edward Loper, Sr., 

0:23 "Once you learn to see as an artist, the world will never look the same again."

0:30 Painted in 1937, this moody landscape would become

Something like this would make the README clearer than the existing "Provided Formatter Example". (By the way, using a text-based example seems better than JSON to account for all the formatting.)

Thanks,
Tom
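
For anyone who wants the same output without subclassing the library's private base class, the formatting can also be sketched as two plain functions. This is an assumption-laden sketch, not part of the library: the names `youtube_like_timestamp` and `format_youtube_like` are hypothetical, and each transcript entry is assumed to be a dict with `text` and `start` keys, where `start` is in seconds.

```python
def youtube_like_timestamp(seconds):
    """Format seconds as M:SS, or H:MM:SS once the hour is non-zero."""
    total = int(seconds)  # drop fractional seconds
    hours, remainder = divmod(total, 3600)
    mins, secs = divmod(remainder, 60)
    if hours:
        return "{}:{:02d}:{:02d}".format(hours, mins, secs)
    return "{}:{:02d}".format(mins, secs)


def format_youtube_like(transcript):
    """Join entries as 'timestamp text' paragraphs under a 'Transcript' header."""
    lines = [
        "{} {}".format(youtube_like_timestamp(entry["start"]), entry["text"])
        for entry in transcript
    ]
    return "Transcript\n\n" + "\n\n".join(lines) + "\n"
```

Building the timestamp with `divmod` avoids the two regular-expression passes that strip the leading "00:" and "0" in the class-based version.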