expose functionality for formatting timestamps in textual formats
tomasohara opened this issue
Is your feature request related to a problem? Please describe.
Yes, I am trying to format transcripts similar to what is shown under YouTube.
via https://www.youtube.com/watch?v=vGP4pQdCocw:
Intro
0:00 welcome to the langchang cookbook part 2
0:02 where we're going to cover the nine
0:04 major use cases of Lane chain in part
Describe the solution you'd like
Add options to the constructor(s) so that the timestamps can be added as above.
The functionality is there, but it is just not used in the current code.
via formatters.py:
class WebVTTFormatter(_TextBasedFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, mins, secs, ms)
Describe alternatives you've considered
Coding it myself. However, it seems easier to expose the existing functionality. Plus, I'm guessing someone knows how it is supposed to be put together.
Additional context
Thanks for making this utility available.
@tomasohara You can get what you're looking for right now, if you don't mind using an internal class method, e.g.:
# For example, to convert 100.12 to 00:01:40.120 ...
>>> from youtube_transcript_api.formatters import WebVTTFormatter
>>> timestamp_formatter = WebVTTFormatter()
>>> timestamp_formatter._seconds_to_timestamp(100.12)
'00:01:40.120'
The usual caveats apply: the method isn't intended for use outside of the class, so you use it at your own risk, etc.
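For anyone who would rather not depend on an internal method, the conversion itself is plain arithmetic. The following is a minimal standalone sketch, not the library's implementation: `seconds_to_webvtt` is a hypothetical helper name, and it rounds to the nearest millisecond, which may differ from the library's rounding in edge cases.

```python
def seconds_to_webvtt(seconds):
    """Convert a number of seconds to an HH:MM:SS.mmm string.

    Hypothetical helper for illustration; not part of youtube_transcript_api.
    """
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = int(round(float(seconds) * 1000))
    total_secs, ms = divmod(total_ms, 1000)
    total_mins, secs = divmod(total_secs, 60)
    hours, mins = divmod(total_mins, 60)
    return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, mins, secs, ms)


print(seconds_to_webvtt(100.12))  # 00:01:40.120
```

Because it does not touch the library at all, a helper like this keeps working even if the private method is renamed in a later release.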
Hi @tomasohara, sorry for the late reply!
Since WebVTT is a well-defined format, I don't think it should be part of the formatters API to allow for changing that format. However, the interface of the formatters is meant to be easily extensible, so that you can define your own formatters or adjust existing ones. The easiest thing you could do in your case is to extend WebVTTFormatter and override _format_timestamp:
class CustomDateFormatWebVTTFormatter(WebVTTFormatter):
    def _format_timestamp(self, hours, mins, secs, ms):
        return "<your-custom-format>".format(hours, mins, secs, ms)
I will close this issue now, as this is the intended solution. Let me know if this doesn't cut it for some reason and I will reopen.
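To show why overriding that one method is enough, here is a small self-contained sketch. `StandInWebVTTFormatter` is a hypothetical stand-in written for this example so it runs without the library installed; it only assumes, as the snippets in this thread suggest, that the conversion method hands the split-out parts to `_format_timestamp`. `CommaSeparatedFormatter` is likewise an illustrative name.

```python
class StandInWebVTTFormatter:
    """Hypothetical stand-in for the library's WebVTTFormatter (not the real class)."""

    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d}.{:03d}".format(hours, mins, secs, ms)

    def _seconds_to_timestamp(self, time):
        # Split into parts, then delegate the final string to the hook above.
        total_ms = int(round(float(time) * 1000))
        total_secs, ms = divmod(total_ms, 1000)
        total_mins, secs = divmod(total_secs, 60)
        hours, mins = divmod(total_mins, 60)
        return self._format_timestamp(hours, mins, secs, ms)


class CommaSeparatedFormatter(StandInWebVTTFormatter):
    """Overrides only the hook; the conversion logic is inherited unchanged."""

    def _format_timestamp(self, hours, mins, secs, ms):
        return "{:02d}:{:02d}:{:02d},{:03d}".format(hours, mins, secs, ms)


print(StandInWebVTTFormatter()._seconds_to_timestamp(100.12))   # 00:01:40.120
print(CommaSeparatedFormatter()._seconds_to_timestamp(100.12))  # 00:01:40,120
```

The subclass never repeats the arithmetic; it changes only how the already computed parts are rendered.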
No need to re-open it. Following the tip, here is what I had in mind (in case someone else has a similar request):
import re

from youtube_transcript_api.formatters import _TextBasedFormatter  # only needed outside formatters.py


class YouTubeLikeFormatter(_TextBasedFormatter):
    """Uses format similar to that under YouTube's Transcript pane:
    0:16 This is the city after a storm, ...
    0:23 "Once you learn to see as an artist, ..."""

    def _format_timestamp(self, hours, mins, secs, _ms):
        # format as HH:MM:SS w/ 00 hour omitted and with leading zeros dropped
        timestamp = "{:02d}:{:02d}:{:02d}".format(hours, mins, secs)
        timestamp = re.sub("^00:", "", timestamp)
        timestamp = re.sub("^0", "", timestamp)
        return timestamp

    def _format_transcript_header(self, lines):
        # joins the formatted lines under a "Transcript" heading
        return "Transcript\n\n" + "\n\n".join(lines) + "\n"

    def _format_transcript_helper(self, i, time_text, line):
        # drops second timestamp (e.g., "0:28 --> 0:30" => "0:28")
        time_text = re.sub(r" --> \S+", "", time_text)
        return "{} {}".format(time_text, line['text'])
Here's sample usage:
In [1]: import youtube_transcript_api as ytt_api, youtube_transcript_api.formatters as formatters
...:
...: transcript = ytt_api.YouTubeTranscriptApi.get_transcript("3UWxmt7VAlU")
...:
...: print(formatters.YouTubeLikeFormatter().format_transcript(transcript)[: 256])
...:
Transcript
0:16 This is the city after a storm, a scene through the eyes of Wilmington’s own Edward Loper, Sr.,
0:23 "Once you learn to see as an artist, the world will never look the same again."
0:30 Painted in 1937, this moody landscape would become
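For comparison, the same output can be produced without subclassing or regular expressions. The sketch below is an alternative under stated assumptions, not the formatter used above: `youtube_style_timestamp` and `format_youtube_like` are hypothetical names, and the input is assumed to be a list of dicts with "text" and "start" keys, as in the transcript passed to the formatter above. The sample entries are made up for illustration.

```python
def youtube_style_timestamp(seconds):
    """Format seconds as M:SS, or H:MM:SS once the hour is non-zero."""
    total = int(seconds)  # fractional seconds are dropped, not rounded
    hours, remainder = divmod(total, 3600)
    mins, secs = divmod(remainder, 60)
    if hours:
        return "{:d}:{:02d}:{:02d}".format(hours, mins, secs)
    return "{:d}:{:02d}".format(mins, secs)


def format_youtube_like(transcript):
    """Render transcript entries as 'timestamp text' lines under a heading."""
    lines = [
        "{} {}".format(youtube_style_timestamp(entry["start"]), entry["text"])
        for entry in transcript
    ]
    return "Transcript\n\n" + "\n\n".join(lines) + "\n"


# Made-up sample entries, shaped like the assumed transcript items.
sample = [
    {"text": "first line", "start": 16.4, "duration": 6.6},
    {"text": "a much later line", "start": 3723.0, "duration": 2.0},
]
print(format_youtube_like(sample))
```

Branching on the hour value gives the same strings as stripping a leading "00:" and then a leading "0", and it avoids any dependence on the library's private methods.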
Something like this would make the README clearer than the existing "Provided Formatter Example". (By the way, using a text-based example seems better than JSON to account for all the formatting.)
Thanks,
Tom