jdepoix / youtube-transcript-api

This is a python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require an API key nor a headless browser, like other selenium based solutions do!

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Enhance Timestamp Duration for More Content in Transcript Segments

chiragksharma opened this issue · comments

Discussed in #250

Originally posted by chiragksharma January 18, 2024

Issue Description

Currently, the transcript segments generated from YouTube videos include timestamps that are too brief, often containing only a single line or a very short paragraph of text. This format leads to an excessive number of timestamps with minimal content under each, which can be inconvenient for users seeking more substantial information per segment.

Example of Current Output

[0.16] just one year after launch chat GPT has
[2.56] well over 100 million daily active users
[5.279] and next week open AI is opening the
[7.04] floodgates allowing developers to profit
[9.12] on their platform by selling custom GPT
11.44] agents all you have to do is convince 1%
[13.48] of the user base to pay you $1 per month

Desired Output

[00:00](https://www.youtube.com/watch?v=undefined&t=0s)
I'm Arvind Srinivas. I'm the co-founder and CEO of perplexity AI. Perplexity is a conversational and search engine that aims to deliver answers to you, to whatever questions you may ask. We are trying to revolutionize how people consume information online. Instead of getting ten blue links, they can just ask questions in natural language and just get it answered instantly.
[00:19](https://www.youtube.com/watch?v=undefined&t=19s)
And we launch the product on December 7th, 2022. We have like about 10 million monthly active users at this point. It's basically grown thousand X over a period of one year. So I grew up in India, studied in one of the IIT's there, and I was really into algorithms programing ever since the beginning.
[00:47](https://www.youtube.com/watch?v=undefined&t=47s)
A friend of mine told me about a machine learning contest, which I didn't even know what machine learning was, what? All they told me was, hey, there's this data set and you can figure out a way to predict the output given the input. And it was fun. And I won the contest and I didn't spend a lot of time on it, and it came more naturally.
[01:02](https://www.youtube.com/watch?v=undefined&t=62s)
So I decided to go deeper into it. And I went and did my PhD in Berkeley on AI and deep learning. I worked at OpenAI in 2018 summer as a research intern. I thought I was good, okay, I did really well in India. I came to Berkeley. I'm like, definitely one of the top AI PhD students. And then I went to OpenAI and I felt like really bad because people were so much better than me.
[01:22](https://www.youtube.com/watch?v=undefined&t=82s)
It was a big reality check that, okay, I could improve a lot more in programing. I could improve a lot more in first principles. Thinking my clarity of thoughts. After an internship at OpenAI in 2018, that was when GPT 1 was published. We realized that there is this new form of learning using all the internet data and learning from it, and I figured that was going to be more important.
[01:41](https://www.youtube.com/watch?v=undefined&t=101s)
So I told my advisor that this is the right thing to do. We should go work on this. And he was actually like pretty open minded and said, okay, you know what? Like I'm not a specialist here, but let's try. I mean, if this is the next thing, the best way to learn a new topic is to force yourself to teach it to others.

Can we get this link as well so that whenever the user clicks, it redirects to that time stamp.

I am building a flask app using this api and a beginner in writing code. Please help me solve this problem, been stuck on it for a long time.

My current code

def get_transcript(video_id):
    transcript_list = YouTubeTranscriptApi.get_transcript(video_id,languages=['en', 'hi'])
    transcript = ''
    for segment in transcript_list:
    
        start_time = segment['start']
        transcript += f"[{start_time}] {segment['text']}\n\n"

    return transcript

Hi @chiragksharma,
I am sorry but this is neither an issue with this module nor a feature request, it is more of a general programming question, therefore I will close this issue. It is okay to be discussion in the discussions section, or you could ask on StackOverflow, as this really is not related to this module. However, you should be able to find all the information you need to implement this on StackOverflow, without creating a new question.

Hey @jdepoix
Thank you for your response and guidance regarding the handling of my query. I understand that my question falls outside the typical issues or feature requests for this module. However, I'd like to propose a documentation enhancement based on my recent experience with the module.

Suggestion:

I propose adding an example of a custom formatter that allows users to adjust the time duration of transcript segments in the documentation. This addition could serve as a valuable reference for future users looking to customize their transcript formats beyond the standard options.

Contribution:

I have already implemented such a function in my project and would be happy to contribute this to the documentation or the main formatters.py file. I believe this could save time for others facing similar requirements.

Could you please let me know if adding this as a pull request would be appropriate, and if so, would you prefer it in the documentation or the codebase?

Looking forward to your thoughts on this.