jdepoix / youtube-transcript-api

This is a Python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, and it requires neither an API key nor a headless browser, unlike Selenium-based solutions!


TextFormatter not working with format_transcripts

angel-luis opened this issue

To Reproduce

Steps to reproduce the behavior:

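from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter

# video_ids is a list of YouTube video ID strings (not shown in the report)
transcripts = YouTubeTranscriptApi.get_transcripts(video_ids, languages=['en', 'es'])
formatter = TextFormatter()
formatter.format_transcripts(transcripts)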

What code / cli command are you executing?

python my_file.py

Which Python version are you using?

Python 3.11.4

Which version of youtube-transcript-api are you using?

youtube-transcript-api 0.6.1

Expected behavior

The same code is working with JSONFormatter and PrettyPrintFormatter.

Actual behaviour

Instead I received the following error message:

line 71, in <genexpr>
    return '\n'.join(line['text'] for line in transcript)
                     ~~~~^^^^^^^^
TypeError: string indices must be integers, not 'str'

I provide a solution that is working for me:

def format_transcript(self, transcript, **kwargs):
    video_id = list(transcript[0].keys())[0]
    return '\n'.join(line['text'] for line in transcript[0][video_id])

Hi @angel-luis,
get_transcripts returns a tuple containing a dict of transcripts, keyed by video ID, and a list of the videos which could not be retrieved: ({str: [{'text': str, 'start': float, 'duration': float}]}, [str]). However, the param for format_transcripts should be a list of transcripts. So you will have to transform the output of get_transcripts into a list of transcripts before passing it to format_transcripts. Like:

transcript_dict, _ = YouTubeTranscriptApi.get_transcripts(video_ids, languages=['en', 'es'])
formatter.format_transcripts(list(transcript_dict.values()))

The code which you provided will only format the transcript of the first video in the list. If your list actually just contains one video, you can simply use formatter.format_transcript(YouTubeTranscriptApi.get_transcript(video_ids[0])) instead.
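
The shape mismatch can be sketched offline, without calling YouTube. This is only a minimal sketch under stated assumptions: fake_result is made-up data imitating the tuple described above, and join_text is a hypothetical stand-in that copies the one-line join shown in the traceback. It is not the library's formatter class.

```python
# Made-up data imitating the (transcripts_by_video_id, unretrievable_ids)
# tuple described above; the video IDs and texts are placeholders.
fake_result = (
    {
        "video_a": [
            {"text": "Hello", "start": 0.0},
            {"text": "world", "start": 1.5},
        ],
        "video_b": [
            {"text": "Hola", "start": 0.0},
        ],
    },
    ["video_c"],
)


def join_text(transcript):
    """Hypothetical stand-in for the join shown in the traceback."""
    return "\n".join(line["text"] for line in transcript)


# Unpack the tuple first, then turn the dict values into a list of transcripts.
transcript_dict, unretrievable = fake_result
transcripts = list(transcript_dict.values())

# Format each transcript on its own, one list of lines per video.
formatted = [join_text(transcript) for transcript in transcripts]
print(formatted)
```

With the real library, the same unpack-then-list step would come before handing the data to the formatter.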

I agree that the docstrings aren't very clear here.

I find it inconvenient here, as JSONFormatter can take in the result of YouTubeTranscriptApi.get_transcripts(video_ids, languages=['en', 'es']), but TextFormatter can't.
I'm still getting TypeError: string indices must be integers, not 'str' and TypeError: list indices must be integers or slices, not str. I think I can get the correct output soon.
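
One plausible reading of these two errors, sketched offline: the first appears when a dict of transcripts is iterated as if it were a single transcript, the second when a list of transcripts is passed where a single transcript is expected. The code behind this comment is not shown, so this is an assumption. join_text is a hypothetical stand-in for the join in the traceback, and the data is made up.

```python
def join_text(transcript):
    """Hypothetical stand-in for the join shown in the traceback."""
    return "\n".join(line["text"] for line in transcript)


# Made-up dict of transcripts keyed by a placeholder video ID.
transcript_dict = {"video_a": [{"text": "Hello", "start": 0.0}]}

errors = []

# Iterating a dict yields its keys (strings), so line["text"] indexes a str.
try:
    join_text(transcript_dict)
except TypeError as exc:
    errors.append(str(exc))

# A list of transcripts passed as one transcript makes each "line" a list.
try:
    join_text(list(transcript_dict.values()))
except TypeError as exc:
    errors.append(str(exc))

for message in errors:
    print(message)
```

If this matches the situation, the first error points at a missing unpack of the tuple or dict, and the second at calling the single-transcript method with several transcripts.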