TextFormatter not working with format_transcripts

Question

angel-luis opened this issue 9 months ago · comments

To Reproduce

Steps to reproduce the behavior:

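from youtube_transcript_api import YouTubeTranscriptApi
from youtube_transcript_api.formatters import TextFormatter

# video_ids is a list of YouTube video ID strings
transcripts = YouTubeTranscriptApi.get_transcripts(video_ids, languages=['en', 'es'])
formatter = TextFormatter()
formatter.format_transcripts(transcripts)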

What code / cli command are you executing?

python my_file.py

Which Python version are you using?

Python 3.11.4

Which version of youtube-transcript-api are you using?

youtube-transcript-api 0.6.1

Expected behavior

The same code works with JSONFormatter and PrettyPrintFormatter.

Actual behaviour

Instead I received the following error message:

line 71, in <genexpr>
    return '\n'.join(line['text'] for line in transcript)
                     ~~~~^^^^^^^^
TypeError: string indices must be integers, not 'str'

Angel Luis · Answer 1 · Mon Oct 09 2023 21:48:11 GMT+0800 (China Standard Time)

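Here is a workaround that works for me (a replacement for TextFormatter.format_transcript):

def format_transcript(self, transcript, **kwargs):
    # transcript[0] is the dict of transcripts; take its first video ID
    video_id = list(transcript[0].keys())[0]
    return '\n'.join(line['text'] for line in transcript[0][video_id])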

Jonas Depoix · Answer 2 · Sun Oct 15 2023 00:22:22 GMT+0800 (China Standard Time)

Hi @angel-luis,
get_transcripts returns a tuple containing a dict of transcripts, keyed by video ID, and a list of the video IDs that could not be retrieved: ({str: [{'text': str, 'start': float, 'duration': float}]}, [str]). However, format_transcripts expects a list of transcripts, so you have to turn the output of get_transcripts into a list of transcripts before calling format_transcripts. For example:

transcript_dict, _ = YouTubeTranscriptApi.get_transcripts(video_ids, languages=['en', 'es'])
formatter.format_transcripts(list(transcript_dict.values()))

The code which you provided will only format the transcript of the first video in the list. If your list actually just contains one video, you can simply use formatter.format_transcript(YouTubeTranscriptApi.get_transcript(video_ids[0])) instead.

I agree that the docstrings aren't very clear here.
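
To make the shapes concrete, here is a minimal offline sketch. SketchTextFormatter is a hypothetical stand-in, not the library class: it only mirrors the join shown in the traceback, and the separator between transcripts is an assumption. The sample data is made up, in the shape described above, and makes no network call.

```python
# Hypothetical stand-in for TextFormatter, reduced to the join from the traceback.
class SketchTextFormatter:
    def format_transcript(self, transcript):
        # One transcript: a list of snippet dicts, each with a 'text' key.
        return "\n".join(line["text"] for line in transcript)

    def format_transcripts(self, transcripts):
        # Several transcripts: a list of such lists.
        # The three-newline separator is an assumption made for this sketch.
        return "\n\n\n".join(self.format_transcript(t) for t in transcripts)


# Made-up data shaped like the described return value of get_transcripts:
# (dict of video ID -> transcript, list of IDs that could not be retrieved).
result = (
    {
        "video_a": [{"text": "Hello", "start": 0.0}],
        "video_b": [
            {"text": "Hola", "start": 0.0},
            {"text": "mundo", "start": 2.0},
        ],
    },
    ["video_c"],
)

# Unpack the tuple first, then pass only the transcripts, as a list.
transcript_dict, unretrievable = result
formatter = SketchTextFormatter()
text = formatter.format_transcripts(list(transcript_dict.values()))
print(text)
```

Unpacking the tuple is the important step: the second element holds the failed video IDs and should never reach the formatter.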

GrimPixel · Answer 3 · Mon Jun 10 2024 01:33:23 GMT+0800 (China Standard Time)

I find this inconvenient: JSONFormatter can take the result of YouTubeTranscriptApi.get_transcripts(video_ids, languages=['en', 'es']) directly, but TextFormatter can't.
I'm still getting TypeError: string indices must be integers, not 'str' and TypeError: list indices must be integers or slices, not str. I think I can get the correct output soon.
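
A possible explanation for the difference, sketched under the assumption that the JSON formatter simply hands its argument to json.dumps while the text formatter indexes each line with ['text']: json.dumps accepts any nesting of tuples, dicts and lists, whereas the text join only works on a list of snippet dicts. The join_text helper and the sample data below are hypothetical, and the two failing calls are guesses at what produces each of the two TypeErrors.

```python
import json

# Made-up value in the shape of a get_transcripts result.
raw = ({"video_a": [{"text": "Hello", "start": 0.0}]}, [])

# json.dumps serializes the whole tuple without caring about its shape.
as_json = json.dumps(raw)


def join_text(transcript):
    # Same expression as in the traceback: needs an iterable of dicts.
    return "\n".join(line["text"] for line in transcript)


def error_message(value):
    # Return the TypeError message raised by join_text, or None on success.
    try:
        join_text(value)
    except TypeError as exc:
        return str(exc)
    return None


# Iterating the dict yields video IDs (strings), so line['text'] indexes a str.
msg_from_dict = error_message(raw[0])

# Iterating dict.values() yields whole transcripts (lists), so line['text'] indexes a list.
msg_from_values = error_message(raw[0].values())

# A single transcript (a list of snippet dicts) is the shape the join expects.
ok = join_text(raw[0]["video_a"])

print(msg_from_dict)
print(msg_from_values)
print(ok)
```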