

If you like my code, please donate

Pyannote plays and Whisper rhymes

Whisper's transcription plus Pyannote's diarization

Andrej Karpathy suggested training a classifier on top of features from OpenAI's openai/whisper model to identify the speaker, so that speakers can be visualized in the transcript. But, as Christian Perone pointed out, Whisper's features are unlikely to be useful for speaker recognition, since the model's training objective effectively teaches it to ignore speaker differences.

In the following, I use pyannote-audio, a speaker diarization toolkit by Hervé Bredin, to identify the speakers, and then match the speaker segments with Whisper's transcriptions. I try it on part of an interview with Freeman Dyson. Check the result here.

To make it easier to match the transcriptions to the diarization at speaker changes, I applied Sarah Kaiser's suggestion to run pyannote.audio first and then run Whisper on the split-by-speaker chunks.
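The diarize-then-transcribe flow can be sketched roughly as below. This is a minimal sketch, assuming pyannote.audio 2.x, openai-whisper, and pydub are installed and a Hugging Face access token is available; the file names, the token placeholder, and the `merge_consecutive_turns` helper are illustrative, not part of the original notebook:

```python
def merge_consecutive_turns(turns, max_gap=1.0):
    """Merge consecutive (start, end, speaker) turns of the same speaker
    that are separated by at most `max_gap` seconds, so Whisper runs on
    fewer, longer chunks."""
    merged = []
    for start, end, speaker in turns:
        if merged and merged[-1][2] == speaker and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end, speaker)
        else:
            merged.append((start, end, speaker))
    return merged

def diarize_then_transcribe(wav_path, hf_token, model_name="base"):
    """Run pyannote diarization first, then Whisper on each speaker chunk."""
    # Heavy dependencies imported lazily so the helper above stays standalone.
    from pyannote.audio import Pipeline
    import whisper
    from pydub import AudioSegment

    # 1. Diarize: who speaks when (requires a Hugging Face access token).
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token=hf_token)
    diarization = pipeline(wav_path)
    turns = [(turn.start, turn.end, speaker)
             for turn, _, speaker in diarization.itertracks(yield_label=True)]

    # 2. Cut the audio at speaker changes and transcribe each chunk.
    audio = AudioSegment.from_wav(wav_path)
    model = whisper.load_model(model_name)
    lines = []
    for start, end, speaker in merge_consecutive_turns(turns):
        chunk = audio[int(start * 1000):int(end * 1000)]  # pydub slices in ms
        chunk.export("chunk.wav", format="wav")
        text = model.transcribe("chunk.wav")["text"]
        lines.append(f"{speaker}: {text.strip()}")
    return lines
```

Merging adjacent same-speaker turns is a design choice: short back-to-back turns give Whisper too little context, and transcription quality improves on longer chunks.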

(For the sake of performance, I also tried concatenating the audio segments into a single audio file, with a silent or beep spacer as a separator, and running Whisper on it once; see it on Colab. It works on some audio and fails on others, e.g. Dyson's interview. The problem is that Whisper does not reliably produce a timestamp at a spacer. See the discussions #139 and #29.)
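The spacer variant above can be sketched as follows. This is a hypothetical sketch, assuming pydub is installed; the `chunk_for_timestamp` helper, the fixed 2-second spacer, and the output file name are my own illustrative choices. Note that mapping Whisper timestamps back to chunks is exactly the step that breaks when Whisper fails to timestamp a spacer:

```python
def chunk_for_timestamp(t, chunk_durations, spacer=2.0):
    """Map a Whisper timestamp `t` (seconds in the concatenated file) back
    to the index of the original speaker chunk, or None if it falls inside
    a spacer. Assumed layout: chunk0, spacer, chunk1, spacer, ..."""
    pos = 0.0
    for i, duration in enumerate(chunk_durations):
        if pos <= t < pos + duration:
            return i
        pos += duration + spacer
    return None

def concatenate_with_spacers(chunk_paths, spacer_ms=2000):
    """Join per-speaker wav chunks into one file, with a silent spacer
    between consecutive chunks, then transcribe the result in one pass."""
    from pydub import AudioSegment  # imported lazily; heavy dependency
    spacer = AudioSegment.silent(duration=spacer_ms)
    combined = AudioSegment.empty()
    for i, path in enumerate(chunk_paths):
        if i:
            combined += spacer
        combined += AudioSegment.from_wav(path)
    combined.export("combined.wav", format="wav")
    return "combined.wav"
```

With this layout, each segment that Whisper returns (with `start`/`end` timestamps) can be attributed to a speaker via `chunk_for_timestamp` — provided Whisper's timestamps actually respect the spacer boundaries, which, as noted above, is not reliable.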

About

License: MIT License


Languages

Language: Jupyter Notebook 100.0%