Add support for specifying the number of speakers in ASRDiarizationPipeline

Question

Demon-tk opened this issue a year ago

Hi @speechbox developers,

I've been using the ASRDiarizationPipeline and noticed that there isn't a built-in option to specify the number of speakers when performing diarization. This feature would be very helpful for scenarios where the number of speakers is already known or can be estimated beforehand, as it can potentially improve the performance of the speaker diarization process.

Patrick von Platen · Answer 1 · Thu Jun 15 2023 16:20:12 GMT+0800 (China Standard Time)

cc @sanchit-gandhi

utility-aagrawal · Answer 2 · Thu Aug 24 2023 05:06:33 GMT+0800 (China Standard Time)

@Demon-tk If you need a workaround for the time being, I was able to make num_speakers, min_speakers, and max_speakers work with the following minor change in the diarize.py file:

Stop forwarding **kwargs to the ASR pipeline call:
asr_out = self.asr_pipeline(
    {"array": inputs, "sampling_rate": self.sampling_rate},
    return_timestamps=True,
    # **kwargs,  # commented out so the speaker arguments are not passed to the ASR pipeline
)

Now pass any of these three arguments along with the audio file, like this:
pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-medium", device=device)
out = pipeline(input_vid_path, min_speakers=2)

Let me know if you have any questions.

@speechbox developers, let me know if you see anything wrong with this workaround. Thanks!
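
For anyone wondering why this workaround helps, here is a toy sketch. It assumes the pipeline forwards the same **kwargs to both the diarizer and the ASR model, which is what the change above suggests. The functions toy_diarizer and toy_asr are hypothetical stand-ins, not the real speechbox, pyannote, or transformers objects.

```python
# Toy stand-ins for illustration only; these are NOT the real pipelines.
def toy_diarizer(audio, num_speakers=None, min_speakers=None, max_speakers=None):
    """Hypothetical diarizer that accepts speaker-count hints."""
    return {
        "num_speakers": num_speakers,
        "min_speakers": min_speakers,
        "max_speakers": max_speakers,
    }


def toy_asr(audio, return_timestamps=False):
    """Hypothetical ASR model that knows nothing about speaker counts."""
    return {"return_timestamps": return_timestamps}


def call_forwarding_to_both(audio, **kwargs):
    # Assumed original behaviour: the same kwargs go to both components.
    diarization = toy_diarizer(audio, **kwargs)
    asr_out = toy_asr(audio, return_timestamps=True, **kwargs)
    return diarization, asr_out


def call_with_workaround(audio, **kwargs):
    # Workaround: kwargs reach the diarizer only.
    diarization = toy_diarizer(audio, **kwargs)
    asr_out = toy_asr(audio, return_timestamps=True)
    return diarization, asr_out


# Forwarding min_speakers to the ASR stand-in fails with a TypeError.
try:
    call_forwarding_to_both([0.0], min_speakers=2)
    forwarding_failed = False
except TypeError:
    forwarding_failed = True

# With the workaround, the hint reaches the diarizer and ASR is untouched.
diarization, asr_out = call_with_workaround([0.0], min_speakers=2)
```

The trade-off is that, with this change, ASR-specific arguments can no longer be passed through the call at all.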

Sanchit Gandhi · Answer 3 · Fri Aug 25 2023 21:26:09 GMT+0800 (China Standard Time)

That's a valid workaround. What we can probably do is have separate kwargs for the diarization pipeline and for the ASR pipeline.

Would you like to open a PR, @utility-aagrawal or @Demon-tk, to add this support? It would look very similar to the encoder-decoder-specific kwargs that we have in transformers: https://github.com/huggingface/transformers/blob/dd8b7d28aec80013ad2b25ead4200eea1a6a767e/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L458-L464
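
A minimal sketch of what such prefix-based routing could look like, modelled on the way the encoder-decoder code separates "decoder_"-prefixed arguments from the rest. The helper name split_pipeline_kwargs and the "diarization_" prefix are assumptions for illustration, not the code that ends up in speechbox, and batch_size is just an example of an unprefixed argument.

```python
def split_pipeline_kwargs(kwargs, prefix="diarization_"):
    """Split call-time kwargs into (asr_kwargs, diarization_kwargs).

    Hypothetical helper: keys starting with `prefix` are routed to the
    diarization pipeline with the prefix stripped; all other keys are
    routed to the ASR pipeline unchanged.
    """
    diarization_kwargs = {
        key[len(prefix):]: value
        for key, value in kwargs.items()
        if key.startswith(prefix)
    }
    asr_kwargs = {
        key: value
        for key, value in kwargs.items()
        if not key.startswith(prefix)
    }
    return asr_kwargs, diarization_kwargs


# Example: one prefixed argument and one unprefixed argument.
asr_kwargs, diarization_kwargs = split_pipeline_kwargs(
    {"diarization_num_speakers": 2, "batch_size": 8}
)
```

Inside the pipeline's call method, the two dictionaries would then be unpacked into the ASR call and the diarization call respectively, so neither component sees arguments meant for the other.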

utility-aagrawal · Answer 4 · Fri Aug 25 2023 21:34:18 GMT+0800 (China Standard Time)

Thanks @sanchit-gandhi! I can do that for both issues #25 and #27.

utility-aagrawal · Answer 5 · Mon Aug 28 2023 23:01:58 GMT+0800 (China Standard Time)

@Demon-tk, I have added separate kwargs for the ASR and diarization pipelines. You should now be able to specify the number of speakers in the ASRDiarizationPipeline. Please note that you need to prefix the argument name with 'diarization_' for it to be passed to the diarization pipeline:

pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-medium", device=device)
out = pipeline(input_vid_path, diarization_num_speakers=2)

Please close this thread if there are no further questions/issues. Thanks!