huggingface / speechbox

Add support for specifying the number of speakers in ASRDiarizationPipeline

Demon-tk opened this issue

Hi @speechbox developers,

I've been using the ASRDiarizationPipeline and noticed that there isn't a built-in option to specify the number of speakers when performing diarization. This feature would be very helpful for scenarios where the number of speakers is already known or can be estimated beforehand, as it can potentially improve the performance of the speaker diarization process.

@Demon-tk If you need a workaround for the time being, I was able to make num_speakers, min_speakers, and max_speakers work with the following minor change in diarize.py:

  • Stop forwarding **kwargs to the ASR pipeline:
    asr_out = self.asr_pipeline(
        {"array": inputs, "sampling_rate": self.sampling_rate},
        return_timestamps=True,
        # **kwargs,  # disabled so the speaker arguments are not passed to the ASR pipeline
    )

Now pass any of these three arguments along with the audio file, like this:
    pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-medium", device=device)
    out = pipeline(input_vid_path, min_speakers=2)

Let me know if you have any questions.

@speechbox developers, let me know if you see anything wrong with this workaround. Thanks!

That's a valid workaround. What we can probably do is have separate kwargs for the diarization pipeline and for the ASR pipeline.

Would you like to open a PR @utility-aagrawal or @Demon-tk to add this support? It would look very similar to specific encoder-decoder kwargs that we have in transformers: https://github.com/huggingface/transformers/blob/dd8b7d28aec80013ad2b25ead4200eea1a6a767e/src/transformers/models/encoder_decoder/modeling_encoder_decoder.py#L458-L464
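To illustrate the idea, here is a minimal sketch of prefix-based kwarg routing in the style of the linked encoder-decoder code. The helper name split_pipeline_kwargs and the "diarization_" prefix are assumptions made for illustration, not the actual speechbox implementation: arguments that start with the prefix are stripped of it and sent to the diarization pipeline, and everything else goes to the ASR pipeline.

```python
from typing import Any, Dict, Tuple

# Assumed prefix, mirroring the "decoder_" prefix used in the linked transformers code.
DIARIZATION_PREFIX = "diarization_"


def split_pipeline_kwargs(
    kwargs: Dict[str, Any], prefix: str = DIARIZATION_PREFIX
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Hypothetical helper: split call kwargs into (asr_kwargs, diarization_kwargs)."""
    # Unprefixed arguments are treated as ASR pipeline arguments.
    asr_kwargs = {k: v for k, v in kwargs.items() if not k.startswith(prefix)}
    # Prefixed arguments lose the prefix before reaching the diarization pipeline.
    diarization_kwargs = {
        k[len(prefix):]: v for k, v in kwargs.items() if k.startswith(prefix)
    }
    return asr_kwargs, diarization_kwargs


asr_kwargs, diarization_kwargs = split_pipeline_kwargs(
    {"diarization_num_speakers": 2, "chunk_length_s": 30}
)
print(asr_kwargs)
print(diarization_kwargs)
```

With this split, each sub-pipeline only receives arguments meant for it, so a speaker-count argument never reaches the ASR call.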

Thanks @sanchit-gandhi! I can do that for both issues #25 and #27.

@Demon-tk, I have added separate kwargs for the ASR and diarization pipelines. You should now be able to specify the number of speakers in ASRDiarizationPipeline. Please note that you need to prefix the argument name with 'diarization_' for it to be passed to the diarization pipeline:

    pipeline = ASRDiarizationPipeline.from_pretrained("openai/whisper-medium", device=device)
    out = pipeline(input_vid_path, diarization_num_speakers=2)

Please close this thread if there are no further questions/issues. Thanks!