modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization

Inference acceleration

yangyyt opened this issue · comments

When applying the speaker-classification module to hundreds of millions of audio samples, how can VAD and embedding extraction be run as batch inference?

Thanks in advance to the author for any reply and suggestions.

Batch processing is not currently supported; only multi-process and multi-GPU processing are available.
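For large-scale extraction, the multi-process/multi-GPU route usually amounts to sharding the wav list and launching one worker per GPU. A minimal sketch of that idea (this is not the repo's actual script; `extract_embedding.py` and its `--wavs` flag are hypothetical stand-ins):

```python
def shard(wav_list, num_shards):
    """Round-robin split: shard i handles files i, i+num_shards, ..."""
    return [wav_list[i::num_shards] for i in range(num_shards)]

def launch_commands(wav_paths, num_gpus):
    """Build one inference command per GPU (illustrative only; the real
    3D-Speaker recipes use their own extraction scripts and arguments)."""
    cmds = []
    for gpu, files in enumerate(shard(wav_paths, num_gpus)):
        cmds.append(
            "CUDA_VISIBLE_DEVICES=%d python extract_embedding.py --wavs %s"
            % (gpu, ",".join(files))
        )
    return cmds
```

Each command then runs independently, so throughput scales roughly with the number of GPUs even without true batched forward passes.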

> Batch processing is not currently supported; only multi-process and multi-GPU processing are available.

Thanks. Does ModelScope support the Chinese-English mixed speaker diarization model? I can't seem to find it on the website.
https://github.com/alibaba-damo-academy/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization

Also, are there plans to open-source batch inference in the future?

I tried the example on ModelScope (https://www.modelscope.cn/models/iic/speech_campplus_speaker-diarization_common/summary). For the same audio and the same embedding-extraction model, 3D-Speaker correctly separates multiple speakers but the ModelScope pipeline does not, so I suspect the ModelScope pipeline and 3D-Speaker differ in some parameters or logic. It also seems that ModelScope does not support batch inference either.

> Batch processing is not currently supported; only multi-process and multi-GPU processing are available.

> Thanks. Does ModelScope support the Chinese-English mixed speaker diarization model? I can't seem to find it on the website. https://github.com/alibaba-damo-academy/3D-Speaker/tree/main/egs/3dspeaker/speaker-diarization
>
> Also, are there plans to open-source batch inference in the future?

You can change the value of the variable `speaker_model_id` to `iic/speech_campplus_sv_zh_en_16k-common_advanced` to use the Chinese-English mixed model.
We will support batch inference in the future. @yangyyt
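If batch inference lands, the core change would be padding variable-length segments into one array and running a single forward pass over them. A hedged NumPy sketch of that padding step (the real models consume fbank features, not raw waveforms as shown here):

```python
import numpy as np

def pad_batch(segments, pad_value=0.0):
    """Stack variable-length 1-D waveforms into a (batch, max_len) array,
    padding the tail of shorter segments so one forward pass covers all."""
    max_len = max(len(s) for s in segments)
    batch = np.full((len(segments), max_len), pad_value, dtype=np.float32)
    lengths = np.zeros(len(segments), dtype=np.int64)
    for i, s in enumerate(segments):
        batch[i, : len(s)] = s
        lengths[i] = len(s)
    return batch, lengths  # lengths let the model mask out the padding
```

Returning the true lengths alongside the padded batch is what allows the model (or a pooling layer) to ignore the padded tail.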

> I tried the example on ModelScope (https://www.modelscope.cn/models/iic/speech_campplus_speaker-diarization_common/summary). For the same audio and the same embedding-extraction model, 3D-Speaker correctly separates multiple speakers but the ModelScope pipeline does not, so I suspect the ModelScope pipeline and 3D-Speaker differ in some parameters or logic. It also seems that ModelScope does not support batch inference either.

The inference processes of the two are almost identical, with only minor differences. If the outputs differ significantly, the input is probably short audio. Since the current pipeline is not robust on short audio, we recommend using longer audio (>1 min). @yangyyt
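Since short inputs are the usual culprit, one option is to guard the pipeline with a duration check before running diarization. A minimal sketch for WAV files using only the standard library (the 60-second threshold mirrors the >1 min recommendation above):

```python
import wave

def is_long_enough(wav_path, min_seconds=60.0):
    """Return True if the WAV file is at least `min_seconds` long;
    shorter files can be skipped or flagged before diarization."""
    with wave.open(wav_path, "rb") as f:
        duration = f.getnframes() / float(f.getframerate())
    return duration >= min_seconds
```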