juanmc2005 / diart

A python package to build AI-powered real-time audio applications

Home Page: https://diart.readthedocs.io


The latency of the wespeaker model is too large

SheenChi opened this issue · comments

Hello @juanmc2005,
I use the hbredin/wespeaker-voxceleb-resnet34-LM (ONNX) model to extract speaker embeddings in the diarization pipeline, but I found the per-chunk latency is too large (1300ms) with the default params (chunk=5s, step=0.5s, latency=0.5s), which cannot meet the real-time requirement.
I saw you posted a latency of 48ms on CPU and 15ms on GPU. Is there anything I need to pay attention to in order to reproduce your performance?
Thank you very much for any suggestions.
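
For reference, my setup looks roughly like this (a minimal sketch following the diart README; the loader name and config parameter names are assumptions and may differ between diart versions):

```python
# Sketch of the streaming pipeline with the wespeaker ONNX embedding model.
# Assumes diart >= 0.9; exact API names may differ in other versions.
import diart.models as m
from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.inference import StreamingInference
from diart.sources import MicrophoneAudioSource

config = SpeakerDiarizationConfig(
    segmentation=m.SegmentationModel.from_pretrained("pyannote/segmentation"),
    embedding=m.EmbeddingModel.from_pretrained("hbredin/wespeaker-voxceleb-resnet34-LM"),
    duration=5,    # chunk length in seconds (default)
    step=0.5,      # sliding window step in seconds (default)
    latency=0.5,   # minimum latency in seconds (default)
)
pipeline = SpeakerDiarization(config)

# Stream from the microphone and inspect how long each chunk takes end to end.
inference = StreamingInference(pipeline, MicrophoneAudioSource())
prediction = inference()
```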

Hi @SheenChi, the values I reported were obtained from the output of diart.stream on my hardware: an AMD Ryzen 9 CPU and an Nvidia RTX 4060 Max-Q GPU.

If you find the model too slow on your hardware, you can try using pyannote/embedding, which is the fastest one. If that's still not enough, you could try quantizing a model you like or distilling it into a smaller one. Depending on your hardware, distillation would be my preferred first step, but it requires training.
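
As a cheap first experiment with quantization, you can apply dynamic (weight-only) INT8 quantization to the ONNX file with onnxruntime. This is only a sketch: the file paths are placeholders, and you should verify both the actual speedup on your CPU and that diarization quality holds up with the quantized embeddings.

```python
# Dynamic INT8 quantization of an ONNX embedding model with onnxruntime.
# Paths are placeholders; measure latency and embedding quality afterwards.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="wespeaker-voxceleb-resnet34-LM.onnx",        # original fp32 model
    model_output="wespeaker-voxceleb-resnet34-LM.int8.onnx",  # quantized output
    weight_type=QuantType.QInt8,                              # 8-bit weights, fp32 activations
)
```

You would then load the quantized file as the embedding model instead of the original checkpoint.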

For training, I recommend using pyannote.audio, as it's very reliable for this use case and gives you instant compatibility with diart.