juanmc2005 / diart

A python package to build AI-powered real-time audio applications

Home Page: https://diart.readthedocs.io


Speaker Identity Resolution

jfernandrezj opened this issue · comments

Thank you very much @juanmc2005 for this library, much much appreciated.
One question I have about speaker-aware transcription: could a custom plugin / observer / sink be implemented for speaker identity resolution, and what would be the best pattern to achieve it?
Ideally, on each buffer iteration or speaker change, a speaker identity prediction based on a model (probably backed by something like faiss / weaviate) could be added either to the RTTM output or to a separate file.

Any input would be much appreciated, thank you!

Hi @jfernandrezj, you could try recovering the internal speaker centroids of OnlineSpeakerClustering (its centers attribute) and matching them against other speakers as you mentioned. For this to work, you'd need to use the same embedding model that diart uses.
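As a rough illustration of the centroid-matching idea, here is a minimal sketch in plain numpy. It assumes you have already pulled the centroid matrix out of the clustering block (stood in for here by a plain list of vectors) and that you keep an enrollment database of reference embeddings computed with the same embedding model; the function name, the threshold value, and the toy 4-dim embeddings are all hypothetical:

```python
import numpy as np

def match_centroids(centers, enrolled, threshold=0.5):
    """Match anonymous diarization centroids to enrolled speakers.

    centers:  list of centroid vectors (stand-in for the clustering
              block's centers attribute, one per local speaker).
    enrolled: dict mapping known names to reference embeddings,
              computed with the SAME embedding model as the centroids.
    Returns a dict like {"speaker0": "John", ...}; labels whose best
    cosine similarity falls below the threshold are left anonymous.
    """
    names = list(enrolled)
    refs = np.stack([enrolled[n] for n in names])
    # L2-normalize so a dot product equals cosine similarity
    refs = refs / np.linalg.norm(refs, axis=1, keepdims=True)
    labels = {}
    for i, c in enumerate(centers):
        c = c / np.linalg.norm(c)
        sims = refs @ c
        j = int(np.argmax(sims))
        labels[f"speaker{i}"] = names[j] if sims[j] >= threshold else f"speaker{i}"
    return labels

# Toy example with 4-dim embeddings (real ones are much larger)
enrolled = {"John": np.array([1.0, 0.0, 0.0, 0.0]),
            "Mary": np.array([0.0, 1.0, 0.0, 0.0])}
centers = [np.array([0.9, 0.1, 0.0, 0.0]),   # close to John's reference
           np.array([0.0, 0.0, 1.0, 0.0])]   # no good match: stays anonymous
print(match_centroids(centers, enrolled))
```

A vector index such as faiss would replace the brute-force dot product here once the enrollment database grows, but the matching logic stays the same.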

If you want to use a different speaker matching method/model, you can always incorporate it into the pipeline, either replacing or complementing diart's speaker embedding block, but this could be quite expensive in terms of latency. Instead, I would suggest sending the audio to a separate speaker matching service and listening to it in order to label each speaker centroid at display time (e.g. speaker0 -> John).
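To make the display-time relabeling concrete, here is one possible sketch: once the external matching service has produced a mapping from anonymous labels to identities (the mapping below is invented for illustration), you can rewrite the speaker-name field of each RTTM line as it is written out. In the standard RTTM format the speaker label is the eighth whitespace-separated field:

```python
# Hypothetical mapping returned by a separate speaker-matching service
name_of = {"speaker0": "John", "speaker1": "Mary"}

def relabel_rttm(line, mapping):
    """Replace the anonymous speaker label (8th field of an RTTM line)
    with a resolved identity, leaving unmatched labels untouched."""
    fields = line.split()
    fields[7] = mapping.get(fields[7], fields[7])
    return " ".join(fields)

rttm_line = "SPEAKER file 1 0.00 1.50 <NA> <NA> speaker0 <NA> <NA>"
print(relabel_rttm(rttm_line, name_of))
```

Because the relabeling happens only when results are displayed or saved, it adds no latency to the diarization pipeline itself.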

Thank you very much @juanmc2005