Pyannote Audio Overlapped Speech Detection

Convert pyannote-audio's overlapped speech detection pipeline to C++.

The whole pipeline is split into two stages:

segment
binarize

Segment
Export the segmentation model from Python; it is later used for inference in C++ with onnxruntime.

Binarize
Convert all Python code to C++.
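As a rough illustration of what the binarize stage does, here is a minimal Python sketch of hysteresis thresholding over per-frame scores. It is a simplification written for this README, not the code that was ported: the function name `binarize`, the `frame_step` parameter, and the threshold values are assumptions, and pyannote-audio's own binarization has further options (minimum durations, padding) that are left out.

```python
from typing import List, Tuple


def binarize(scores: List[float], frame_step: float,
             onset: float = 0.5, offset: float = 0.5) -> List[Tuple[float, float]]:
    """Turn per-frame scores into (start, end) regions in seconds.

    A region opens when the score rises above `onset` and closes when it
    falls below `offset` (hysteresis thresholding).
    """
    regions = []
    active = False
    start = 0.0
    for i, score in enumerate(scores):
        t = i * frame_step
        if not active and score > onset:
            active, start = True, t
        elif active and score < offset:
            active = False
            regions.append((start, t))
    if active:
        # Close a region that is still open at the end of the signal.
        regions.append((start, len(scores) * frame_step))
    return regions


# Hypothetical scores, one per frame, with a frame step of 1 second.
scores = [0.1, 0.7, 0.9, 0.6, 0.2, 0.1, 0.8, 0.8]
regions = binarize(scores, frame_step=1.0, onset=0.6, offset=0.4)
print(regions)  # [(1.0, 4.0), (6.0, 8.0)]
```

Using a lower `offset` than `onset` keeps a region open through short dips in the score, which avoids splitting one overlap into many tiny segments.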

Model

Export model

segmentation

script: segment/export3.py

The script above exports the model and verifies it using example wav data.

Create a conda environment or use the Python venv module.

$> conda create --name sd_embeddings
$> conda activate sd_embeddings
$> cd segment
$> python export3.py

Dependencies

onnxruntime
liborch onnxruntime is used to run inference for the segmentation model.

Note: even though verification passes in the Python code (export2.py), there are still tiny differences between the results generated by Python onnxruntime and C++ onnxruntime. For embeddings, the absolute difference is 0.02 and the relative difference is 0.1. Because the clustering stage is very sensitive to tiny differences, this can lead to results that differ from those generated by pyannote-audio.
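To make the two figures in the note concrete, the sketch below computes the maximum absolute and maximum relative difference between two output vectors. It assumes the relative difference is defined as |reference - candidate| / |reference|; the note does not say which definition was used, and the function name and sample values are made up for illustration.

```python
from typing import Sequence, Tuple


def max_differences(reference: Sequence[float], candidate: Sequence[float],
                    eps: float = 1e-12) -> Tuple[float, float]:
    """Return (max absolute difference, max relative difference)."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    max_abs = 0.0
    max_rel = 0.0
    for r, c in zip(reference, candidate):
        abs_diff = abs(r - c)
        max_abs = max(max_abs, abs_diff)
        # eps guards against division by zero when the reference is 0.
        max_rel = max(max_rel, abs_diff / max(abs(r), eps))
    return max_abs, max_rel


# Hypothetical outputs from the Python and C++ runtimes.
python_out = [0.50, -0.20, 1.00]
cpp_out = [0.51, -0.20, 0.98]
max_abs, max_rel = max_differences(python_out, cpp_out)
print(max_abs, max_rel)
```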

Build

$> cd pipline
$> mkdir build && cd build
$> cmake ..
$> make

To build the GPU version:

$> cmake -DGPU=ON ..
$> make

Run

Run on CPU

$> ./olSpeechDetection [model] [wav file] 
$> ./olSpeechDetection ../model/segment.onnx ../data/multi-speaker_4-speakers_Jennifer_Aniston_and_Adam_Sandler_talk.wav

Run on GPU

$> ./olSpeechDetection [model] [wav file] GPU
$> ./olSpeechDetection ../model/segment.onnx ../data/multi-speaker_4-speakers_Jennifer_Aniston_and_Adam_Sandler_talk.wav GPU

Result

The first list below is our output; the second is the pyannote-audio output. The two match segment for segment.

[ 00:00:01.834 --> 00:00:02.022 ]
[ 00:00:10.571 --> 00:00:11.186 ]
[ 00:00:20.315 --> 00:00:21.117 ]
[ 00:00:26.749 --> 00:00:27.909 ]
[ 00:00:31.100 --> 00:00:31.970 ]
[ 00:00:35.554 --> 00:00:35.895 ]
[ 00:00:38.131 --> 00:00:38.421 ]
[ 00:00:39.257 --> 00:00:40.349 ]
[ 00:00:40.486 --> 00:00:45.691 ]
[ 00:01:04.769 --> 00:01:04.991 ]
[ 00:01:35.008 --> 00:01:35.401 ]
[ 00:01:36.424 --> 00:01:37.295 ]
[ 00:01:42.244 --> 00:01:43.831 ]
[ 00:01:45.025 --> 00:01:46.595 ]
[ 00:01:51.561 --> 00:01:52.534 ]
[ 00:02:00.383 --> 00:02:00.981 ]
[ 00:02:11.783 --> 00:02:12.346 ]
[ 00:02:16.868 --> 00:02:18.199 ]
[ 00:02:20.179 --> 00:02:20.708 ]
[ 00:02:28.011 --> 00:02:29.052 ]
[ 00:02:35.298 --> 00:02:36.612 ]
[ 00:03:10.315 --> 00:03:14.257 ]
[ 00:03:26.271 --> 00:03:26.561 ]
[ 00:03:27.500 --> 00:03:28.284 ]
[ 00:03:39.138 --> 00:03:39.360 ]
[ 00:03:41.220 --> 00:03:44.445 ]
[ 00:03:54.274 --> 00:03:54.650 ]
[ 00:03:57.687 --> 00:03:59.291 ]
[ 00:04:02.482 --> 00:04:03.626 ]
[ 00:04:06.783 --> 00:04:08.063 ]
[ 00:04:15.588 --> 00:04:18.233 ]
[ 00:04:18.933 --> 00:04:19.820 ]
[ 00:04:27.073 --> 00:04:30.674 ]
[ 00:04:39.581 --> 00:04:40.759 ]
[ 00:05:02.397 --> 00:05:02.755 ]
[ 00:05:04.018 --> 00:05:04.377 ]
[ 00:05:10.947 --> 00:05:12.602 ]
[ 00:05:13.046 --> 00:05:16.322 ]
[ 00:05:17.568 --> 00:05:18.831 ]
[ 00:05:20.639 --> 00:05:21.834 ]
[ 00:05:30.571 --> 00:05:31.151 ]
[ 00:05:33.114 --> 00:05:33.284 ]
[ 00:05:36.356 --> 00:05:39.854 ]
[ 00:05:40.571 --> 00:05:41.339 ]
[ 00:05:42.124 --> 00:05:43.779 ]
[ 00:05:44.837 --> 00:05:45.332 ]
[ 00:05:47.056 --> 00:05:47.926 ]
[ 00:05:50.281 --> 00:05:51.117 ]
[ 00:05:52.994 --> 00:05:54.854 ]
[ 00:06:19.155 --> 00:06:19.462 ]
[ 00:06:25.469 --> 00:06:26.237 ]
[ 00:06:27.807 --> 00:06:28.677 ]
[ 00:06:30.861 --> 00:06:31.493 ]
[ 00:06:32.175 --> 00:06:32.414 ]
[ 00:06:38.575 --> 00:06:39.257 ]
[ 00:06:43.216 --> 00:06:45.110 ]
[ 00:06:47.295 --> 00:06:48.404 ]
[ 00:07:07.278 --> 00:07:08.336 ]
[ 00:07:09.121 --> 00:07:09.991 ]
[ 00:07:10.571 --> 00:07:10.930 ]
[ 00:07:19.462 --> 00:07:19.889 ]
[ 00:07:25.230 --> 00:07:25.503 ]

[ 00:00:01.834 -->  00:00:02.022]
[ 00:00:10.571 -->  00:00:11.186]
[ 00:00:20.315 -->  00:00:21.117]
[ 00:00:26.749 -->  00:00:27.909]
[ 00:00:31.100 -->  00:00:31.970]
[ 00:00:35.554 -->  00:00:35.895]
[ 00:00:38.131 -->  00:00:38.421]
[ 00:00:39.257 -->  00:00:40.349]
[ 00:00:40.486 -->  00:00:45.691]
[ 00:01:04.769 -->  00:01:04.991]
[ 00:01:35.008 -->  00:01:35.401]
[ 00:01:36.424 -->  00:01:37.295]
[ 00:01:42.244 -->  00:01:43.831]
[ 00:01:45.025 -->  00:01:46.595]
[ 00:01:51.561 -->  00:01:52.534]
[ 00:02:00.383 -->  00:02:00.981]
[ 00:02:11.783 -->  00:02:12.346]
[ 00:02:16.868 -->  00:02:18.199]
[ 00:02:20.179 -->  00:02:20.708]
[ 00:02:28.011 -->  00:02:29.052]
[ 00:02:35.298 -->  00:02:36.612]
[ 00:03:10.315 -->  00:03:14.257]
[ 00:03:26.271 -->  00:03:26.561]
[ 00:03:27.500 -->  00:03:28.284]
[ 00:03:39.138 -->  00:03:39.360]
[ 00:03:41.220 -->  00:03:44.445]
[ 00:03:54.274 -->  00:03:54.650]
[ 00:03:57.687 -->  00:03:59.291]
[ 00:04:02.482 -->  00:04:03.626]
[ 00:04:06.783 -->  00:04:08.063]
[ 00:04:15.588 -->  00:04:18.233]
[ 00:04:18.933 -->  00:04:19.820]
[ 00:04:27.073 -->  00:04:30.674]
[ 00:04:39.581 -->  00:04:40.759]
[ 00:05:02.397 -->  00:05:02.755]
[ 00:05:04.018 -->  00:05:04.377]
[ 00:05:10.947 -->  00:05:12.602]
[ 00:05:13.046 -->  00:05:16.322]
[ 00:05:17.568 -->  00:05:18.831]
[ 00:05:20.639 -->  00:05:21.834]
[ 00:05:30.571 -->  00:05:31.151]
[ 00:05:33.114 -->  00:05:33.284]
[ 00:05:36.356 -->  00:05:39.854]
[ 00:05:40.571 -->  00:05:41.339]
[ 00:05:42.124 -->  00:05:43.779]
[ 00:05:44.837 -->  00:05:45.332]
[ 00:05:47.056 -->  00:05:47.926]
[ 00:05:50.281 -->  00:05:51.117]
[ 00:05:52.994 -->  00:05:54.854]
[ 00:06:19.155 -->  00:06:19.462]
[ 00:06:25.469 -->  00:06:26.237]
[ 00:06:27.807 -->  00:06:28.677]
[ 00:06:30.861 -->  00:06:31.493]
[ 00:06:32.175 -->  00:06:32.414]
[ 00:06:38.575 -->  00:06:39.257]
[ 00:06:43.216 -->  00:06:45.110]
[ 00:06:47.295 -->  00:06:48.404]
[ 00:07:07.278 -->  00:07:08.336]
[ 00:07:09.121 -->  00:07:09.991]
[ 00:07:10.571 -->  00:07:10.930]
[ 00:07:19.462 -->  00:07:19.889]
[ 00:07:25.230 -->  00:07:25.503]
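The two lists above differ slightly in whitespace, so a plain text diff reports mismatches even though the timestamps agree. A small parser that normalizes each segment to milliseconds makes the comparison exact. This is a sketch for checking results by hand, not part of the repository; the names `SEGMENT` and `parse_segments` are assumptions.

```python
import re
from typing import List, Tuple

# Matches "[ HH:MM:SS.mmm --> HH:MM:SS.mmm ]" with flexible whitespace.
SEGMENT = re.compile(
    r"\[\s*(\d+):(\d+):(\d+)\.(\d{3})\s*-->\s*(\d+):(\d+):(\d+)\.(\d{3})\s*\]"
)


def to_ms(hours: int, minutes: int, seconds: int, millis: int) -> int:
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def parse_segments(text: str) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) pairs for every segment found in text."""
    segments = []
    for match in SEGMENT.finditer(text):
        values = [int(group) for group in match.groups()]
        segments.append((to_ms(*values[:4]), to_ms(*values[4:])))
    return segments


ours = "[ 00:00:01.834 --> 00:00:02.022 ]\n[ 00:00:10.571 --> 00:00:11.186 ]"
theirs = "[ 00:00:01.834 -->  00:00:02.022]\n[ 00:00:10.571 -->  00:00:11.186]"
print(parse_segments(ours) == parse_segments(theirs))  # True
```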

Issues

When running on GPU, the code sleeps for 10 milliseconds before every inference; without this sleep, the final result is inaccurate. This does not apply when running on CPU. For a wav file of 7+ minutes, the sleep adds a total delay of 28 * 10 = 280 milliseconds.
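The workaround and its cost can be sketched as follows. The real code is C++; this Python version only illustrates the pattern, and `run_with_sleep`, `added_delay_ms`, and the stand-in `infer` callable are hypothetical names.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_sleep(batches: Iterable[T], infer: Callable[[T], R],
                   sleep_ms: int = 10) -> List[R]:
    """Call infer on each batch, pausing sleep_ms milliseconds before every call."""
    results = []
    for batch in batches:
        time.sleep(sleep_ms / 1000.0)
        results.append(infer(batch))
    return results


def added_delay_ms(num_inferences: int, sleep_ms: int = 10) -> int:
    """Total extra latency introduced by the sleeps."""
    return num_inferences * sleep_ms


# Stand-in for the real model call: doubles its input.
outputs = run_with_sleep(range(28), lambda x: x * 2)
print(len(outputs), added_delay_ms(28))  # 28 280
```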

Performance

The code is not fully optimized, and there are some memory leaks (see the comments in the code). There are also many STL container copies; one way to avoid them is to use raw pointers, for example for the audio data and the inference results.

Verification

Since the whole project translates the pyannote-audio speaker diarization pipeline from Python to C++, the strategy I adopted is to write the input and output of each small function in Python to a txt file, do the same in C++, and then load the txt files into Python to compare them and check the differences. The goal is for every input and output to be the same. The script script/verifyEveryStepResult.py was created for this purpose.

$> python verifyEveryStepResult.py

The command above compares the txt files generated in /tmp; the command below deletes all of those txt files.

$> python verifyEveryStepResult.py clean
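The step-by-step comparison described in this section can be sketched as below. This is not the content of verifyEveryStepResult.py: the `py_<step>.txt` / `cpp_<step>.txt` naming scheme, the whitespace-separated float format, and the tolerance are all assumptions made for the example.

```python
import os
import tempfile
from typing import Dict, List


def load_values(path: str) -> List[float]:
    """Read whitespace-separated floats from a txt dump."""
    with open(path) as f:
        return [float(token) for token in f.read().split()]


def files_match(py_path: str, cpp_path: str, atol: float = 1e-5) -> bool:
    a, b = load_values(py_path), load_values(cpp_path)
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))


def compare_dumps(directory: str, atol: float = 1e-5) -> Dict[str, bool]:
    """Compare every py_<step>.txt with its cpp_<step>.txt counterpart."""
    report = {}
    for name in sorted(os.listdir(directory)):
        if name.startswith("py_") and name.endswith(".txt"):
            step = name[len("py_"):-len(".txt")]
            cpp_path = os.path.join(directory, "cpp_" + step + ".txt")
            if os.path.exists(cpp_path):
                py_path = os.path.join(directory, name)
                report[step] = files_match(py_path, cpp_path, atol)
    return report


# Demo with made-up dumps in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    dumps = {
        "py_scores.txt": "0.1 0.2 0.3",
        "cpp_scores.txt": "0.1 0.2 0.3",
        "py_frames.txt": "1.0 2.0",
        "cpp_frames.txt": "1.0 2.5",
    }
    for name, content in dumps.items():
        with open(os.path.join(tmp, name), "w") as f:
            f.write(content)
    report = compare_dumps(tmp)
print(report)
```

Reporting per step rather than stopping at the first mismatch makes it easier to see where the C++ port first diverges from the Python reference.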
