WenwanChen / SocialAmbiance-speaker-count

PyTorch-based concurrent speaker counting from continuous recordings

SocialAmbiance-speaker-count

A PyTorch-based concurrent speaker count algorithm

Introduction: AmbianceCount is an automatic and objective method that extracts social ambiance from unconstrained audio recordings by estimating the number of concurrent speakers. AmbianceCount consists of a supervised deep neural network (DNN) embedding extractor that differentiates speech mixtures, and a scoring system that estimates the speaker count and improves generalization. Its performance is compared with a baseline and evaluated on several synthesized datasets. Lastly, we use AmbianceCount to evaluate data from a sociability pilot, with audio data from patients with depression and psychosis as well as age-matched healthy controls. Our analysis shows that the extracted social ambiance patterns differ significantly across the three groups. We also observe that the captured social ambiance patterns are associated with psychometric and personality scores, which is consistent with clinical diagnosis.

How to use AmbianceCount

Step 1: Remove silent segments using a pre-trained Voice Activity Detection (VAD) model: https://github.com/ina-foss/inaSpeechSegmenter
Suppose that the audio recordings of participant_id are stored in /dataroot/Baylor/id. The first step is to remove all the silent segments.
| runSegment.sh
|---- 2h.py // just to prevent stackoverflow
|---- inaSpeechSegmenter toolkit
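
To make the goal of this step concrete, here is a minimal Python sketch that drops silent spans from a segmentation given as (label, start, end) tuples in seconds, which is the shape of output inaSpeechSegmenter produces. The helper name `keep_non_silent`, the example data, and the assumption that silence is labelled `noEnergy` are illustrative; this is not the code in runSegment.sh.

```python
def keep_non_silent(segments, silent_labels=("noEnergy",)):
    """Return (start, end) spans, in seconds, whose label is not silence.

    `segments` is a list of (label, start, end) tuples. Kept spans that
    touch each other are merged into one continuous span.
    """
    kept = []
    for label, start, end in segments:
        if label in silent_labels:
            continue  # assumed silence label, skip it
        if kept and abs(kept[-1][1] - start) < 1e-9:
            kept[-1] = (kept[-1][0], end)  # extend the previous span
        else:
            kept.append((start, end))
    return kept


# Hypothetical segmentation of a 10 s recording.
example = [
    ("noEnergy", 0.0, 1.2),
    ("female", 1.2, 4.0),
    ("male", 4.0, 6.5),
    ("noEnergy", 6.5, 8.0),
    ("music", 8.0, 10.0),
]
print(keep_non_silent(example))  # prints [(1.2, 6.5), (8.0, 10.0)]
```

The returned spans can then be used to cut the original waveform, so that only non-silent audio is passed on to Step 2.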

Step 2: Prepare 5 s segments to feed into the neural network
| preparation.sh
|---- 5s.py
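
As a rough sketch of what cutting audio into 5 s segments involves, the function below splits a sequence of samples into non-overlapping fixed-length chunks. The function name, the 16 kHz default, and the choice to drop a trailing remainder shorter than 5 s are assumptions for illustration; 5s.py may handle these details differently.

```python
def split_fixed_length(samples, sample_rate=16000, seconds=5.0):
    """Cut `samples` into non-overlapping chunks of `seconds` each.

    A trailing remainder shorter than one chunk is dropped (assumption).
    """
    chunk = int(round(sample_rate * seconds))
    if chunk <= 0:
        raise ValueError("sample_rate * seconds must be positive")
    n_full = len(samples) // chunk
    return [samples[i * chunk:(i + 1) * chunk] for i in range(n_full)]


# 12 s of silence at 16 kHz as stand-in data: two full 5 s chunks fit.
audio = [0.0] * (16000 * 12)
chunks = split_fixed_length(audio)
print(len(chunks), len(chunks[0]))  # prints: 2 80000
```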

Step 3: Kaldi-format feature extraction
Acoustic features are extracted using the Kaldi toolkit, and deep embedding features are extracted using the trained neural networks.
| toKaldiFormat.sh
|---- model.py // models are defined based on https://github.com/Snowdar/asv-subtools/blob/master/pytorch/model/resnet-xvector.py

Step 4: Back-end scoring and domain adaptation
Using Kaldi, the extracted embeddings are scored and domain-adapted so that the algorithm generalizes better.
| backend.sh
|---- score.py

image

Step 5: Statistical analysis results:

  1. Social Ambiance Measure (SAM) Differences Between Groups
    Figure 2 illustrates that social ambiance patterns extracted from participants with depressive or psychotic disorders were significantly different from healthy controls.

Figure2

  2. Relationship Between Social Ambiance Measure (SAM) and Self-Reported Measures
    Social ambiance patterns, while linked to some personality traits for healthy controls, were found to be associated with psychometric scores for participants with depressive or psychotic disorders.

table

For more details, please refer to our paper:

Chen W, Sabharwal A, Taylor E, Patel AB and Moukaddam N (2021) Privacy-Preserving Social Ambiance Measure From Free-Living Speech Associates With Chronic Depressive and Psychotic Disorders. Front. Psychiatry 12:670020. doi: 10.3389/fpsyt.2021.670020.

How to prepare training data

Overlapped speech creation:

We use the LibriSpeech corpus to create overlapped speech. By combining three LibriSpeech subsets, clean-360, clean-100 and other-500, we get 960 hours of speech in utterances from 2338 speakers (1128 female speakers and 1210 male speakers), sampled at 16 kHz. Utterances were segmented when the silence intervals were longer than 0.3 seconds or coincided with sentence breaks in the reference text.

image

For each speaker, a 15-30 min recording is generated by concatenating LibriSpeech utterances of that speaker. Next, an audio segment is randomly selected from each recording, and segments are randomly adjusted in volume and speed to simulate how people speak in real-world scenarios.
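
The volume and speed adjustment can be sketched as follows. This is an illustration only: the function names, the gain range (0.5 to 1.5) and the speed range (0.9 to 1.1) are assumptions, and the speed change here is a plain linear-interpolation resampling, which also shifts pitch. The actual pipeline may use a different tool and different ranges.

```python
import random


def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 speeds up (shortens)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                    # position in the input signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def perturb(samples, rng, gain_range=(0.5, 1.5), speed_range=(0.9, 1.1)):
    """Apply a random speed change, then a random gain (assumed ranges)."""
    gain = rng.uniform(*gain_range)
    speed = rng.uniform(*speed_range)
    return [gain * x for x in change_speed(samples, speed)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
segment = [0.1 * (i % 10) for i in range(1000)]
adjusted = perturb(segment, rng)
```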

image

Finally, adjusted segments from K speakers are trimmed to T seconds and overlapped with each other to generate a speech mixture, labelled with speaker count of K.
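
A minimal sketch of the mixing step, assuming each speaker's adjusted segment is a list of float samples: every segment is trimmed to T seconds, the segments are summed sample by sample, and the mixture is labelled with K. The function name and the peak normalization that guards against clipping are assumptions, not taken from the repository.

```python
def mix_speakers(segments, sample_rate, seconds):
    """Trim each segment to `seconds`, sum them, and return (mixture, K)."""
    n = int(sample_rate * seconds)
    for seg in segments:
        if len(seg) < n:
            raise ValueError("every segment must be at least T seconds long")
    mixture = [0.0] * n
    for seg in segments:
        for i in range(n):
            mixture[i] += seg[i]
    # Assumed safeguard: rescale if the sum exceeds the [-1, 1] range.
    peak = max((abs(x) for x in mixture), default=0.0)
    if peak > 1.0:
        mixture = [x / peak for x in mixture]
    return mixture, len(segments)


# Toy example: K = 3 speakers, T = 1 s, tiny sample rate for readability.
speakers = [[0.1] * 6, [0.2] * 5, [0.3] * 4]
mixture, count = mix_speakers(speakers, sample_rate=4, seconds=1)
print(len(mixture), count)  # prints: 4 3
```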

Scenarios creation:

To cover different acoustic scenarios, I add three categories of sound effects using the Kaldi toolkit: background noises, foreground noises and reverberation.

image
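
For intuition about the additive-noise part of this augmentation, the sketch below scales a noise signal so that the speech-to-noise power ratio matches a target SNR in dB and then adds it to the speech. It is a standalone illustration, not the Kaldi implementation; the function name and example signals are hypothetical, and reverberation is not covered.

```python
import math


def add_noise(speech, noise, snr_db):
    """Add `noise` to `speech` at the requested signal-to-noise ratio (dB)."""
    if not speech:
        raise ValueError("speech must not be empty")
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0] * 4
noise = [0.5, 0.5, -0.5, -0.5] * 2
noisy = add_noise(speech, noise, snr_db=10)
```

A lower `snr_db` gives a louder noise component; at 0 dB the scaled noise has the same average power as the speech.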

For more details about model training, please refer to my thesis:

AmbianceCount: An Objective Social Ambiance Measure from Unconstrained Day-long Audio Recordings.
