
Single- and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features

License: BSD 3-Clause | Python 3.8.0 | Code style: black

This is the repository for the paper "Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features", submitted to the 2023 IEEE International Workshop on Information Forensics and Security (WIFS 2023).

The provided source code includes implementations of both the single-speaker and multi-speaker pipelines. However, please note that the dataset used in the experiments is not included in this repository. To replicate the experiments, you would need to create an analogous experimental dataset with cloned voices using different voice cloning architectures or providers.

The repository does provide code for data generation and adversarial laundering, tailored to ElevenLabs as an example provider. You will need to generate features from your analogous dataset and save them to disk, and modify the relevant data-handling code so that the pipeline runs against the new dataset.
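
For illustration, spectral features for a new dataset could be generated and cached to disk roughly as follows. This is a minimal sketch, not the repository's SmileFeatureGenerator; the ComParE_2016 feature set, directory names, and output path are assumptions.

```python
# Minimal sketch of generating spectral features and saving them to disk.
# Not the repository's SmileFeatureGenerator: the ComParE_2016 feature set,
# directory names, and output path are illustrative assumptions.
from pathlib import Path

import opensmile
import pandas as pd

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

wav_dir = Path("data/my_cloned_voices")   # hypothetical dataset location
frames = [smile.process_file(str(wav)) for wav in sorted(wav_dir.glob("*.wav"))]
features = pd.concat(frames)              # one row of functionals per clip

out_dir = Path("features")
out_dir.mkdir(exist_ok=True)
features.to_csv(out_dir / "smile_functionals.csv")
```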

Please refer to the repository and the paper for more detailed instructions on how to use the code and conduct the experiments.

Folder Structure

The repository is structured as follows:

Folder File Description
Experiment Pipeline
/src/ run_pipeline_ljspeech.py Runs the pipeline for single voice (LJSpeech) experiments
/src/ run_pipeline_multivoice.py Runs the pipeline for multivoice experiments
/src/packages/ ExperimentPipeline.py Class for running the experiment_pipeline and logging results
/src/packages/ ModelManager.py Class for managing the final classification models
Feature Generation
/src/packages/ AudioEmbeddingsManager.py Class for managing learned features generated using NVIDIA TitaNet
/src/packages/ SmileFeatureManager.py Class for managing spectral features generated using openSMILE
/src/packages/ SmileFeatureGenerator.py Class for generating spectral features and saving to disk for collections of audio files
/src/packages/ SmileFeatureSelector.py Class for selecting spectral features using sklearn.feature_selection
/src/packages/ CadenceModelManager.py Class for managing perceptual features generated using handcrafted techniques
/src/packages/ CadenceUtils.py Utility functions used by CadenceModelManager for generating features
/src/packages/ BayesSearch.py A class that implements Bayesian hyperparameter optimization for the perceptual model
/src/packages/ SavedFeatureLoader.py Helper function for loading the generated features saved to disk during experiments
Data Loaders
/src/packages/ LJDataLoader.py Class for loading and handling the LJSpeech data for experiments
/src/packages/ TIMITDataLoader.py Class for loading and handling the TIMIT data for multi-voice experiments
Data Generation
/src/packages/ BaseDeepFakeGenerator.py Base class for processing data used for voice cloning
/src/packages/ ElevenLabsDeepFakeGenerator.py Class used to generate deepfakes using the ElevenLabs API
/src/packages/ AudioManager.py Class for resampling audio files and performing adversarial laundering
Misc
. README.md Provides an overview of the project
. conda_requirements.txt Dependencies for creating the conda environment
. pip_requirements.txt Dependencies installed with pip
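
For the learned features managed by AudioEmbeddingsManager.py, the sketch below shows one way a TitaNet speaker embedding can be extracted with the NeMo toolkit; the pretrained model name and file paths are illustrative assumptions, not the repository's exact code.

```python
# Minimal sketch of extracting a learned speaker embedding with NVIDIA TitaNet
# via the NeMo toolkit, in the spirit of AudioEmbeddingsManager.py.
# The pretrained model name and file paths are illustrative assumptions.
import numpy as np
from nemo.collections.asr.models import EncDecSpeakerLabelModel

model = EncDecSpeakerLabelModel.from_pretrained(model_name="titanet_large")
model.eval()

# get_embedding returns a torch tensor with one fixed-size embedding per clip
embedding = model.get_embedding("data/my_cloned_voices/clip_0001.wav")
np.save("features/clip_0001_titanet.npy", embedding.squeeze().detach().cpu().numpy())
```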

Data

An overview of the real and synthetic datasets used in our single-speaker (top) and multi-speaker (bottom) evaluations. The 91,700 WaveFake samples correspond to 13,100 samples for each of seven different vocoder architectures, hence the larger number of clips and total duration.

Single-speaker

Type Name Clips (#) Duration (sec)
Real LJSpeech 13,100 86,117
Synthetic WaveFake 91,700 603,081
Synthetic ElevenLabs 13,077 78,441
Synthetic Uberduck 13,094 83,322

Multi-speaker

Type Name Clips (#) Duration (sec)
Real TIMIT 4,620 14,192
Synthetic ElevenLabs 5,499 15,413

Publicly Available Data

  1. The LJ Speech 1.1 Dataset -- Data
  2. WaveFake: A Data Set to Facilitate Audio Deepfake Detection -- Paper, Data
  3. TIMIT Acoustic-Phonetic Continuous Speech Corpus -- Data

Commercial Voice Cloning Tools

  1. ElevenLabs (EL) -- https://beta.elevenlabs.io/
  2. UberDuck (UD) -- https://app.uberduck.ai/

Results

Single-speaker

Accuracies for a personalized, single-speaker classification of unlaundered audio (top) and audio subjected to adversarial laundering in the form of additive noise and transcoding (bottom). Dataset corresponds to ElevenLabs (EL), UberDuck (UD), and WaveFake (WF); Model corresponds to a linear (L) or non-linear (NL) classifier, used either as a single classifier (real vs. synthetic) or a multi-class classifier (real vs. specific synthesis architecture). Accuracy (%) is reported for synthetic and real audio, and the equal error rate (EER) is also reported for the single classifiers.

Synthetic Accuracy (%) Real Accuracy (%) EER (%)
Dataset Model Learned Spectral Perceptual Learned Spectral Perceptual Learned Spectral Perceptual
Unlaundered
Binary
EL single (L) 100.0 99.2 78.2 100.0 99.9 72.5 0.0 0.5 24.9
single (NL) 100.0 99.9 82.2 100.0 100.0 80.4 0.0 0.1 18.6
UD single (L) 99.8 98.9 51.9 99.9 98.9 54.0 0.1 1.1 47.2
single (NL) 99.7 99.2 54.4 99.9 99.0 56.5 0.2 0.9 44.5
WF single (L) 96.5 78.4 57.8 97.1 82.3 45.6 3.3 19.7 48.5
single (NL) 94.5 87.6 50.3 96.7 90.2 52.7 4.4 11.2 48.6
EL+UD single (L) 99.7 94.8 63.4 99.9 97.1 60.3 0.2 4.2 37.9
single (NL) 99.7 99.2 57.3 99.9 99.6 69.0 0.2 0.8 37.6
EL+UD+WF single (L) 93.2 79.7 58.4 98.7 93.0 57.6 3.6 15.9 42.1
single (NL) 91.2 90.6 53.1 99.0 94.1 64.7 4.1 7.9 41.6
Multiclass
EL+UD multi (L) 99.9 96.6 61.0 100.0 94.6 35.7 - - -
multi (NL) 99.7 98.3 65.6 100.0 97.2 43.2 - - -
EL+UD+WF multi (L) 98.8 80.2 45.1 97.3 64.3 22.9 - - -
multi (NL) 98.1 94.2 48.6 96.3 84.4 27.6 - - -
Laundered
Binary
EL single (L) 95.5 94.3 61.1 94.5 92.6 65.2 4.9 6.7 36.6
single (NL) 96.0 96.2 70.4 95.4 95.6 69.6 4.1 4.1 30.1
UD single (L) 95.4 81.1 61.4 91.8 84.3 44.7 6.3 17.3 46.7
single (NL) 95.4 86.8 52.9 93.3 86.1 55.9 5.5 13.6 45.6
WF single (L) 87.6 60.7 59.6 85.0 70.4 42.5 13.9 34.4 49.4
single (NL) 83.6 77.1 51.4 85.6 76.7 53.9 15.3 23.1 47.3
EL+UD single (L) 95.2 79.1 54.0 91.7 78.4 59.8 6.2 21.3 43.1
single (NL) 94.8 86.1 55.2 93.3 90.0 62.4 6.0 12.0 41.4
EL+UD+WF single (L) 83.7 70.9 50.6 88.6 72.9 59.7 13.2 28.2 44.8
single (NL) 83.4 79.2 53.0 90.7 85.1 60.7 12.5 17.9 43.6
Multiclass
EL+UD multi (L) 94.2 85.6 50.9 91.0 77.1 29.1 - - -
multi (NL) 94.5 91.7 53.2 90.3 82.9 41.3 - - -
EL+UD+WF multi (L) 89.8 65.4 35.3 83.1 44.3 26.2 - - -
multi (NL) 88.8 78.8 39.8 82.1 63.0 28.6 - - -
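
The laundered results above correspond to audio degraded by additive noise and lossy transcoding. The sketch below illustrates these two operations in spirit only; the 20 dB SNR, MP3 codec, and file names are assumptions, and the repository's AudioManager may use different parameters.

```python
# Minimal sketch of the two laundering operations (additive noise, then lossy
# transcoding). The 20 dB SNR, MP3 codec, and file names are assumptions;
# the repository's AudioManager may use different parameters.
import subprocess

import numpy as np
import soundfile as sf

audio, sr = sf.read("clip_0001.wav")

# Additive Gaussian noise at an assumed target SNR of 20 dB
snr_db = 20.0
signal_power = np.mean(audio ** 2)
noise_power = signal_power / (10 ** (snr_db / 10))
noisy = audio + np.sqrt(noise_power) * np.random.randn(*audio.shape)
sf.write("clip_0001_noisy.wav", noisy, sr)

# Lossy transcoding (requires ffmpeg on the PATH)
subprocess.run(
    ["ffmpeg", "-y", "-i", "clip_0001_noisy.wav",
     "-codec:a", "libmp3lame", "-qscale:a", "2", "clip_0001_laundered.mp3"],
    check=True,
)
```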

Multi-speaker

Accuracies for a non-personalized, multi-speaker classification of unlaundered audio. Dataset corresponds to ElevenLabs (EL); Model corresponds to a linear (L) or non-linear (NL) classifier, used either as a single classifier (real vs. synthetic) or a multi-class classifier (real vs. specific synthesis architecture). Accuracy (%) is reported for synthetic and real audio, and the equal error rate (EER) is also reported for the single classifiers.

Synthetic Accuracy (%) Real Accuracy (%) EER (%)
Dataset Model Learned Spectral Perceptual Learned Spectral Perceptual Learned Spectral Perceptual
EL single (L) 100.0 94.2 83.8 99.9 98.3 86.9 0.0 3.0 1.3
single (NL) 92.3 96.3 82.2 100.0 99.7 87.7 0.1 1.6 1.4
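
The EER reported for the single (binary) classifiers can be computed from classifier scores in the standard way; the sketch below shows one common implementation, not necessarily the repository's exact code.

```python
# Standard way to compute the equal error rate (EER) from binary classifier
# scores; a common sketch, not necessarily the repository's exact code.
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(y_true, scores):
    """EER: the operating point where false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2


# Toy example: labels (0 = real, 1 = synthetic) and classifier scores
y = np.array([0, 0, 1, 1])
s = np.array([0.10, 0.40, 0.35, 0.80])
print(f"EER = {100 * equal_error_rate(y, s):.1f}%")
```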

Research Group

School of Information and Electrical Engineering and Computer Sciences at the University of California, Berkeley

This work was partially funded by a grant from the UC Berkeley Center for Long-Term Cybersecurity (CLTC), an award for open-source innovation from the Digital Public Goods Alliance and the United Nations Development Programme, and an unrestricted gift from Meta.

Citation

Please cite the following paper if you use this code:

@misc{barrington2023single,
      title={Single and Multi-Speaker Cloned Voice Detection: From Perceptual to Learned Features}, 
      author={Sarah Barrington and Romit Barua and Gautham Koorma and Hany Farid},
      year={2023},
      eprint={2307.07683},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

License

BSD 3-Clause "New" or "Revised" License

