🚀 SuperVAD dataset

This repository contains one million 5-second segments of augmented voice and noise combinations, with labels.

Dataset

Duration: 2M files of 5 seconds each, plus 50k test files
Audio format: WAV files with a 16 kHz sampling rate and 16-bit depth
Signal-to-noise ratio (SNR): from 3 to 30 dB
Source voice speedup or slowdown: factor from 0.8 to 1.5
Synthetic and Real Room Impulse Response (RIR) reverberation
Half of the samples are passed through an encoding codec: low/high quality MP3, G2.111
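To make the SNR range above concrete, here is a minimal sketch of mixing a voice signal with noise at a target SNR. It is not taken from synthesize.py; the function names `rms` and `mix_at_snr` are hypothetical, and plain Python lists are used only to keep the example dependency-free (the actual pipeline presumably works on torch tensors).

```python
import math


def rms(samples):
    """Root-mean-square level of a non-empty sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(voice, noise, snr_db):
    """Scale `noise` so that the voice-to-noise ratio equals `snr_db`, then add it to `voice`."""
    if len(voice) != len(noise):
        raise ValueError("voice and noise must have the same length")
    noise_rms = rms(noise)
    if noise_rms == 0.0:
        return list(voice)  # silent noise: nothing to mix in
    # SNR_dB = 20 * log10(rms(voice) / rms(scaled_noise)), solved for the noise level
    target_noise_rms = rms(voice) / (10.0 ** (snr_db / 20.0))
    gain = target_noise_rms / noise_rms
    return [v + gain * n for v, n in zip(voice, noise)]
```

With `snr_db=3` the noise ends up only slightly quieter than the voice; with `snr_db=30` it is barely audible. A real mixer would also need to guard against clipping after the sum.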

I am also publishing the source files used for mixing. They are all WAV files with a 16 kHz sampling rate:

v2 - filtered out some files with overly loud background voices, removed some songs that also contained voice
v1 - initial release
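The source files are described as 16 kHz WAV, so a quick sanity check after downloading can catch files with an unexpected format. The sketch below uses only Python's standard `wave` module, which reads uncompressed PCM WAV; the helper names `describe_wav` and `has_expected_rate` are hypothetical and not part of this repository.

```python
import wave

EXPECTED_RATE = 16_000  # 16 kHz, as stated for the source files


def describe_wav(path):
    """Return (sample_rate_hz, bit_depth, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        duration = wav.getnframes() / rate
    return rate, bits, duration


def has_expected_rate(path):
    """True if the file's sample rate matches the stated 16 kHz."""
    return describe_wav(path)[0] == EXPECTED_RATE
```

For a 5-second dataset segment, `describe_wav` should report 16000 Hz, 16 bits, and a duration of 5.0 seconds (80,000 frames).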

Musan (CC BY 4.0) - Clean Voice and Noises
SLR26 (CC BY 4.0) - Synthetic RIR
SLR28 (Apache 2.0) - Real RIR
VOiCES (CC BY 4.0) - Clean Voice, Noises and RIR
DNS-4 (Public Domain/CC BY 4.0/Attr) - Clean Voice and Noises
Realistic urban sound mixture dataset (CC BY 4.0) - Noises
Common Voice 16.0 (Mozilla Public License 2.0) - Unused for now

Caution

Downloading and synthesizing the dataset requires about 8 TB of disk space and takes several hours to download, unpack, and synthesize.
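Given the space requirement, it may be worth checking free disk space before starting. This is a minimal sketch using the standard library; the function name `has_enough_space` is hypothetical, and reading "about 8TB" as 8 × 10^12 bytes is an assumption.

```python
import shutil

# "About 8TB" from this README, interpreted as decimal terabytes (assumption).
REQUIRED_BYTES = 8 * 10**12


def has_enough_space(path=".", required_bytes=REQUIRED_BYTES):
    """True if the filesystem containing `path` has at least `required_bytes` free."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes
```

Call it with the directory where the dataset will be downloaded and synthesized, for example `has_enough_space("/data")`, where `/data` is a placeholder path.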

To download the source datasets, run the download.sh script. This script requires aria2.

./download.sh

The scripts have only a few dependencies, which you probably already have installed.

pip install tqdm torch torchaudio soundfile

Before synthesizing the dataset, you need to prepare the source datasets. To do so, run the prepare.py script.

python3 prepare.py

To synthesize the dataset, run the synthesize.py script.

python3 synthesize.py

To package the dataset, you need tar and pigz installed.

./pack.sh
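The steps above can be chained in a small driver that first checks for the required command-line tools. This is only a sketch, not a script shipped with the repository: `missing_tools` and `run_pipeline` are hypothetical names, and the tool list assumes that aria2 is invoked through its `aria2c` executable.

```python
import shutil
import subprocess

# Command-line tools named in this README; aria2 installs the "aria2c" executable.
REQUIRED_TOOLS = ["aria2c", "tar", "pigz"]

# The steps in the order this README lists them.
PIPELINE = [
    ["./download.sh"],
    ["python3", "prepare.py"],
    ["python3", "synthesize.py"],
    ["./pack.sh"],
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def run_pipeline(steps=PIPELINE):
    """Run each step in order, stopping at the first one that fails."""
    for command in steps:
        subprocess.run(command, check=True)
```

A typical use would be to call `missing_tools()` first, install anything it reports, and only then call `run_pipeline()` from the repository root.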

CC BY 4.0