speech-datasets

Various speech datasets made available to the public.

Release Notes

`2023012`

rttms - added RTTM files to evaluate diarization (DER and possibly other metrics)

`202309`

longform-reconstitution: Added long-form data described in https://arxiv.org/abs/2309.15013

`202206`

earnings-21: Updated the some reference transcripts with some errors identified as part of our routine testing.
- Diff: +44 −45
- Modified 8 nlp files and 1 norm.json file
earnings-22: No changes since release of 202204

Datasets
How to Check Out Only a Single Dataset

Datasets

In each dataset, the most up-to-date version of the dataset will always be in the main branch. Any suggested improvements should be pull-requested off of the develop branch.

Dataset	Description
`earnings21`	This dataset contains 44 files totalling roughly 39 hours of earnings calls from the year 2020. This dataset provides the full audios, the transcripts, and accompanying metadata such as speaker labels, punctuation, and entity tags.
`earnings22`	This dataset contains 125 files totalling roughly 119 hours of English language earnings calls from global countries. This dataset provides the full audios, transcripts, and accompanying metadata such as ticker symbol, headquarters country, and our defined "Language Region".
`longform-reconstitution`	Long-form versions of the Gigaspeech, TED-LIUM, and VoxPopuli-en corpora. See https://arxiv.org/abs/2309.15013 for details

How to Check Out Only a Single Dataset

As we grow this repository, we expect that it will grow in size and will make pulling the whole repository difficult. In order to download only one dataset we recommend following this post on StackOverflow. We outline the process here for ease of use.

Ensure that you have >= git 2.30.0
Run the following commands such that ${DATASET_NAME} is set to the dataset directory you want to use.

git clone --depth 1  --filter=blob:none  --sparse https://github.com/revdotcom/speech-datasets.git
cd speech-datasets
git sparse-checkout init --cone
git sparse-checkout set ${DATASET_NAME}

e.g.

git clone --depth 1  --filter=blob:none  --sparse https://github.com/revdotcom/speech-datasets.git
cd speech-datasets
git sparse-checkout init --cone
git sparse-checkout set earnings21

These commands will only checkout the earnings21 dataset.

Github Large File Storage (LFS)

Due to the length of some media files in our corpora, we have added Github's LFS to allow us to upload large files for ease of download by users.

The impact is a few added steps to be able to access these files.

Affected Datasets

earnings22
longform-reconstitution

Steps to Download from LFS

The first step is to download and install Git LFS onto your machine. We recommend following Github's step-by-step instructions found here
Run the following commands from the main directory of speech-datasets:

cd ${DATASET_NAME}
git lfs pull