JTubeSpeech+: Corpus of Japanese speech collected from YouTube

This is a fork of https://github.com/sarulab-speech/jtubespeech

This repository provides 1) a list of YouTube videos with Japanese subtitles (JTubeSpeech), 2) scripts for making lists for new languages, and 3) tiny lists for other languages.

Description

data/{lang}/{YYYYMM}.csv lists videos in the following format. See step4 for how to download them.

	videoid	auto	sub	channelid
0	0017RsBbUHk	True	True	UCTW2tw0Mhho72MojB1L48IQ
1	00PqfZgiboc	False	True	UCzoghTgl4dvIW9GZF6UC-BA
---	---	---	---	---

lang: Language ID (ja [Japanese], en [English], ...)
YYYYMM: Year and month when the data was collected
videoid: YouTube video ID. Its YouTube page is https://www.youtube.com/watch?v={videoid}.
auto: Whether the video has automatic subtitles.
sub: Whether the video has manual (i.e., human-generated) subtitles.
channelid: YouTube Channel ID. Its YouTube page is https://www.youtube.com/channel/{channelid}.
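
The column layout above can be read with Python's standard csv module. The following is a minimal sketch, not part of the package: it assumes the file is comma-separated, that the unnamed first column is a row index, and that the flags are stored as the strings True and False. The function name read_video_list is hypothetical.

```python
import csv
import io


def read_video_list(csv_text):
    """Parse a video list and return one dict per video.

    Assumptions (not confirmed by the repository): comma-separated text,
    header ",videoid,auto,sub,channelid" with an unnamed index column,
    and booleans written as "True"/"False".
    """
    videos = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        videos.append({
            "videoid": row["videoid"],
            "auto": row["auto"] == "True",
            "sub": row["sub"] == "True",
            "channelid": row["channelid"],
            # URL patterns as given in the field descriptions above.
            "video_url": "https://www.youtube.com/watch?v=" + row["videoid"],
            "channel_url": "https://www.youtube.com/channel/" + row["channelid"],
        })
    return videos


# The two example rows shown above, written as comma-separated text.
sample = """,videoid,auto,sub,channelid
0,0017RsBbUHk,True,True,UCTW2tw0Mhho72MojB1L48IQ
1,00PqfZgiboc,False,True,UCzoghTgl4dvIW9GZF6UC-BA
"""

for video in read_video_list(sample):
    print(video["video_url"], video["auto"], video["sub"])
```

For a real list, pass the contents of data/{lang}/{YYYYMM}.csv instead of the sample string.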

Statistics

lang	filename (data/)	#videos-sub-true	#videos-auto-true
ja	ja/202103.csv	110,000 (10,000 hours)	4,960,000
en	en/202108_tiny.csv	74,227	65,570
zh	zh/202108_tiny.csv	63,126	23,387
th	th/202108_tiny.csv	40,886	26,907
ru	ru/202108_tiny.csv	39,890	46,061
hi	hi/202108_tiny.csv	34,034	31,439
ar	ar/202108_tiny.csv	31,993	42,649
de	de/202108_tiny.csv	30,727	66,954
tr	tr/202108_tiny.csv	27,317	68,079
el	el/202108_tiny.csv	25,947	26,735
fr	fr/202108_tiny.csv	25,371	70,466
ta	ta/202108_tiny.csv	21,860	26,120
da	da/202108_tiny.csv	18,779	62,094
id	id/202108_tiny.csv	18,086	72,760
bn	bn/202108_tiny.csv	16,315	57,112
fi	fi/202108_tiny.csv	15,561	50,626
my	my/202108_tiny.csv	14,729	95,755
hu	hu/202108_tiny.csv	13,154	49,237
te	te/202108_tiny.csv	11,929	24,444
pt	pt/202108_tiny.csv	11,692	48,974
az	az/202108_tiny.csv	11,188	52,025
ur	ur/202108_tiny.csv	10,917	26,503
is	is/202108_tiny.csv	10,632	38,268
fa	fa/202108_tiny.csv	10,482	24,102
ka	ka/202108_tiny.csv	10,395	23,914
uk	uk/202108_tiny.csv	9,103	36,392
ml	ml/202108_tiny.csv	9,080	42,359
ga	ga/202108_tiny.csv	9,058	51,411
be	be/202108_tiny.csv	7,622	37,739
ky	ky/202108_tiny.csv	7,241	42,027
kk	kk/202108_tiny.csv	6,917	26,163
tg	tg/202108_tiny.csv	5,491	40,244

Contributors

Shinnosuke Takamichi (The University of Tokyo, Japan) [main contributor]
Ludwig Kürzinger (Technical University of Munich, Germany)
Takaaki Saeki (The University of Tokyo, Japan)
Sayaka Shiota (Tokyo Metropolitan University, Japan)
Shinji Watanabe (Carnegie Mellon University, USA)

Docker

docker pull cadic/jtubespeechp
docker run --rm --name jtubespeechp -v /FileStore:/Filestore -it cadic/jtubespeechp

Scripts for data collection

jtubespeechp is a package for collecting data from YouTube. Because the scripts are language-independent, users can collect data for any language they like. youtube-dl and ffmpeg are required.

step1: making search words

The module jtubespeechp/search downloads the Wikipedia dump file and extracts words to use as search queries for videos. {lang} is the language code, e.g., ja (Japanese) and en (English).

$ python -m jtubespeechp.search {lang}

step2: obtaining video IDs

The module jtubespeechp/video_id obtains YouTube video IDs by searching YouTube with the words. {filename_word_list} is the word list file made in step1. From this step onward, processing takes a long time, so it is recommended to split the input files (e.g., {filename_word_list}) and run them in parallel.

$ python -m jtubespeechp.video_id {lang} {filename_word_list}
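
Splitting the word list for parallel runs could look like the sketch below. It is not part of the package: the helper name split_word_list and the .partNN naming are made up for illustration, and it assumes the word list has one search word per line.

```python
from pathlib import Path


def split_word_list(path, n_parts):
    """Split a one-word-per-line file into n_parts files and return their paths.

    Assumption: the word list made in step1 has one search word per line.
    """
    path = Path(path)
    words = [w for w in path.read_text(encoding="utf-8").splitlines() if w.strip()]
    part_paths = []
    for i in range(n_parts):
        # Round-robin assignment keeps the part sizes within one word of each other.
        part = words[i::n_parts]
        part_path = path.with_name(f"{path.stem}.part{i:02d}{path.suffix}")
        part_path.write_text("".join(w + "\n" for w in part), encoding="utf-8")
        part_paths.append(part_path)
    return part_paths
```

Each returned file can then be passed as {filename_word_list} to a separate run of the command above. Before running in parallel, check where the module writes its output so that the runs do not overwrite each other.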

step3: checking if subtitles are available

The module jtubespeechp/subtitles checks whether each video has subtitles. {filename_videoid_list} is the video ID list file made in step2. This process produces a CSV file.

$ python -m jtubespeechp.subtitles {lang} {filename_videoid_list}

step4: downloading videos with manual subtitles

The module jtubespeechp/download downloads audio and manual subtitles. Note that this process requires a very large amount of storage. {filename_subtitle_list} is the subtitle list file made in step3. The audio and subtitles will be saved in video/{lang}/wav16k and video/{lang}/txt, respectively.

$ python -m jtubespeechp.download {lang} {filename_subtitle_list}
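
After a download, it can be useful to check that every audio file has a matching subtitle file. The sketch below is one way to do that under assumptions the README does not confirm: that files are named after the video ID, that audio files end in .wav and subtitle files in .txt, and that they may sit in nested subdirectories. The function name pair_audio_and_subtitles is hypothetical.

```python
from pathlib import Path


def pair_audio_and_subtitles(wav_dir, txt_dir):
    """Match audio and subtitle files by file stem.

    Returns (paired, unpaired): paired maps stem -> (wav_path, txt_path),
    unpaired lists stems found in only one of the two directories.
    Assumption: both files of a video share the same stem (the video ID).
    """
    wavs = {p.stem: p for p in Path(wav_dir).rglob("*.wav")}
    txts = {p.stem: p for p in Path(txt_dir).rglob("*.txt")}
    paired = {
        stem: (wavs[stem], txts[stem])
        for stem in sorted(wavs.keys() & txts.keys())
    }
    unpaired = sorted(wavs.keys() ^ txts.keys())
    return paired, unpaired
```

For example, pair_audio_and_subtitles("video/ja/wav16k", "video/ja/txt") would report which Japanese downloads are complete, provided the naming assumptions hold.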

step5 (ASR): alignment and scoring

Subtitles are not always correctly aligned with the audio, and in some cases the subtitles do not match the audio. The script jtubespeechp/align aligns subtitles and audio with CTC segmentation using an ESPnet 2 ASR model:

$ python -m jtubespeechp.align {asr_train_config} {asr_model_file} {wavdir} {txtdir} {output_dir}

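The result is written to a segments file segments.txt and a log file segments.log in the output directory. Using the segments file, bad utterances or audio files can be filtered out, for example by keeping only the lines whose confidence score (the fifth field) exceeds a threshold: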

min_confidence_score=-0.3
awk -v ms=${min_confidence_score} '{ if ($5 > ms) {print} }' ${output_dir}/segments.txt
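
The same filter can be written in Python when the segments need further processing. This sketch roughly mirrors the awk command above: it keeps lines whose fifth whitespace-separated field is greater than the threshold. Unlike awk, it skips lines that are too short or whose fifth field is not a number. The function name filter_segments and the sample lines are made up for illustration and are not real alignment output.

```python
def filter_segments(lines, min_confidence_score=-0.3):
    """Keep lines whose fifth whitespace-separated field exceeds the threshold."""
    kept = []
    for line in lines:
        fields = line.split()
        try:
            score = float(fields[4])  # fifth field, i.e. $5 in awk
        except (IndexError, ValueError):
            continue  # skip blank, short, or non-numeric lines
        if score > min_confidence_score:
            kept.append(line.rstrip("\n"))
    return kept


# Made-up lines for illustration only; only the position of the score matters.
sample_lines = [
    "utt_0000 rec_a 0.26 1.73 -0.0154 first sentence\n",
    "utt_0001 rec_a 1.73 3.19 -0.7674 second sentence\n",
]

for line in filter_segments(sample_lines):
    print(line)
```

To filter a real file, open ${output_dir}/segments.txt and pass the file object as lines.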
