Any more information about the structure of the folder
samarth12 opened this issue · comments
I have been looking at some of the smaller datasets that are available, for example public_youtube700_val and public_lecture_1. I would like to learn more about what the structure of the folder means.
For public_youtube700_val I see folders named 0-9 and a-f, each with several subfolders. What do these numbers and letters mean? Are the audio files in each of these top-level root folders from different YouTube videos, or are they from the same video? Is there a way to determine this?
Hi!
This is what you are looking for
https://github.com/snakers4/open_stt#on-disk-db-methodology
Hi @snakers4, thanks for getting back to me. But I do not understand what this script you pointed me to does. I am not sure how to run that code chunk either, what is wav?
If I only want to use the transcripts, I guess my question is: in case I need to use public_youtube700_val, how should I combine the contents of the folders and, more importantly, what can be considered as a single file/video content after combining the folders?
Can each subfolder in the root folder that I just downloaded for public_youtube700_val be considered as an independent youtube content file? Up to which level can I merge the individual transcript .txt files and still treat the whole chunk as a single YouTube file?
What this script you pointed me to does?
Calculates a path of a given file
I am not sure how to run that code chunk either, what is wav?
Wav is a file format
Hashes were calculated for wav files before converting to opus
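For readers unsure what "calculates a path of a given file" means in practice, here is a minimal sketch of a hash-based path scheme. It is not taken from the linked README: the function name `hashed_store_path`, the choice of SHA-1, and the slice lengths (1 character for the top-level folder, 2 for the subfolder, 12 for the file name) are assumptions made for illustration. The one thing it does reflect from this thread is that a hex digest only contains the characters 0-9 and a-f, which would explain the top-level folder names.

```python
import hashlib
from pathlib import Path


def hashed_store_path(audio_bytes: bytes, root_folder: str, extension: str = "wav") -> Path:
    """Derive a nested storage path from a hash of the raw audio bytes.

    Hypothetical layout (an assumption, not the dataset's confirmed scheme):
    root / <1 hex char> / <2 hex chars> / <12 hex chars>.<extension>
    """
    # A SHA-1 hex digest is 40 characters long, each one of 0-9 or a-f.
    digest = hashlib.sha1(audio_bytes).hexdigest()
    return Path(root_folder, digest[0], digest[1:3], f"{digest[3:15]}.{extension}")


if __name__ == "__main__":
    # Stand-in bytes; in a real pipeline these would be the decoded wav samples.
    example = hashed_store_path(b"example audio bytes", "public_youtube700_val")
    print(example.as_posix())
```

Under a scheme like this the folder a file lands in depends only on its content hash, so neighbouring files in one folder have no particular relationship to each other.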
what can be considered as a single file/video content after combining the folders.
We opted not to provide such information.
be considered as an independent youtube content file
No
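Given the answers above, the transcripts can still be gathered per file even though they cannot be grouped per video. The sketch below walks the downloaded root folder and collects every .txt file. The function name `collect_transcripts` is hypothetical, and UTF-8 encoding of the transcript files is an assumption.

```python
from pathlib import Path


def collect_transcripts(root_folder: str) -> dict[str, str]:
    """Map each transcript's path (relative to the root) to its text.

    Every entry corresponds to one audio file; nothing here recovers
    which files came from the same video.
    """
    root = Path(root_folder)
    return {
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8").strip()
        for path in sorted(root.rglob("*.txt"))  # recurse through all subfolders
    }
```

Calling `collect_transcripts("public_youtube700_val")` on the extracted dataset folder would return one entry per transcript file, keyed by its relative path.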