snakers4 / open_stt

Open STT

Any more information about the structure of the folder

samarth12 opened this issue

I have been looking at some of the smaller datasets that are available, for example public_youtube700_val and public_lecture_1. I would like to learn more about what the folder structure means.

For the public_youtube700_val I see folders from 0-9 and a-f with several subfolders. What do these numbers mean? Are all the audio files from each of these top-level root folders from a different youtube video or are they from the same video? Is there a way to determine this?

Hi @snakers4, thanks for getting back to me. But I do not understand what this script you pointed me to does. I am not sure how to run that code chunk either, what is wav?

If I only want to use the transcripts, I guess my question is this: if I need to use public_youtube700_val, how should I combine the contents of the folders and, more importantly, what can be considered as a single file/video content after combining the folders?

Can each subfolder in the root folder that I just downloaded for public_youtube700_val be considered as an independent youtube content file? Up to which level can I merge the individual transcript .txt files and treat the whole chunk as a single youtube file?

What this script you pointed me to does?

It calculates the path of a given file.

I am not sure how to run that code chunk either, what is wav?

WAV is an audio file format.
Hashes were calculated for the wav files before converting them to opus.
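
To illustrate how a hash of the wav data could turn into the nested folders seen in the download (a top level of 0-9 and a-f, then subfolders), here is a minimal sketch. It is not the script referenced in this thread: the function name `hashed_store_path`, the choice of SHA-1, and the exact way the digest is sliced into folder names are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def hashed_store_path(wav_bytes: bytes, root_folder: str, target_format: str = "opus") -> Path:
    """Map raw audio bytes to a nested path derived from their hash (illustrative only)."""
    # Assumption: a SHA-1 hex digest; the dataset's actual hash function may differ.
    f_hash = hashlib.sha1(wav_bytes).hexdigest()
    # Assumption: first hex character -> top-level folder (0-9, a-f),
    # next two characters -> subfolder, following twelve -> file name.
    return Path(root_folder, f_hash[0], f_hash[1:3], f"{f_hash[3:15]}.{target_format}")


if __name__ == "__main__":
    # Any bytes work here; in practice these would be the decoded wav samples.
    print(hashed_store_path(b"example audio bytes", "public_youtube700_val"))
```

Under a scheme like this the folder names depend only on the file's hash, so they would say nothing about which video a file came from.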

what can be considered as a single file/video content after combining the folders.

We opted not to provide such information.

be considered as an independent youtube content file

No
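
For anyone who only needs the transcripts, one option is to treat every .txt file as an independent utterance and collect them all in one pass, since no grouping by source video is provided. The sketch below assumes the transcripts are UTF-8 text files somewhere under the downloaded root folder; `collect_transcripts` is a hypothetical helper name, not part of the repository.

```python
from pathlib import Path
from typing import Dict


def collect_transcripts(root_folder: str) -> Dict[str, str]:
    """Return {relative path: transcript text} for every .txt file under root_folder."""
    root = Path(root_folder)
    transcripts = {}
    # rglob walks all nested subfolders, so the folder depth does not matter.
    for txt_path in sorted(root.rglob("*.txt")):
        key = txt_path.relative_to(root).as_posix()
        # Assumption: transcripts are stored as UTF-8 text.
        transcripts[key] = txt_path.read_text(encoding="utf-8").strip()
    return transcripts
```

Calling `collect_transcripts("public_youtube700_val")` would give one entry per transcript file, keyed by its path relative to the root folder.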