Cached English Common Voice dataset size.
guynich opened this issue · comments
Running du -h ./mozilla-foundation___common_voice_13_0 shows this folder in my HF datasets cache at 2.1 TB. I assume this is the unprocessed Common Voice dataset.
I created a pseudo-labelled dataset from "mozilla-foundation/common_voice_13_0" (which, I assume, created the cached folder above) using the pseudo-labelling script with the *_config_name options set to "en", producing an English pseudo-labelled version of the dataset called common_voice_13_0_en_pseudo_labelled_large_v3_str.
I'm now running the distillation script from the training README (Stage 3), and it is partway through generating the train/evaluation/test splits. The cached folder for my pseudo-labelled dataset has grown to 14 TB and is still growing. Luckily I have an instance with expandable storage.
Inspecting the cache folder with du -h ./common_voice_13_0_en_pseudo_labelled_large_v3_str/, I see multiple subfolders under default/0.0.0:
977G ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/86df6eb69614a3b8
81G ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/d77299dbcd226395
221G ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/ec2c020908a23a69
324G ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/3c4d7a51735ffa53
648G ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/6728d7a8c8821ed2
1.3T ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/d4d5052f1224937b
2.6T ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/67bfb1d58dc91573
5.1T ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0/ee62aeed963be186
14T ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0
14T ./common_voice_13_0_en_pseudo_labelled_large_v3_str/default
14T ./common_voice_13_0_en_pseudo_labelled_large_v3_str/
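To keep an eye on which of these fingerprint subfolders are growing, a small stdlib-only Python sketch can reproduce the per-folder totals above (the cache path in the usage note is an assumption based on the listing; adjust it to your own cache root):

```python
import os

def dir_size_bytes(path):
    """Recursively sum file sizes under path, similar to `du -sb`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):  # skip symlinks, as du does by default
                total += os.path.getsize(fp)
    return total

def report(cache_root):
    """Print the size of each immediate subfolder, largest first."""
    sizes = []
    for entry in os.listdir(cache_root):
        full = os.path.join(cache_root, entry)
        if os.path.isdir(full):
            sizes.append((dir_size_bytes(full), entry))
    for size, name in sorted(sizes, reverse=True):
        print(f"{size / 1e9:10.1f} GB  {name}")
```

For example, report("./common_voice_13_0_en_pseudo_labelled_large_v3_str/default/0.0.0") would list each hash-named subfolder with its size in GB.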
Question: is this increase in size expected with training preprocessing of the dataset?
My mistake here - closing. I had wrongly assumed the HF datasets cache was the location where the pseudo-labelled dataset is written. It is not.
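For anyone hitting the same confusion: the datasets cache location and the script's output location are separate things. A minimal sketch of the distinction, using hypothetical paths (note HF_DATASETS_CACHE must be set before the datasets library is first imported to take effect):

```python
import os

# HF_DATASETS_CACHE relocates the datasets cache, i.e. where load_dataset
# and .map() accumulate their intermediate Arrow files during preprocessing.
os.environ["HF_DATASETS_CACHE"] = "/mnt/scratch/hf_datasets_cache"

# The pseudo-labelled dataset itself is written wherever the script is told
# to save it (an output-directory argument), independent of the cache above.
output_dir = "/mnt/storage/common_voice_13_0_en_pseudo_labelled_large_v3_str"

print(os.environ["HF_DATASETS_CACHE"])
print(output_dir)
```

So a large cache can build up on one disk even when the final dataset is being saved somewhere else entirely.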
Re-running with a 43 GB pseudo-labelled dataset, the preprocessed cache folder comes to 718 GB, roughly 17x larger.
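As a quick sanity check on that growth factor, using the two sizes reported above:

```python
# Sizes reported for the re-run, in GB.
pseudo_labelled_gb = 43
preprocessed_cache_gb = 718

# Growth factor of the preprocessed cache over the input dataset.
ratio = preprocessed_cache_gb / pseudo_labelled_gb
print(f"{ratio:.1f}x")  # ~16.7x, i.e. roughly 17x
```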