Streamline preprocessing pipeline
daemon opened this issue
Raphael Tang commented
Data preprocessing is currently split into multiple steps, i.e.,
- Download the datasets (where?).
- Run `run.preprocess_dataset`.
- Write the corresponding `*.lab` files using `run.export_mfa`.
- Download Montreal Forced Aligner (MFA) and the corresponding CMU phonetic dictionary.
- Run MFA (`mfa_align`) over the speech corpus.
- Convert the output TextGrids to our `jsonl` format (`run.attach_mfa_alignment`).
We should make this process easier and document it somewhere.
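
As a starting point, here is a rough sketch of a single driver that chains the non-download steps listed above. Several things in it are assumptions and not taken from the codebase: that the `run.*` modules can be invoked with `python -m`, that they need no arguments (their real flags are not listed here, so none are passed), and that `mfa_align` takes its arguments in the MFA 1.x positional order (corpus directory, dictionary, acoustic model, output directory). The names `build_pipeline` and `run_pipeline` are hypothetical. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_pipeline(
    corpus_dir: Path, dictionary: Path, acoustic_model: str, align_dir: Path
) -> List[List[str]]:
    """Return the commands for the preprocessing steps, in order.

    Assumption: the run.* modules are runnable via `python -m`. Their
    arguments are not known here, so none are passed.
    """
    py = sys.executable
    return [
        [py, "-m", "run.preprocess_dataset"],
        [py, "-m", "run.export_mfa"],
        # Assumed MFA 1.x argument order; check against the installed version.
        ["mfa_align", str(corpus_dir), str(dictionary), acoustic_model, str(align_dir)],
        [py, "-m", "run.attach_mfa_alignment"],
    ]


def run_pipeline(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it.

    check=True stops the pipeline at the first failing step.
    """
    for cmd in commands:
        print("$ " + " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Placeholder paths and model name, for illustration only.
    commands = build_pipeline(
        Path("corpus"), Path("cmudict.txt"), "english", Path("aligned")
    )
    run_pipeline(commands, dry_run=True)
```

Keeping the command list separate from execution would make it easy to paste the same list into the docs, so the script and the documentation cannot drift apart.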