Streamline preprocessing pipeline

Question

daemon opened this issue 4 years ago · comments

Data preprocessing is currently split into multiple manual steps:

1. Download the datasets (where?).
2. Run run.preprocess_dataset.
3. Write the corresponding *.lab files using run.export_mfa.
4. Download Montreal Forced Aligner (MFA) and the corresponding CMU phonetic dictionary.
5. Run MFA (mfa_align) over the speech corpus.
6. Convert the output TextGrids to our jsonl format (run.attach_mfa_alignment).
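
As a starting point, the command-based steps above could be chained from one small driver script. This is only a rough sketch under several assumptions: the `run.*` modules are assumed to be runnable with `python -m` and are called without arguments because their options are not listed here; the `mfa_align` positional arguments follow the MFA 1.x convention (corpus, dictionary, acoustic model, output directory); and `build_steps`, `run_steps`, and all paths are hypothetical names. The two download steps are left out because their sources are still unknown.

```python
import subprocess
import sys
from typing import Callable, List, Optional, Sequence

Command = List[str]


def build_steps(corpus_dir: str, dictionary: str, textgrid_dir: str,
                acoustic_model: str = "english") -> List[Command]:
    """Return the preprocessing commands in the order they must run.

    Assumption: the run.* modules need no arguments; add the real ones here.
    """
    python = sys.executable
    return [
        [python, "-m", "run.preprocess_dataset"],
        [python, "-m", "run.export_mfa"],  # writes the *.lab files
        # Assumed MFA 1.x positional arguments.
        ["mfa_align", corpus_dir, dictionary, acoustic_model, textgrid_dir],
        [python, "-m", "run.attach_mfa_alignment"],  # TextGrids -> jsonl
    ]


def run_steps(steps: Sequence[Command],
              runner: Optional[Callable[[Command], object]] = None) -> None:
    """Run each command in order, stopping at the first failure."""
    if runner is None:
        def runner(cmd: Command) -> object:
            # check=True raises CalledProcessError on a non-zero exit code.
            return subprocess.run(cmd, check=True)
    for cmd in steps:
        runner(cmd)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    # The paths below are placeholders, not real locations.
    steps = build_steps("data/corpus", "data/cmudict.txt", "data/textgrids")
    run_steps(steps, runner=lambda cmd: print(" ".join(cmd)))
```

Passing the runner in as a parameter keeps the ordering logic testable without MFA or the datasets installed; dropping the `runner` argument executes the commands for real.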

We should make this process easier, ideally with a single entry point, and document it somewhere.