santi-pdp / pase

Problem Agnostic Speech Encoder

.scp file for train/test set

MittalShruti opened this issue · comments

Hi, I am new to Kaldi.

I installed Kaldi and tried the yesno recipe. There I found that to generate the .scp files you run run.sh, which in turn calls local/create_wav_scp.pl. There is also /kaldi/src/featbin/copy-feats, which can apparently be used as well (I haven't tried it yet).

Is there any way to generate .scp files without installing Kaldi? I don't use Kaldi for ASR.

Hi, this is for PASE+. I need an .scp file containing one .wav filename per line. After some Google searching, I figured out that Kaldi produces it when you run run.sh.

Here, I am asking if I can generate the .scp file without using Kaldi.

Hi Shruti,

Is this related to your previous issue on how to keep training PASE+ on additional data? In that case, the scp here is just a list of paths to the wavs you want to use to train/refine your model. There are no utterance ids here (as is the case with Kaldi scps).

For further self-supervised training on your data, you would need to generate the data config and the related file with normalisation stats, and then keep training as in the standard PASE+ case (see the README for more details).

Though we have only released the encoder so far, in the keep-training case you would ideally also want the workers' weights. I am sure we can share these too, right @mravanelli?

If you want to refine the weights in a supervised way on your data, the encoder weights we have shared to date are sufficient: just load them into the compute graph in torch, add your bits on top (follow-up layers if any, your target objective) and keep training (i.e. a bit like the TIMIT recipe in the ASR/ folder).
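The supervised setup described above can be sketched in PyTorch roughly as follows. This is only a minimal illustration: `DummyEncoder` is a hypothetical stand-in for the pretrained PASE+ encoder (in practice you would build it with the repo's frontend builder and load the released checkpoint), and the pooling and head are example "bits on top", not the repo's actual recipe.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained PASE+ encoder.
    In practice, build the real encoder from the repo and load
    the released checkpoint weights instead."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(1, emb_dim, kernel_size=160, stride=160)

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.conv(wav)          # -> (batch, emb_dim, frames)

class SupervisedModel(nn.Module):
    """Encoder plus task-specific head ('your bits on top')."""
    def __init__(self, encoder, emb_dim=256, n_classes=10):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, wav):
        feats = self.encoder(wav)      # (batch, emb_dim, frames)
        pooled = feats.mean(dim=2)     # average-pool over time
        return self.head(pooled)       # (batch, n_classes)

encoder = DummyEncoder()
# Optionally freeze the encoder instead of fine-tuning it:
# for p in encoder.parameters():
#     p.requires_grad = False
model = SupervisedModel(encoder)
logits = model(torch.randn(2, 1, 16000))   # 2 waveforms of 1 s at 16 kHz
```

From here you would train `model` against your target objective (e.g. cross-entropy on the logits) with a standard optimizer.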

Hope this helps.

Thanks @pswietojanski for replying. Yes, both issues are related.

In that case, the scp here is just a list of paths to the wavs you want to use to train/refine your model. There are no utterance ids here (as is the case with Kaldi scps).

Right, I noticed that the .scp here is just the paths to the .wav files, whereas Kaldi gives you an utterance_id wav_path file. So how should I generate this .scp file? Is it as simple as renaming a wav.txt file (one wav file path per line) to wav.scp? Or do I need to change the Kaldi scripts to store only the wav_path?

For further self-supervised training on your data, you would need to generate the data config and the related file with normalisation stats, and then keep training as in the standard PASE+ case (see the README for more details). Though we have only released the encoder so far, in the keep-training case you would ideally also want the workers' weights.

Yes, I would need the workers' weights. Alternatively, I could skip fine-tuning the given encoder weights (the self-supervised way) and instead train a PASE+ model from scratch on my data. After that, I'll store the encoder+workers weights and use them later, when I get more data and want to fine-tune PASE+.

If you want to refine the weights in a supervised way on your data, the encoder weights we have shared to date are sufficient: just load them into the compute graph in torch, add your bits on top (follow-up layers if any, your target objective) and keep training (i.e. a bit like the TIMIT recipe in the ASR/ folder).

The supervised training way is clear. I'll add a decoder/classifier on top.

So how should I generate this .scp file? Is it as simple as renaming a wav.txt file (one wav file path per line) to wav.scp? Or do I need to change the Kaldi scripts to store only the wav_path?

It is even simpler: the extension does not matter. You simply make a list and provide the file to the follow-up scripts. Note that the whole procedure has several more steps, as you can see from the main README instructions. That is, for PASE training to work well, you need to pre-process your dataset a bit to fit its assumptions (remove silence, segment, make the actual data config JSON). These scripts (for LibriSpeech and some other datasets) are included in the repo.
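Making such a list needs nothing beyond the standard library. A minimal sketch (the directory and output names are just placeholders for your own paths):

```python
from pathlib import Path

def write_wav_list(wav_dir, list_path):
    """Write one .wav path per line -- a plain file list,
    with no utterance ids (unlike Kaldi's wav.scp)."""
    wavs = sorted(Path(wav_dir).rglob("*.wav"))
    with open(list_path, "w") as f:
        for wav in wavs:
            f.write(f"{wav.resolve()}\n")
    return len(wavs)

# e.g. write_wav_list("data/LibriSpeech/wavs", "data/LibriSpeech/libri_tr.scp")
```

The resulting file can be passed wherever the follow-up scripts expect the train/test list, regardless of whether it is named .txt or .scp.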

Oh wow! This command had me believing that I could only pass the .scp extension.

python unsupervised_data_cfg_librispeech.py --data_root data/LibriSpeech/wavs \
	--train_scp data/LibriSpeech/libri_tr.scp --test_scp data/LibriSpeech/libri_te.scp \
	--libri_dict data/LibriSpeech/libri_dict.npy --cfg_file data/librispeech_data.cfg

So I'll make an xyz.txt file containing the list of wav files I am using.

I'll look at the other scripts to pre-process the data and generate the data config JSON.

Thanks for the help :)

Thanks!