santi-pdp / pase

Problem Agnostic Speech Encoder

.scp file for train/test set

MittalShruti opened this issue · comments

Hi, I am new to Kaldi.

I installed Kaldi and tried the yesno recipe. There I found that to generate the .scp files you run run.sh, which in turn calls local/create_wav_scp.pl. There is also /kaldi/src/featbin/copy-feats, which can apparently be used as well (I haven't tried it yet).

Is there any way to generate .scp files without installing Kaldi? I don't use Kaldi for ASR.

Hi, this is for PASE+. I need an .scp file containing one .wav filename per line. After some Google searching, I figured out that Kaldi produces it when you run run.sh.

Here, I am asking if I can generate the .scp file without using Kaldi.

Hi Shruti,

Is this related to your previous issue on how to keep training PASE+ on additional data? In that case, the scp here is just a list of paths to the wavs you want to use to train/refine your model. There are no utterance ids here (as is the case with Kaldi scps).

For further self-supervised training on your data, you would need to generate the data config and the related file with normalisation stats, and then keep training as in the standard PASE+ case (see the README for more details).

Though we have only released the encoder so far, in the keep-training case you would ideally also want the workers' weights. I am sure we can share these too, right @mravanelli?

If you want to refine the weights in a supervised way on your data, the encoder weights we have shared to date are sufficient: just load them into the compute graph in torch, add your bits on top (follow-up layers if any, your target objective) and keep training (i.e. a bit like the TIMIT recipe in the ASR/ folder).
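The supervised setup described above can be sketched in PyTorch roughly as follows. This is only a minimal illustration: `DummyEncoder` is a hypothetical stand-in for the pretrained PASE+ encoder (in practice you would build it with the repo's frontend builder and load the released checkpoint), and the pooling and head are example "bits on top", not the repo's actual recipe.

```python
import torch
import torch.nn as nn

class DummyEncoder(nn.Module):
    """Hypothetical stand-in for the pretrained PASE+ encoder.
    In practice, build the real encoder from the repo and load
    the released checkpoint weights instead."""
    def __init__(self, emb_dim=256):
        super().__init__()
        self.conv = nn.Conv1d(1, emb_dim, kernel_size=160, stride=160)

    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.conv(wav)          # -> (batch, emb_dim, frames)

class SupervisedModel(nn.Module):
    """Encoder plus task-specific head ('your bits on top')."""
    def __init__(self, encoder, emb_dim=256, n_classes=10):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, wav):
        feats = self.encoder(wav)      # (batch, emb_dim, frames)
        pooled = feats.mean(dim=2)     # average-pool over time
        return self.head(pooled)       # (batch, n_classes)

encoder = DummyEncoder()
# Optionally freeze the encoder instead of fine-tuning it:
# for p in encoder.parameters():
#     p.requires_grad = False
model = SupervisedModel(encoder)
logits = model(torch.randn(2, 1, 16000))   # 2 waveforms of 1 s at 16 kHz
```

From here you would train `model` against your target objective (e.g. cross-entropy on the logits) with a standard optimizer.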

Hope this helps.

Thanks @pswietojanski for replying. Yes, both issues are related.

In that case, the scp here is just a list of paths to the wavs you want to use to train/refine your model. There are no utterance ids here (as is the case with Kaldi scps).

Right, I noticed that the .scp here is just the paths to the .wav files, whereas Kaldi gives you an utterance_id wav_path file. So how should I generate this .scp file? Is it as simple as renaming a wav.txt file (one wav file path per line) to wav.scp? Or do I need to change the Kaldi scripts to store only the wav_path?

For further self-supervised training on your data, you would need to generate the data config and the related file with normalisation stats, and then keep training as in the standard PASE+ case (see the README for more details). Though we have only released the encoder so far, in the keep-training case you would ideally also want the workers' weights.

Yes, I would need the workers' weights. Alternatively, I could skip fine-tuning the given encoder weights (the self-supervised way) and instead train a PASE+ model from scratch on my data. After that, I'll store the encoder+workers weights and use them later, when I get more data and want to fine-tune PASE+.

If you want to refine the weights in a supervised way on your data, the encoder weights we have shared to date are sufficient: just load them into the compute graph in torch, add your bits on top (follow-up layers if any, your target objective) and keep training (i.e. a bit like the TIMIT recipe in the ASR/ folder).

The supervised training way is clear. I'll add a decoder/classifier on top.

So how should I generate this .scp file? Is it as simple as renaming a wav.txt file (one wav file path per line) to wav.scp? Or do I need to change the Kaldi scripts to store only the wav_path?

It is even simpler: the extension does not matter. You simply make a list and provide the file to the follow-up scripts. Note that the whole procedure has several more steps, as you can see from the main README instructions. That is, for PASE training to work well, you need to pre-process your dataset a bit to fit its assumptions (remove silence, segment, make the actual data config JSON). These scripts (for LibriSpeech and some other datasets) are included in the repo.
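Making such a list needs nothing beyond the standard library. A minimal sketch (the directory and output names are just placeholders for your own paths):

```python
from pathlib import Path

def write_wav_list(wav_dir, list_path):
    """Write one .wav path per line -- a plain file list,
    with no utterance ids (unlike Kaldi's wav.scp)."""
    wavs = sorted(Path(wav_dir).rglob("*.wav"))
    with open(list_path, "w") as f:
        for wav in wavs:
            f.write(f"{wav.resolve()}\n")
    return len(wavs)

# e.g. write_wav_list("data/LibriSpeech/wavs", "data/LibriSpeech/libri_tr.scp")
```

The resulting file can be passed wherever the follow-up scripts expect the train/test list, regardless of whether it is named .txt or .scp.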

Oh wow! This command had me believing that I could only pass the .scp extension.

python unsupervised_data_cfg_librispeech.py --data_root data/LibriSpeech/wavs \
	--train_scp data/LibriSpeech/libri_tr.scp --test_scp data/LibriSpeech/libri_te.scp \
	--libri_dict data/LibriSpeech/libri_dict.npy --cfg_file data/librispeech_data.cfg

So I'll make an xyz.txt file containing the list of wav files I am using.

I'll look at the other scripts to pre-process the data and generate the data config JSON.

Thanks for the help :)

Thanks!