santi-pdp / pase

Problem Agnostic Speech Encoder



self-supervised training from scratch

MittalShruti opened this issue · comments

Hi,

Training the self-supervised model from scratch takes a lot of time on a single-GPU machine. For the data that I have, it takes ~8 hrs to train for 1 epoch.

Apart from increasing the GPU count, do we have any other method to speed up the training?

How much data do you train on? In our case, training for an epoch on the 50-hour variant of LibriSpeech took around 30 minutes (or perhaps under that) on a single GTX 2080 GPU. Make sure you set --num_workers to something like 16 or so, as there is a lot of processing done on-the-fly in the data loaders.

OK, my data is only ~10 hrs. I am using a Google Colab GPU right now, which is a Tesla P100-PCIE-16GB. I tried with 16 workers, but the time is almost the same. Earlier I was using num_workers=4.

Epoch 0/1: 3% 192/6264 [16:37<5:28:56, 3.25s/it]

Code

!python -u  train.py --batch_size 16 --epoch 1 --save_path /content/pase+_ckpt \
	       --num_workers 16 --warmup 10000000 --net_cfg cfg/workers/workers+.cfg \
	       --fe_cfg cfg/frontend/PASE+.cfg --data_cfg /content/call_data.cfg \
	       --min_lr 0.0005 --fe_lr 0.001 --data_root /content/clips2_seg/ \
	       --dtrans_cfg cfg/distortions/pase+.cfg \
	       --stats /content/call_stats_pase+.pkl \
	       --chunk_size 32000 \
	       --tensorboard False \
	       --backprop_mode base \
	       --random_scale True \
	       --lr_mode poly

Not sure how Google's Colab assigns resources, but --num_workers 16 only makes sense if you have access to that many CPU cores (on top of a GPU). In that case data pre-processing is done in num_workers parallel processes while the GPU does the actual training. Check if there is a way to ask for more CPU cores for your session, ideally num_workers + 1. If that's not the case, then setting this to a high number and running on a single core is likely to make things slower rather than faster.
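A quick way to check this on the Colab/GCP side (an illustrative snippet, not part of the PASE code) is to look at how many cores the session actually exposes before picking num_workers:

import os

# Sketch only: pick num_workers from the cores the session actually exposes
# (Colab frequently shows just 2), instead of hard-coding 16.
cores = os.cpu_count() or 1
num_workers = max(1, cores - 1)  # leave roughly one core for the main process
print(f"{cores} CPU cores visible -> try --num_workers {num_workers}")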

I couldn't find anything to upgrade the CPU in Google Colab (for my country, non-US).
I'll set up a vCPU=16 (or 32) machine on GCP to train then, with a Tesla P100.

Also, one more thing: if vCPU=32, can I use num_workers=32? Or should it be 16?

If it speeds things up, then sure (for our setup 16 was about OK). See what seems to be the best setting in your case (it is an overall balancing game between how quickly the GPU can consume the data vs. how quickly the data loader can feed it, which in turn depends on disk I/O speed, etc.).
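One way to find that balance empirically (a rough sketch, not part of the repo; the dataset name is a placeholder) is to time the DataLoader alone for a few worker counts and compare against the GPU step time:

import time
from torch.utils.data import DataLoader

def seconds_per_batch(dataset, num_workers, batch_size=16, n_batches=50):
    # Measure how fast the data loader alone can deliver batches.
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, shuffle=True)
    it = iter(loader)
    start = time.time()
    for _ in range(n_batches):
        next(it)
    return (time.time() - start) / n_batches

# e.g.: for w in (4, 8, 16, 32): print(w, seconds_per_batch(my_dataset, w))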

Hey, I tried on a P100 with num_workers=16 (and 8). It still takes pretty long to train 1 epoch, ~4 hrs. I tried playing with the batch size as well. My GPU usage is intermittent (obviously, because data pre-processing is on the fly) but touches 100% at least once every iteration.

What else can be done? Can we pre-process the data beforehand?

I am also unsure about the first 2 lines in the distortion cfg: is the first line the path to the folder containing the segmented audio files, and the second line the file containing the paths to the test files?

[EDIT] my training data is 55 hrs long, not 10 hrs.

Well, it's clear there is a large bottleneck somewhere. It's most likely IO-related, due to slow disk access (i.e. reading waves, rather than augmenting them later). Where do you keep your actual dataset, on GDrive? That is likely to be very slow to read.

On the PASE side, the code has an option to cache the waveforms (or even the whole transforms) in memory and you should check it out (provided you have enough RAM on the virtual node to store the whole dataset, which for 10 hours should not be an issue). Note that this caching works on-the-fly, so the first epoch will still be slow while the cache is being built. Due to how the on-the-fly augmentation works, I suggest caching the waveforms only (not the whole transforms).
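The general idea looks roughly like the sketch below (a simplified illustration of the caching pattern, not PASE's actual dataset class):

import torchaudio
from torch.utils.data import Dataset

class CachedWavDataset(Dataset):
    """Illustration of on-the-fly waveform caching: the first epoch pays the
    disk reads, later epochs reuse the in-memory copies, while the
    augmentation/transform still runs fresh on every access."""

    def __init__(self, wav_paths, transform=None):
        self.wav_paths = wav_paths
        self.transform = transform
        self._cache = {}  # filename -> (waveform, sample_rate)

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        path = self.wav_paths[idx]
        if path not in self._cache:
            self._cache[path] = torchaudio.load(path)  # slow disk read, only once
        wav, sr = self._cache[path]
        return self.transform(wav) if self.transform is not None else wav

One caveat with this kind of cache: when num_workers > 0, each loader worker process holds its own copy, so RAM use multiplies and each worker only benefits from its own hits.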

Otherwise, you need to check what the other options are for attaching faster storage to Colab directly, i.e. is it possible to copy files to Colab rather than link them from GDrive? Can you copy them over from GDrive to the local node where the code actually runs prior to the actual training (if that's where the data is accessed from)? I do not have enough hands-on expertise with Colab to advise any particular solution, but I am sure people have experienced similar issues on their side, and these things are widely discussed online.

OK, I'll try the caching.

BTW I am not using GDrive or Colab. Everything is on Google Compute Engine: data, model, training.

Hi @pswietojanski

I am not able to get the caching working. I tried setting --cache_on_load and --trans_cache=True but it throws an error. What should be passed to the --trans_cache argument? --cache_on_load is not used inside the code, so I don't know how it works.

On GCP, I tried using a local SSD as well, so that the disk is attached to the instance, but I am getting the same speed as before.
Would you have more ideas to solve this problem? At the current training speed, it'll take at least 40 days to train the PASE+ model, and I don't have the resources to do that.

Thanks for reporting back on this. Do you have any way to get stats on how the machine is being used during a training session? Ideally something along the lines of a screenshot from the Linux top tool (or iotop).

Hey, sorry, I was travelling the last 2 days.

Attaching cProfile output from Google Colab. This is after setting cache_on_load=True and preload_wav=True in the dataset.py file; num_workers=1.

Screenshot from 2020-03-13 14-26-46

I'll post the iotop results once I run the code on GCP.

Hi @pswietojanski

CPU usage looks ~100%. Any comments here?

This is the htop output on GCP for the 1st epoch, using num_workers=16; P100; cache_on_load=True; preload_wav=True

Screenshot from 2020-03-13 16-45-27

This is the iotop output on GCP

Screenshot from 2020-03-13 22-19-23

Thanks. So one more thing you want to try is to limit each data-loading worker to one CPU core (at the math/linear-algebra library level). Right now it looks like each worker is trying to max out the whole machine, which is likely to lead to thread contention and thus limit performance.

If you are using a default PyTorch installation, it links against Intel's MKL. To force each worker to not use more than 1 core, put something like this in the top-level run.sh script: export OMP_NUM_THREADS=1
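The same limit can also be set from Python before torch is imported (a generic snippet assuming a standard PyTorch + MKL setup, not anything PASE-specific):

import os

# Cap math-library threading so each DataLoader worker sticks to one core.
os.environ["OMP_NUM_THREADS"] = "1"   # OpenMP threads (used by MKL)
os.environ["MKL_NUM_THREADS"] = "1"   # Intel MKL threads directly

import torch
torch.set_num_threads(1)              # also cap PyTorch's own intra-op threads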

Look at the load average stat in the htop screenshot: it stays at 65, while given your machine has 16 cores it should be around 16 (load is not necessarily linked to CPU usage, it could be IO too). But it's worth fixing all potential issues.

I did os.environ["OMP_NUM_THREADS"] = "1" with num_workers=4; it doesn't reduce the time much (20-30 min at most).

Hey, I am getting better speed now; I was using fewer CPU cores and a K80 machine earlier.
With num_workers=8 and a V100, the time to train 1 epoch is ~2 hrs.
I'll set up a 16-core machine and see if the training time is halved.

Thanks for the suggestion! I know I am stretching this now, but any more suggestions? :)

With num_workers=16 on a P100, os.environ["OMP_NUM_THREADS"] = "1", preload_wav=True and cache_on_load=True, the epoch trains in ~90 mins.
However, preloading is not improving the speed from epoch 2 onwards [anything I can do here?]

htop output
Screenshot from 2020-03-18 06-38-42

Any other suggestions to improve the speed?

Looks like the overall system is much better balanced now (no contention, well-loaded cores). How much data do you pretrain on in this setup, 50 hours? You can check nvidia-smi to see how busy the GPU is; try to max out the GPU memory by increasing the batch size to something like 32 instead of 16 (or whatever fits the memory). Going from 8 to 16 workers decreased training time by 30 minutes (that's a lot), which means data still cannot be fed fast enough into the training module, so you can further increase num_workers to see where it stops helping (and the corresponding number of CPU cores).
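One quick way to judge the headroom for a larger batch (an illustrative snippet, device index assumed to be 0) is to print allocated vs. total GPU memory after a few training steps:

import torch

def gpu_memory_report(device=0):
    # Rough GPU memory report to decide whether batch_size can grow further.
    allocated = torch.cuda.memory_allocated(device) / 1e9
    total = torch.cuda.get_device_properties(device).total_memory / 1e9
    print(f"GPU memory allocated: {allocated:.1f} GB of {total:.1f} GB")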

Why is caching not increasing the speed? I am setting preload_wav=True and cache_on_load=True in train.py. Could CPU-to-GPU data transfer time be a bottleneck?
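If CPU-to-GPU copies do turn out to matter, a common generic PyTorch pattern to try (not necessarily exposed as a PASE flag) is pinned host memory plus non-blocking transfers:

from torch.utils.data import DataLoader

def make_loader(dataset, batch_size=16, num_workers=16):
    # pin_memory=True gives page-locked host buffers, which lets the
    # subsequent .cuda(non_blocking=True) copy overlap with computation.
    return DataLoader(dataset, batch_size=batch_size,
                      num_workers=num_workers, pin_memory=True)

# In the training loop:
# for batch in make_loader(train_set):
#     batch = batch.cuda(non_blocking=True)  # asynchronous host-to-GPU copy
#     ...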

Hi @MittalShruti

Could you share an idea of how to create your own /content/clips2_seg/?

Which files did you use to do that?

Thank you very much for your help, I appreciate it.

@pswietojanski hello, could you explain the question about the overlap_list mentioned by @MittalShruti above? I'm confused about why this points to the test set too. Does that mean you use the test set as the overlap data?

Also, I have another question: have you tried using multiple GPUs? I found the code uses one GPU at most, and the unlabeled data for self-supervised training should be large.

And after reading through the comments above, it's still unclear how to use the cache. I also found that the code caches all the wav data to RAM; why isn't the transform the bottleneck? Have you done any profiling?

Hi @MittalShruti, did you segment your data before running PASE+?

Hey, sorry, this was long ago. I don't remember the details now. I pretty much followed the README and the training scripts to understand the data pre-processing pipeline.

Hi @MittalShruti, I really appreciate your reply. Could you share your code? It would be very useful.


Hello! I have been replicating this experiment recently, but while making the dataset config file I could not figure out where to obtain these files (--train_scp data/LibriSpeech/libri_tr.scp --test_scp data/LibriSpeech/libri_te.scp --libri_dict data/LibriSpeech/libri_dict.npy). I look forward to your reply very much. Thank you.