santi-pdp / pase

Problem Agnostic Speech Encoder



self-supervised training from scratch

MittalShruti opened this issue · comments

Hi,

Training the self-supervised model from scratch takes a lot of time on a single-GPU machine. For the data that I have, it takes ~8 hrs to train for 1 epoch.

Apart from increasing the GPU count, do we have any other method to speed up the training?

How much data do you train on? In our case, training for an epoch on the 50-hour variant of LibriSpeech took around 30 minutes (or perhaps under that) on a single GTX 2080 GPU. Make sure you set --num_workers to something like 16 or so, as there is a lot of processing done on-the-fly in the data loaders.

OK, my data is only ~10 hrs. I am using a Google Colab GPU right now, which is a Tesla P100-PCIE-16GB. I tried with 16 workers, but the time is almost the same. Earlier I was using num_workers=4.

Epoch 0/1: 3% 192/6264 [16:37<5:28:56, 3.25s/it]

Code

!python -u  train.py --batch_size 16 --epoch 1 --save_path /content/pase+_ckpt \
	       --num_workers 16 --warmup 10000000 --net_cfg cfg/workers/workers+.cfg \
	       --fe_cfg cfg/frontend/PASE+.cfg --data_cfg /content/call_data.cfg \
	       --min_lr 0.0005 --fe_lr 0.001 --data_root /content/clips2_seg/ \
	       --dtrans_cfg cfg/distortions/pase+.cfg \
	       --stats /content/call_stats_pase+.pkl \
	       --chunk_size 32000 \
	       --tensorboard False \
	       --backprop_mode base \
	       --random_scale True \
	       --lr_mode poly

Not sure how Google's Colab assigns resources, but --num_workers 16 only makes sense if you have access to that many CPU cores (on top of a GPU). In that case data pre-processing is done in num_workers parallel processes while the GPU does the actual training. Check if there is a way to ask for more CPU cores for your session, ideally num_workers + 1. If that's not the case, then setting this to a high number and running on a single core is likely to make things slower rather than faster.
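A quick way to check this on the Colab/GCP side (an illustrative snippet, not part of the PASE code) is to look at how many cores the session actually exposes before picking num_workers:

import os

# Sketch only: pick num_workers from the cores the session actually exposes
# (Colab frequently shows just 2), instead of hard-coding 16.
cores = os.cpu_count() or 1
num_workers = max(1, cores - 1)  # leave roughly one core for the main process
print(f"{cores} CPU cores visible -> try --num_workers {num_workers}")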

I couldn't find anything to upgrade the CPU in Google Colab (for my country, non-US).
I'll set up a vCPU=16 (or 32) machine on GCP to train then, with a Tesla P100.

Also, one more thing: if vCPU=32, can I use num_workers=32? Or should it be 16?

If it speeds things up, then sure (for our setup 16 was about OK). See what seems to be the best setting in your case (it is an overall balancing game between how quickly the GPU can consume the data vs. how quickly the data loader can feed it, which in turn depends on disk I/O speed, etc.).
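One way to find that balance empirically (a rough sketch, not part of the repo; the dataset name is a placeholder) is to time the DataLoader alone for a few worker counts and compare against the GPU step time:

import time
from torch.utils.data import DataLoader

def seconds_per_batch(dataset, num_workers, batch_size=16, n_batches=50):
    # Measure how fast the data loader alone can deliver batches.
    loader = DataLoader(dataset, batch_size=batch_size,
                        num_workers=num_workers, shuffle=True)
    it = iter(loader)
    start = time.time()
    for _ in range(n_batches):
        next(it)
    return (time.time() - start) / n_batches

# e.g.: for w in (4, 8, 16, 32): print(w, seconds_per_batch(my_dataset, w))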

Hey, I tried on a P100 with num_workers=16 (and 8). It still takes pretty long to train 1 epoch, ~4 hrs. I tried playing with the batch size as well. My GPU usage is intermittent (obviously, because data pre-processing is on the fly) but touches 100% at least once every iteration.

What else can be done? Can we pre-process the data beforehand?

I am also unsure about the first 2 lines in the distortion cfg: is the first line the path to the folder containing the segmented audio files, and the second line the file containing the paths to the test files?

[EDIT] my training data is 55 hrs long, not 10 hrs.

Well, it's clear there is a large bottleneck somewhere. It's most likely IO-related, due to slow disk access (i.e. reading waves, rather than augmenting them later). Where do you keep your actual dataset, on GDrive? That is likely to be very slow to read.

On the PASE side, the code has an option to cache the waveforms (or even the whole transforms) in memory and you should check it out (provided you have enough RAM on the virtual node to store the whole dataset, which for 10 hours should not be an issue). Note that this caching works on-the-fly, so the first epoch will still be slow while the cache is being built. Due to how the on-the-fly augmentation works, I suggest caching the waveforms only (not the whole transforms).
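The general idea looks roughly like the sketch below (a simplified illustration of the caching pattern, not PASE's actual dataset class):

import torchaudio
from torch.utils.data import Dataset

class CachedWavDataset(Dataset):
    """Illustration of on-the-fly waveform caching: the first epoch pays the
    disk reads, later epochs reuse the in-memory copies, while the
    augmentation/transform still runs fresh on every access."""

    def __init__(self, wav_paths, transform=None):
        self.wav_paths = wav_paths
        self.transform = transform
        self._cache = {}  # filename -> (waveform, sample_rate)

    def __len__(self):
        return len(self.wav_paths)

    def __getitem__(self, idx):
        path = self.wav_paths[idx]
        if path not in self._cache:
            self._cache[path] = torchaudio.load(path)  # slow disk read, only once
        wav, sr = self._cache[path]
        return self.transform(wav) if self.transform is not None else wav

One caveat with this kind of cache: when num_workers > 0, each loader worker process holds its own copy, so RAM use multiplies and each worker only benefits from its own hits.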

Otherwise, you need to check what the other options are for attaching faster storage to Colab directly, i.e. is it possible to copy files to Colab rather than link them from GDrive? Can you copy them over from GDrive to the local node where the code actually runs prior to the actual training (if that's where the data is accessed from)? I do not have enough hands-on expertise with Colab to advise any particular solution, but I am sure people have experienced similar issues on their side, and these things are widely discussed online.

OK, I'll try the caching.

BTW I am not using GDrive or Colab. Everything is on Google Compute Engine: data, model, training.

Hi @pswietojanski

I am not able to get the caching working. I tried setting --cache_on_load and --trans_cache=True but it throws an error. What should be passed to the --trans_cache argument? --cache_on_load is not used inside the code, so I don't know how it works.

On GCP, I tried using a local SSD as well, so that the disk is attached to the instance, but I am getting the same speed as before.
Would you have more ideas to solve this problem? At the current training speed, it'll take at least 40 days to train the PASE+ model, and I don't have the resources to do that.

Thanks for reporting back on this. Do you have any way to get stats on how the machine is being used during a training session? Ideally something along the lines of a screenshot from the Linux top tool (or iotop).

Hey, sorry, I was travelling the last 2 days.

Attaching cProfile output from Google Colab. This is after setting cache_on_load=True and preload_wav=True in the dataset.py file; num_workers=1.

Screenshot from 2020-03-13 14-26-46

I'll post the iotop results once I run the code on GCP.

Hi @pswietojanski

CPU usage looks ~100%. Any comments here?

This is the htop output on GCP for the 1st epoch, using num_workers=16; P100; cache_on_load=True; preload_wav=True

Screenshot from 2020-03-13 16-45-27

This is the iotop output on GCP

Screenshot from 2020-03-13 22-19-23

Thanks. So one more thing you want to try is to limit each data-loading worker to one CPU core (at the math/linear-algebra library level). Right now it looks like each worker is trying to max out the whole machine, which is likely to lead to thread contention and thus limit performance.

If you are using a default PyTorch installation, it links against Intel's MKL. To force each worker to not use more than 1 core, put something like this in the top-level run.sh script: export OMP_NUM_THREADS=1
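The same limit can also be set from Python before torch is imported (a generic snippet assuming a standard PyTorch + MKL setup, not anything PASE-specific):

import os

# Cap math-library threading so each DataLoader worker sticks to one core.
os.environ["OMP_NUM_THREADS"] = "1"   # OpenMP threads (used by MKL)
os.environ["MKL_NUM_THREADS"] = "1"   # Intel MKL threads directly

import torch
torch.set_num_threads(1)              # also cap PyTorch's own intra-op threads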

Look at the load average stat in the htop screenshot: it stays at 65, while given your machine has 16 cores it should be around 16 (load is not necessarily linked to CPU usage, it could be IO too). But it's worth fixing all potential issues.

I did os.environ["OMP_NUM_THREADS"] = "1" with num_workers=4; it doesn't reduce the time much (20-30 min at most).

Hey, I am getting better speed now; I was using fewer CPU cores and a K80 machine earlier.
With num_workers=8 and a V100, the time to train 1 epoch is ~2 hrs.
I'll set up a 16-core machine and see if the training time is halved.

Thanks for the suggestion! I know I am stretching this now, but any more suggestions? :)

With num_workers=16 on a P100, os.environ["OMP_NUM_THREADS"] = "1", preload_wav=True and cache_on_load=True, the epoch trains in ~90 mins.
However, preloading is not improving the speed from epoch 2 onwards [anything I can do here?]

htop output
Screenshot from 2020-03-18 06-38-42

Any other suggestions to improve the speed?

Looks like the overall system is much better balanced now (no contention, well-loaded cores). How much data do you pretrain on in this setup, 50 hours? You can check nvidia-smi to see how busy the GPU is; try to max out the GPU memory by increasing the batch size to something like 32 instead of 16 (or whatever fits the memory). Going from 8 to 16 workers decreased training time by 30 minutes (that's a lot), which means data still cannot be fed fast enough into the training module, so you can further increase num_workers to see where it stops helping (and the corresponding number of CPU cores).
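One quick way to judge the headroom for a larger batch (an illustrative snippet, device index assumed to be 0) is to print allocated vs. total GPU memory after a few training steps:

import torch

def gpu_memory_report(device=0):
    # Rough GPU memory report to decide whether batch_size can grow further.
    allocated = torch.cuda.memory_allocated(device) / 1e9
    total = torch.cuda.get_device_properties(device).total_memory / 1e9
    print(f"GPU memory allocated: {allocated:.1f} GB of {total:.1f} GB")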

Why is caching not increasing the speed? I am setting preload_wav=True and cache_on_load=True in train.py. Could CPU-to-GPU data transfer time be a bottleneck?
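If CPU-to-GPU copies do turn out to matter, a common generic PyTorch pattern to try (not necessarily exposed as a PASE flag) is pinned host memory plus non-blocking transfers:

from torch.utils.data import DataLoader

def make_loader(dataset, batch_size=16, num_workers=16):
    # pin_memory=True gives page-locked host buffers, which lets the
    # subsequent .cuda(non_blocking=True) copy overlap with computation.
    return DataLoader(dataset, batch_size=batch_size,
                      num_workers=num_workers, pin_memory=True)

# In the training loop:
# for batch in make_loader(train_set):
#     batch = batch.cuda(non_blocking=True)  # asynchronous host-to-GPU copy
#     ...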

Hi @MittalShruti

Could you share an idea of how to create your own /content/clips2_seg/?

Which files did you use to do that?

Thank you very much for your help, I appreciate it.

@pswietojanski hello, could you explain the question about the overlap_list mentioned by @MittalShruti above? I'm confused about why this points to the test set too. Does that mean you use the test set as the overlap data?

Also, I have another question: have you tried using multiple GPUs? I found the code uses one GPU at most, and the unlabeled data for self-supervised training should be large.

And after reading through the comments above, it's still unclear how to use the cache. I also found that the code caches all the wav data to RAM; why isn't the transform the bottleneck? Have you done any profiling?

Hi @MittalShruti, did you segment your data before running PASE+?

Hey, sorry, this was long ago. I don't remember the details now. I pretty much followed the README and the training scripts to understand the data pre-processing pipeline.

Hi @MittalShruti, I really appreciate your reply. Could you share your code? It would be very useful.


Hello! I have been replicating this experiment recently, but while making the dataset config file I could not figure out where to obtain these files (--train_scp data/LibriSpeech/libri_tr.scp --test_scp data/LibriSpeech/libri_te.scp --libri_dict data/LibriSpeech/libri_dict.npy). I look forward to your reply very much. Thank you.