aharley / pips

Particle Video Revisited

Spikes in data loading time

greeneggsandyaml opened this issue

Hello, thanks for your paper!

I'm running the training code and have a question: I get occasional spikes in rtime that drastically slow down training. For example:

16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000024/200000; rtime 0.06; itime 1.95; loss = 67.71947
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000025/200000; rtime 0.05; itime 1.91; loss = 107.49207
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000026/200000; rtime 0.06; itime 1.86; loss = 106.29369
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000027/200000; rtime 0.03; itime 2.02; loss = 105.84137
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000028/200000; rtime 0.03; itime 1.83; loss = 167.15781
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000029/200000; rtime 0.03; itime 2.09; loss = 111.99800
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000030/200000; rtime 0.06; itime 2.06; loss = 102.14317
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000031/200000; rtime 0.03; itime 1.80; loss = 100.11551
warning: sampling failed
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000032/200000; rtime 210.16; itime 212.36; loss = 76.14967
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000033/200000; rtime 0.05; itime 2.00; loss = 106.69480
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000034/200000; rtime 0.05; itime 1.91; loss = 107.20820
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000035/200000; rtime 0.05; itime 1.97; loss = 82.02447
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000036/200000; rtime 0.05; itime 2.22; loss = 80.11346
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000037/200000; rtime 0.07; itime 1.92; loss = 124.75398
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000038/200000; rtime 0.02; itime 1.85; loss = 89.14218
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000039/200000; rtime 25.70; itime 27.52; loss = 126.53448
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000040/200000; rtime 0.03; itime 1.60; loss = 127.69047
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000041/200000; rtime 51.45; itime 53.37; loss = 80.23492
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000042/200000; rtime 0.05; itime 1.89; loss = 125.51277
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000043/200000; rtime 0.05; itime 1.84; loss = 75.43661
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000044/200000; rtime 0.03; itime 1.95; loss = 90.89172
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000045/200000; rtime 0.03; itime 1.79; loss = 119.69950
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000046/200000; rtime 0.04; itime 1.96; loss = 75.55309
16hv_8_96_I4_5e-4_A_debug_15:30:12; step 000047/200000; rtime 0.03; itime 1.79; loss = 83.27483

As you can see, rtime is usually small, but roughly once every 20 iterations it is very large. I am using a small N (N=96), and all other parameters are at their defaults, so the configuration should not be the issue. My full training command is python train.py --N=96.

Have you experienced this before? If not, do you have any thoughts on this sort of issue?

Thank you for your help!

Yes, this means the dataloader fell behind and needs time to catch up. What you may be able to do is increase nworkers, so that more data gets prepared in parallel, or increase N, so that the model spends longer on each sample. Sometimes a good strategy is to increase N to a very high number (like 512 or 768) and set B=1, to minimize the work for the dataloader and maximize the work for the model.
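For what it's worth, here is a minimal sketch of the idea, assuming the training loop uses a standard PyTorch DataLoader and that nworkers maps to its num_workers argument; the dataset class and tensor shapes below are made up purely for illustration:

import torch
from torch.utils.data import Dataset, DataLoader

class DummyPointTrajDataset(Dataset):
    """Hypothetical stand-in for the real dataset, for illustration only."""
    def __len__(self):
        return 1000

    def __getitem__(self, idx):
        # Simulate expensive per-sample preparation (e.g., trajectory
        # sampling, which can occasionally fail and retry, matching the
        # "warning: sampling failed" message in the log above).
        return torch.randn(8, 3, 368, 496)

# More workers prefetch samples in parallel, so one slow sample no longer
# stalls the training loop; B=1 with a large N shifts time from the
# dataloader to the model's forward/backward pass.
loader = DataLoader(
    DummyPointTrajDataset(),
    batch_size=1,        # B=1
    num_workers=8,       # "nworkers": raise this if rtime spikes persist
    pin_memory=True,
    prefetch_factor=2,   # batches each worker keeps ready in advance
)

Under those assumptions, and if train.py exposes these as command-line flags the way your --N=96 invocation suggests, something like python train.py --N=512 --B=1 --nworkers=8 would apply both suggestions at once.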

Thanks for the response! I will try out your suggestions and let you know how it goes :)