GPU memory / joint training
Timo-hab opened this issue · comments
I am trying to reproduce the results for the Cityscapes dataset. I am now at the joint training stage, and the paper says the crop size was 1396x1396 px (half the image plus label margin) with batch size 1. Surprisingly, this exceeds 12 GB of GPU memory, so I can't start the training. 1180x1180 px is the maximum that fits, even when running in a virtual console.
Can that be related to the cuDNN version? I tried versions 2, 3, 4, 5, and 5.1 and could not observe any difference in memory requirements; in all cases 1180x1180 px was the maximum.
That is kind of strange, isn't it? I would expect a difference between cuDNN versions.
Could you please give me some advice? Thank you in advance!
I have a similar issue. I can only run the program in CPU mode because it requires about 18 GB of memory.
@Timo-hab @chenqifeng22 test_batch has to be 0 for Cityscapes joint training. Was it set correctly?
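Why test_batch = 0 helps can be sketched as follows. This is a minimal illustration, not the repo's actual code: `build_solver_config` and its arguments are hypothetical names, but the underlying mechanism is standard Caffe behavior, where a test phase keeps a second network resident in GPU memory.

```python
# Hypothetical sketch: if test_batch is 0, no test net is added to the
# solver configuration, so Caffe never allocates a second network (and its
# activations) on the GPU during training.
def build_solver_config(train_net, test_batch, test_net=None):
    lines = ['train_net: "%s"' % train_net]
    if test_batch > 0 and test_net is not None:
        # A test phase would keep a second network in GPU memory.
        lines.append('test_net: "%s"' % test_net)
        lines.append('test_iter: %d' % test_batch)
        lines.append('test_interval: 1000')
    return "\n".join(lines)
```

With test_batch = 0, the generated solver text contains only the train net, which is why the joint-training crop fits once the test phase is disabled.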
I only tried predict.py on a single image. I get an "out of memory" error with a Titan X.
@chenqifeng22 You have to check your setup. A lot of people have confirmed to me that predict.py can work well.
@fyu I need to install cuDNN to make it work.
@fyu thank you for your answer. test_batch wasn't set to 0. Now I can start the training, but of course I don't see any test results.
Is it possible to test the model on the validation dataset while training, so I can see accuracy and loss, without needing more GPU memory?
I also started to play around with the code a bit and wanted to test it on the Cityscapes test image. Not training, just predict.py.
It gives an out-of-memory error when I run it on my GPU with --gpu 0. Did you find a solution for this? Is 18 GB of GPU memory really required? If so, is there any way to reduce the memory footprint at test time? I use cuDNN v5.1 and CUDA 8.0.
By the way: it works on KITTI with the GPU and on Cityscapes with CPU only.
@manuelschmidt Depending on the GPU model, the dilation network needs a different amount of memory (which is quite strange to me).
There is a simple solution: make the input image size smaller in the prototxt file. You also need to change `if dataset.zoom > 1:` to `if dataset.zoom >= 1:` in predict.py to work around a bug.
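The workaround above can be sketched as a small helper. This is illustrative only: `reduced_input` is not a function in the repo, and the 1180 px default just echoes the limit reported earlier in this thread. The idea is to shrink the network input so it fits in GPU memory, then let the (fixed) zoom step scale the prediction back to full resolution:

```python
def reduced_input(full_w, full_h, max_side=1180):
    """Scale the input down so the longer side fits max_side (keeping the
    aspect ratio) and return (w, h, zoom), where zoom restores the
    prediction to full resolution. Note that with a reduced input the zoom
    factor can be exactly >= 1, which is why predict.py's check must be
    `dataset.zoom >= 1` rather than `> 1`."""
    scale = min(1.0, max_side / float(max(full_w, full_h)))
    w, h = int(full_w * scale), int(full_h * scale)
    zoom = full_w / float(w)
    return w, h, zoom
```

For a 2048x1024 Cityscapes frame this yields an 1180x590 input with a zoom of about 1.74, which fit on the 12 GB cards discussed above.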
@fyu I am training on the ADE20K dataset with the Dilation network's frontend on a GPU with 8 GB of memory, and I get an 'out of memory' error. Is 8 GB enough for the training?
My train batch is set to 10 and test batch to 0.
Thank you
@qualia0000 Besides the batch size, memory consumption also depends on the crop size.
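A rough back-of-envelope sketch of why both knobs matter (my own approximation, not code from this repo): activation memory for a conv net grows roughly linearly with `batch * crop_h * crop_w`, so halving the crop side quarters the footprint, while halving the batch only halves it.

```python
def activation_bytes(batch, crop_h, crop_w, channels_sum, bytes_per=4):
    """Crude estimate of stored activation memory.
    channels_sum: total feature channels summed over the layers whose
    activations are kept (a made-up aggregate for illustration);
    bytes_per: 4 for float32."""
    return batch * crop_h * crop_w * channels_sum * bytes_per
```

So with a train batch of 10 on ADE20K, reducing the crop size is usually the more effective way to fit into 8 GB than tweaking cuDNN versions.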