GPU memory / joint training
Timo-hab opened this issue · comments
I am trying to reproduce the results for the Cityscapes dataset. I am now at the joint training stage, and the paper says the crop size was 1396x1396 px (half the image plus label margin) with batch size 1. Surprisingly, this exceeds 12 GB of GPU memory, so I can't start the training. 1180x1180 px is the maximum that fits, even when running in a virtual console.
Can that be related to the cuDNN version? I tried versions 2, 3, 4, 5, and 5.1 and could not observe any difference in memory requirements; in all cases 1180x1180 px was the maximum.
That is kind of strange, isn't it? I would expect a difference between cuDNN versions.
Could you please give me some advice? Thank you in advance!
I have a similar issue. I can only run the program in CPU mode because it requires about 18 GB of memory.
@Timo-hab @chenqifeng22 test_batch has to be 0 for Cityscapes joint training. Was it set correctly?
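Why test_batch = 0 helps can be sketched as follows. This is a minimal illustration, not the repo's actual code: `build_solver_config` and its arguments are hypothetical names, but the underlying mechanism is standard Caffe behavior, where a test phase keeps a second network resident in GPU memory.

```python
# Hypothetical sketch: if test_batch is 0, no test net is added to the
# solver configuration, so Caffe never allocates a second network (and its
# activations) on the GPU during training.
def build_solver_config(train_net, test_batch, test_net=None):
    lines = ['train_net: "%s"' % train_net]
    if test_batch > 0 and test_net is not None:
        # A test phase would keep a second network in GPU memory.
        lines.append('test_net: "%s"' % test_net)
        lines.append('test_iter: %d' % test_batch)
        lines.append('test_interval: 1000')
    return "\n".join(lines)
```

With test_batch = 0, the generated solver text contains only the train net, which is why the joint-training crop fits once the test phase is disabled.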
I only tried predict.py on a single image. I get an "out of memory" error with a Titan X.
@chenqifeng22 You have to check your setup. A lot of people have confirmed to me that predict.py can work well.
@fyu I need to install cuDNN to make it work.
@fyu thank you for your answer. test_batch wasn't set to 0. Now I can start the training, but of course I don't see any test results.
Is it possible to test the model on the validation dataset while training, so I can see accuracy and loss, without needing more GPU memory?
I also started to play around with the code a bit and wanted to test it on the Cityscapes test image. Not training, just predict.py.
It gives an out-of-memory error when I run it on my GPU with --gpu 0. Did you find a solution for this? Is 18 GB of GPU memory really required? If so, is there any way to reduce the memory footprint at test time? I use cuDNN v5.1 and CUDA 8.0.
By the way: it works on KITTI with the GPU and on Cityscapes with CPU only.
@manuelschmidt Depending on the GPU model, the dilation network needs a different amount of memory (which is quite strange to me).
There is a simple solution: make the input image size smaller in the prototxt file. You also need to change `if dataset.zoom > 1:` to `if dataset.zoom >= 1:` in predict.py to work around a bug.
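The workaround above can be sketched as a small helper. This is illustrative only: `reduced_input` is not a function in the repo, and the 1180 px default just echoes the limit reported earlier in this thread. The idea is to shrink the network input so it fits in GPU memory, then let the (fixed) zoom step scale the prediction back to full resolution:

```python
def reduced_input(full_w, full_h, max_side=1180):
    """Scale the input down so the longer side fits max_side (keeping the
    aspect ratio) and return (w, h, zoom), where zoom restores the
    prediction to full resolution. Note that with a reduced input the zoom
    factor can be exactly >= 1, which is why predict.py's check must be
    `dataset.zoom >= 1` rather than `> 1`."""
    scale = min(1.0, max_side / float(max(full_w, full_h)))
    w, h = int(full_w * scale), int(full_h * scale)
    zoom = full_w / float(w)
    return w, h, zoom
```

For a 2048x1024 Cityscapes frame this yields an 1180x590 input with a zoom of about 1.74, which fit on the 12 GB cards discussed above.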
@fyu I am training on the ADE20K dataset with the Dilation network's frontend on a GPU with 8 GB of memory, and I get an 'out of memory' error. Is 8 GB enough for the training?
My train batch is set to 10 and test batch to 0.
Thank you
@qualia0000 Besides the batch size, memory consumption also depends on the crop size.
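A rough back-of-envelope sketch of why both knobs matter (my own approximation, not code from this repo): activation memory for a conv net grows roughly linearly with `batch * crop_h * crop_w`, so halving the crop side quarters the footprint, while halving the batch only halves it.

```python
def activation_bytes(batch, crop_h, crop_w, channels_sum, bytes_per=4):
    """Crude estimate of stored activation memory.
    channels_sum: total feature channels summed over the layers whose
    activations are kept (a made-up aggregate for illustration);
    bytes_per: 4 for float32."""
    return batch * crop_h * crop_w * channels_sum * bytes_per
```

So with a train batch of 10 on ADE20K, reducing the crop size is usually the more effective way to fit into 8 GB than tweaking cuDNN versions.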