DmitryUlyanov / texture_nets

Code for "Texture Networks: Feed-forward Synthesis of Textures and Stylized Images" paper.

Faster on 8*4GHz / 32G Mem than 32*2GHz / 64G Mem?

0000sir opened this issue

I've tested this on a computer with an AMD FX(tm)-8350 Eight-Core Processor and 32 GB of RAM; with test.lua I can generate a stylized.jpg in 6 seconds. After reading #41, I think it should be possible to generate high-resolution images if I have more RAM.
I then set up a VM on XenServer with 32 CPU cores (Intel(R) Xeon(R) CPU E7-4820 @ 2.00GHz) and 64 GB of RAM, but when I ran test.lua with the same parameters it was extremely slow compared to the first machine: it took over 15 minutes to generate a single image. What could be wrong?
I also noticed the script runs on only one CPU core. Is that normal?

This is what I used:
th test.lua -input_image images/forbidden_city.jpg -model_t7 data/checkpoints/model.t7 -cpu

Does anybody have experience with this?

Thanks.

The time decreased to about 20 seconds after I reinstalled Torch; maybe I missed something during the previous installation.
But test.lua still uses only one CPU core. How can I make it use all of my CPUs?

Torch should use OpenMP for parallel computation; the fact that only one core is used suggests something went wrong with the installation. For me it uses all the cores. Try manually setting the OMP_NUM_THREADS environment variable.
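As a minimal sketch, assuming a standard Torch7 install where torch.setnumthreads() controls the OpenMP/BLAS thread pool, the thread count can also be checked and overridden from Lua (the environment variable can alternatively be prefixed to the command, e.g. OMP_NUM_THREADS=8 th test.lua ...):

require 'torch'

-- ask Torch/OpenMP for 8 worker threads
torch.setnumthreads(8)

-- verify the setting took effect
print(torch.getnumthreads())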

Thank you @DmitryUlyanov, but I had no luck with OMP_NUM_THREADS=64.
Torch does report the number of threads as 64:

th> print(torch.getnumthreads())
64

but it still runs on one CPU core. Any advice would be appreciated.

If I run the code below, it uses all 64 cores:

require 'torch'

-- two 1000x1000 matrices; torch.mm dispatches to BLAS,
-- which is where the multi-threading should happen
local a = torch.FloatTensor(1000, 1000):uniform()
local b = torch.FloatTensor(1000, 1000):uniform()

for i = 1, 1000 do
  local c = torch.mm(a, b)
end
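A rough way to extend that check, assuming torch.setnumthreads() is respected by the installed BLAS, is to time the same multiplication at different thread counts and compare:

require 'torch'

local a = torch.FloatTensor(2000, 2000):uniform()
local b = torch.FloatTensor(2000, 2000):uniform()

-- multiply the same pair of matrices n times, return wall-clock seconds
local function bench(nthreads, n)
  torch.setnumthreads(nthreads)
  local timer = torch.Timer()
  for i = 1, n do
    torch.mm(a, b)
  end
  return timer:time().real
end

local all = torch.getnumthreads()  -- remember the default thread count
print('1 thread :', bench(1, 50))
print(all .. ' threads:', bench(all, 50))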

This is strange. The convolution implementation uses matrix multiplication, so the neural nets should be parallel as well.

It's strange that all of my CPUs are used with neural-style, but not with texture_nets.
I still don't know why.

Hm, it could be because of the threads used for the data loader. I have no idea how to deal with it.

Try reinstalling numpy using the code from GitHub. Also check which BLAS library you have installed and report your findings here.

I have a problem with CPU load too. However, I've found that the load goes up when the batch size is increased: by default the batch size is 4 and CPU load peaks at ~400%, but with a batch size of 12 it peaks at ~1200%. Unfortunately this pushes memory use up as well, so even if you can afford a larger batch size it's not a great way to get the best CPU performance, but it's probably better than leaving the other cores idle.
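For anyone trying this, the batch size is passed to the training script on the command line; I'm assuming the flag is called -batch_size with a default of 4 (matching the numbers above), so check th train.lua -help for the exact name:

th train.lua <your usual options> -batch_size 12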

@0000sir By the way, this behaviour explains the performance you observed.