ivan-vasilev / neuralnetworks

Java deep learning algorithms and deep neural networks with GPU acceleration

neuralnetworks is 2000 times slower using GPU than Theano using CPU

joelself opened this issue

At first I couldn't get GPU runtimes to be any faster than CPU runtimes with neuralnetworks. Eventually I did get the GPU to run faster, but only by building huge networks that take forever to complete. For example, I modified the testLenetSmall function to use this network:

NeuralNetworkImpl nn = NNFactory.convNN(new int[][] { { 28, 28, 1 }, { 5, 5, 120, 1 }, { 2, 2 }, { 5, 5, 120, 1 }, { 2, 2 }, { 3, 3, 120, 1 }, { 2, 2 }, { 2048 }, { 2048 }, { 10 } }, true);

Basically, I added a third convolutional layer, bumped the number of filters in all convolutional layers up to 120 (from 20 and 50), quadrupled the neurons in the final hidden layer, and added another hidden layer with 2048 neurons. The GPU-enabled version runs about 2.4 times faster, but it's still dog slow, taking something like 12-14 seconds per batch (the batch size is 1), so training on the entire dataset of 60,000 images would take 8.3 to 9.7 days. So roughly 10 days per epoch on the GPU. Meanwhile, I built a comparable network in Lasagne/Theano, and it takes around 420 seconds per epoch on the CPU (in a VM, at that), which is about 2000 times faster.
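For reference, here is my own quick check of the arithmetic behind those estimates, using only the numbers quoted above (12-14 s per batch, batch size 1, 60,000 images, 420 s per Theano epoch):

```java
// Sanity-check of the epoch-time and speedup figures quoted in this report.
public class SpeedupCheck {
    public static void main(String[] args) {
        int images = 60_000;               // MNIST training set, batch size 1
        double loSec = 12.0 * images;      // 12 s per batch -> seconds per epoch
        double hiSec = 14.0 * images;      // 14 s per batch -> seconds per epoch
        System.out.printf("epoch: %.1f to %.1f days%n",
                loSec / 86_400, hiSec / 86_400);           // 8.3 to 9.7 days

        double theanoEpoch = 420.0;        // Lasagne/Theano on CPU, per epoch
        System.out.printf("speedup: %.0fx to %.0fx%n",
                loSec / theanoEpoch, hiSec / theanoEpoch); // ~1714x to 2000x
    }
}
```

The "2000 times" claim corresponds to the 14 s/batch end of the range; at 12 s/batch it is closer to 1700x.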

An answer, with a huge delay for which I don't have an excuse. There are several reasons why this library is significantly slower than other libraries (including Theano), and they all relate to the use of Aparapi:

  • The memory management in Aparapi is very limited. Most other libraries use highly optimized GPU kernels, usually implemented with CUDA (cuDNN being by far the most popular). CUDA makes it possible to rearrange the GPU arrays in ways that greatly increase computational speed (this is especially true for convolutional operations). Aparapi simply doesn't offer that.
  • Another serious limitation of Aparapi is that when there is a chain of GPU kernels (operations) in which the input of one operation is the output of the previous one (which is the case with most neural networks), it is not possible to keep this communication within the GPU. The output of one operation is first transferred from GPU memory to main RAM and then transferred back from RAM to GPU memory to serve as the input of the next operation. Unfortunately, this greatly reduces performance.
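To make the first point concrete: the "rearranging" that CUDA-based libraries do is typically im2col, which unrolls every KxK patch of the input into a column so that convolution becomes one large, GPU-friendly matrix multiply. A minimal single-channel sketch (my own illustration, not code from this library):

```java
import java.util.Arrays;

// im2col sketch: unroll a single-channel H x W image into a (k*k) x (outH*outW)
// matrix, so that convolving with a k x k filter reduces to a dot product per column.
public class Im2ColSketch {
    static float[][] im2col(float[][] img, int k) {
        int h = img.length, w = img[0].length;
        int outH = h - k + 1, outW = w - k + 1;       // valid convolution, stride 1
        float[][] cols = new float[k * k][outH * outW];
        for (int y = 0; y < outH; y++)
            for (int x = 0; x < outW; x++) {
                int col = y * outW + x;               // one column per output pixel
                for (int ky = 0; ky < k; ky++)
                    for (int kx = 0; kx < k; kx++)
                        cols[ky * k + kx][col] = img[y + ky][x + kx];
            }
        return cols;
    }

    public static void main(String[] args) {
        float[][] img = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        float[][] cols = im2col(img, 2);              // 4 x 4 matrix

        // Convolution is now a dot product of the flattened filter with each column.
        float[] filter = {1, 0, 0, 1};                // top-left + bottom-right of each patch
        float[] out = new float[cols[0].length];
        for (int c = 0; c < out.length; c++)
            for (int i = 0; i < filter.length; i++)
                out[c] += filter[i] * cols[i][c];
        System.out.println(Arrays.toString(out));     // [6.0, 8.0, 12.0, 14.0]
    }
}
```

The trade-off is extra memory for the unrolled matrix in exchange for a single dense matrix multiply, which GPUs execute far more efficiently than many small, strided reads.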

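The cost of the second point can be seen with a toy model (my own illustration, with hypothetical per-layer timings): with per-kernel round-trips, every one of L chained operations pays a device-to-host and a host-to-device copy, whereas a library that keeps intermediates on the GPU pays the transfer cost only once at each end of the chain.

```java
// Toy cost model of chained GPU kernels: per-layer host round-trips vs.
// keeping all intermediate results resident on the GPU.
public class TransferCostModel {
    static double chainTime(int layers, double computeMs, double transferMs,
                            boolean staysOnGpu) {
        return staysOnGpu
                ? layers * computeMs + 2 * transferMs     // one copy in, one copy out
                : layers * (computeMs + 2 * transferMs);  // round-trip per layer
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1 ms of compute, 5 ms of PCIe copying per direction.
        double naive = chainTime(10, 1.0, 5.0, false);
        double fused = chainTime(10, 1.0, 5.0, true);
        System.out.println(naive + " ms vs " + fused + " ms"); // 110.0 ms vs 20.0 ms
    }
}
```

With these (made-up) numbers the transfers dominate by more than 5x, and the gap grows with network depth, which matches the behavior described in the report.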
In conclusion, I would say that when I started working on the library I was not aware of any of these limitations; my goal was to introduce myself to the deep learning field and produce something meaningful at the same time. Additionally, I tried to create something that could run on any hardware, hence the use of Java and OpenCL. At that time cuDNN didn't exist, and the only other deep learning library I was aware of was cuda-convnet, so I didn't have much of a choice anyway. I hope that someday in the future I will be able to port the library to use cuDNN.