soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets

Benchmark TensorFlow

soumith opened this issue · comments

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the ImageNet winners.
When I saw issues with the numbers, memory, etc., I emailed @Yangqing to confirm that what I'm seeing is expected.

With that disclaimer out of the way, here are some things that you should know about TensorFlow (as of the pip version that I installed today):

  • in-place ReLU seems non-existent in practice.
    • Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
  • Supports CuDNN R2. No R3 support yet; Yangqing says the next version they are going to support is likely R4.

Coming to the benchmarks:

  • GoogLeNet with batch size 128 goes out of memory. The largest batch size I could fit is 16 (tried 16, 32, 64, 128).
  • VGG with batch size 64 goes out of memory (edit: the VGG memory issue was solved by using the BFC allocator updated by GOOG). The largest batch size I could fit is 32 (tried 32, 64).
  • I've also computed Torch7+CuDNN-R2 baselines for these batch sizes.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| TensorFlow | 326 | 96 | 230 |

Overfeat [fast] - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow | 1084 | 316 | 768 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow | 1840 | 545 | 1295 |

GoogleNet V1 - Input 16x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R2 (Torch) | 564 | 174 | 390 |
| TensorFlow | 590 | 54 | 536 |

Note that at a batch size of 16, GoogLeNet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison that isn't practically very interesting or encouraging.

There you go.

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

The benchmark scripts and raw outputs are located here: https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow

The lack of in-place operations is rather surprising. Once you have the full DAG, it should be fairly easy to apply a liveness algorithm to it to optimize tensor allocations. For an example see this: http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf (just replace "register" with "tensor").
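To make the liveness idea concrete, here is a minimal, framework-agnostic Python sketch (hypothetical data structures; this is not TensorFlow's or Torch's actual allocator): given a topologically sorted op list, compute each tensor's last use, then greedily hand buffers freed by dead tensors to later outputs.

    def plan_buffers(ops):
        """ops: topologically sorted list of (name, input_tensors, output_tensor, output_nbytes)."""
        last_use = {}
        for i, (_, inputs, _, _) in enumerate(ops):
            for t in inputs:
                last_use[t] = i

        size_of = {}     # tensor -> size in bytes
        assignment = {}  # tensor -> buffer id
        free = []        # (nbytes, buffer_id) pairs currently unused
        next_id = 0

        for i, (_, inputs, out, nbytes) in enumerate(ops):
            size_of[out] = nbytes
            # greedily reuse any free buffer that is large enough
            fit = next((b for b in free if b[0] >= nbytes), None)
            if fit is not None:
                free.remove(fit)
                assignment[out] = fit[1]
            else:
                assignment[out] = next_id
                next_id += 1
            # tensors whose last consumer is this op are now dead; recycle their buffers
            for t in inputs:
                if last_use.get(t) == i and t in assignment:
                    free.append((size_of[t], assignment[t]))
        return assignment

    ops = [('conv1', ['data'], 'act1', 4 << 20),
           ('relu1', ['act1'], 'act2', 4 << 20),
           ('conv2', ['act2'], 'act3', 4 << 20)]
    print(plan_buffers(ops))  # {'act1': 0, 'act2': 1, 'act3': 0} -- act3 reuses act1's buffer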

I'm kind of curious if there's any support for automatically compounding operations together, or for leveraging kernels that have some compounding built in (like the alpha/beta params of gemm). I'm pretty close to maximizing the amount of compounding that's possible in my benchmark networks. And because I write all my own kernels, I can further compound things that aren't possible with closed-source libraries like cuDNN. For example, I'm now able to compute the mean along the PQN dimension inside the conv and gemm kernels at no cost. This cuts down the bandwidth required by batch norm in fprop by a third.

Though on the whole I think TensorFlow seems like a great platform to build on. I'd say there's a good chance my kernels will make their way there sooner rather than later. You can find new benchmarks of my latest winograd kernels in the updated paper here: http://arxiv.org/abs/1509.09308

What I'll be working on next is basically going to be taking a lot of what I learned implementing winograd and refreshing all of my conv/pooling/gemm kernels to support very small minibatches at near full utilization. This should have a big impact on the level at which you can scale these networks and the speed at which they converge. Here's a great paper exploring this: http://arxiv.org/abs/1509.04210

Hi, I strongly recommend adding mxnet (https://github.com/dmlc/mxnet) to the comparison; in my opinion it may be the fastest DL library :)

+1 for benchmarking mxnet, the fastest now.

+1 for benchmarking mxnet

I would also love to see a comparison with Theano http://deeplearning.net/software/theano/ as it is another widely adopted deep learning library.

Thanks for benchmarking!

+1 would love to see tensorflow benchmarked against mxnet, Theano, Autograd for Torch, and Caffe.

Thanks @soumith! Yes, our only launch criterion for convnets was 'GoogLeNet within distance from CuDNN[R2]', and we've punted on a lot of performance work, including upgrading CuDNN, until after the initial release. Expect a lot of movement on that front in the coming weeks.

@aaronwro @fvisin it's already benchmarked against Torch, Theano, Caffe. Look at the readme on the main page ( https://github.com/soumith/convnet-benchmarks/blob/master/README.md ).
I definitely need to pull my socks up and benchmark MXNet and Chainer.

@vincentvanhoucke thanks for your response. I assumed that you'll fix it over the next weeks / months :)

@scott-gray let us know if you need help with compounding / graph rewriting. The graph representation is meant to make these kinds of operations possible, and the common subexpression elimination that TF currently uses is also meant as a demonstration of that. I suspect we might need to do a bit more to provide good APIs to make it easier to bake in compound kernels.

there seems to be some misinterpretation by random people in social media that because I work for Facebook, I'm attacking TensorFlow. That seems super weird, because I love the vision of TensorFlow, and there's no competition (one can write a XXX frontend for a TensorFlow backend).

My benchmarks have always been independently run and completely neutral. I've been running them forever now; it's sad that people misinterpret the slightest of things.
cc: @vincentvanhoucke

I will defend Soumith on this one: he has indeed been running these benchmarks for quite some time, and with complete neutrality.

@soumith Excellent, thank you!!

@soumith no good deed goes unpunished ;) Please don't let this deter you from providing this valuable service to the community!

@soumith , I am sorry that some people interpreted things that way. I've always appreciated your benchmark, which creates a great atmosphere for us to look at bottlenecks and push forward the field as a whole community. We all owe you a big debt of gratitude.

As always, that's super interesting. Thanks for pushing all of us toward more performance.

For memory optimizations, what we have found is that in-place optimization does not matter that much if the allocator is smart enough to do a static allocation before running the graph (as opposed to relying on a dynamic allocator). We have detailed what can be done here:

https://mxnet.readthedocs.org/en/latest/developer-guide/note_memory.html

I assume this applies to computation-graph frameworks such as TF, Caffe2 and CGT.
@vincentvanhoucke @Yangqing

The general idea is to share memory not only between tensors of the same shape (i.e. in-place), but also between tensors of different shapes and sizes.

@soumith Thanks for running the benchmarks! As @vincentvanhoucke noted in this thread, our goal was to get an early release out so users can start playing with it and provide feedback on what they care about. We are committed to making TensorFlow fast and are actively working on the performance issues you highlight here.

@soumith You're doing a good deed! Haters gonna hate.

I'm a little confused by the numbers. 1300 samples/sec seems too fast even for AlexNet on a single Titan X. Is this standard training, i.e. IO + forward + backward + update, or just forward + backward?

Nice work.

@piiswrong I will help @soumith make the benchmark script.

Anyway, we have opened up everything from the beginning. The main purpose is to learn from each other, not to advertise boring numbers.

I will also add my support to Soumith. He has been running these benchmarks for some time with complete transparency and neutrality.

@koraykv +1, thanks Soumith!

Someone on reddit suggested that I build TensorFlow from source to fix the speed issues. That did not help; it produces the same numbers as the pip version on my AlexNet script:

https://gist.github.com/soumith/11acc2f0dbc5212ea372

FWIW, Yangqing's fix to avoid CPU-GPU transfers improved results across the board by ~20%. (I've updated the tables above). The memory issues are unchanged.

+1 for mxnet! Thanks.

+1 for mxnet.

@soumith I have a naive question: is the TensorFlow result based on C++ code or on cuDNN v2? I would guess that if you run on a Titan X, TensorFlow will rely on some CUDA library?

@gujunli it's based on CuDNN V2.

@soumith thanks for running and maintaining these benchmarks; they're always thorough and informative!

@soumith Then I don't understand why TensorFlow with cuDNN v2 ends up being so slow. Can you share some of your understanding? I would guess TF still calls cuDNN v2 for the conv/pool/relu/FC layers. Remember from your earlier AlexNet results: cuDNN v2 is 231 = 70 + 161 and Caffe (native) ConvolutionLayer is 324 = 121 + 203, yet TensorFlow is 326 = 96 + 230.

Running the network under nvvp (the NVIDIA Visual Profiler) should be pretty informative. A well-tuned network timeline should just be a solid block of kernel calls with no gaps.

@scott-gray so you think TF scheduling may not be efficient? I need to read the TF whitepaper to understand how it works. Does anyone understand it?

@gujunli I'm just saying if they're just using stock cuDNNv2 then the only reason it would be slower is if there were gaps in the timeline. Seeing where those gaps occur and any extra host/device memcpy traffic would give you a clearer picture of what's going wrong.

@soumith Thanks for this and all the other previous benchmark you took the time to create.

+1 for MxNet

+1 for mxnet! Thank you so much!!!

@gujunli @scott-gray To provide some historical perspective: this is mostly due to legacy choices. Historically, Google Brain has been using the NHWC storage order and a slightly different padding scheme ("SAME/VALID" instead of an explicit padding number). CuDNN, as well as Caffe, uses NCHW order. Note that CuDNN supports NHWC interface-wise, but some underlying paths are not implemented, like NHWC convolution backward.

As a result, when calling cuDNN, there is some code that generates padded and order-switched intermediate tensors. That code was written with Eigen and did not interact very well with nvcc, causing nontrivial overhead (you can observe this by running the benchmark in an nvvp session as Scott suggested).

We have people looking into this, and the performance should be brought up to cuDNN level.

Gah, everyone's using different tensor layouts still. You all need to turn from the dark side and see the speed benefits of using CHWN. Though NHWC is probably better than NCHW at least. You want that inner dimension to be a nice even number to facilitate cleaner aligned memory access, leading to less over-fetch. CHWN gets you better contiguous memory access overall. In recurrent networks with model parallelism, having N as the outer dim definitely helps, but most distributed convnets are data parallel, where it doesn't matter.

I have some very fast shared memory dimshuffle code if you want it. I use it to make this operation on the filters:

# C <=> K and mirror R,S
F = np.transpose(F[:,::-1,::-1,:], (3,1,2,0))

Turns out a kernel for fprop_conv can work with very little change (or no change if padding and striding are symmetric) to be a kernel for bprop_conv. There's almost no overhead in the dimshuffle since the filters are so small and you completely avoid any atomic adds.

Krizhevsky first demonstrated the benefits of using the CHWN layout in cuda-convnet. In addition to being advantageous for convolutional kernels, it's very beneficial for models like GoogLeNet where inception modules concatenate activations along feature-map depth. Using CHWN allows you to write directly into an output buffer in the layout that the subsequent layer will consume: (C1 + C2 + C3)HWN.
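A small NumPy sketch of that concatenation point, with made-up shapes (illustrative only): in CHWN the channel axis is outermost, so each inception branch can write its slice of a pre-allocated output buffer directly, whereas with N outermost a separate concat copy is needed.

    import numpy as np

    H = W = 28; N = 128                   # spatial size and batch (illustrative)
    C1, C2, C3 = 64, 96, 32               # feature maps of three inception branches

    # CHWN: the concatenated output is one buffer; each branch writes into its
    # own contiguous slice along C, so no extra copy is needed afterwards.
    out = np.empty((C1 + C2 + C3, H, W, N), dtype=np.float32)
    branch1, branch2, branch3 = out[:C1], out[C1:C1 + C2], out[C1 + C2:]
    branch1[...] = 0.1                    # stand-ins for each branch's conv output
    branch2[...] = 0.2
    branch3[...] = 0.3

    # NCHW: the concat axis is not outermost, so the branch outputs end up
    # interleaved per image and an explicit concatenation pass is required.
    nchw = [np.zeros((N, c, H, W), dtype=np.float32) for c in (C1, C2, C3)]
    merged = np.concatenate(nchw, axis=1)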

Thanks @scott-gray - having the dimshuffle kernels to improve performance will be great.

One potential issue with CHWN is that at inference time N is often small, so there are two different sets of optimizations to be carried out for large N and small N. NCHW/NHWC usually makes things a bit more batch-agnostic, but that's not always true, of course.

@soumith Regarding the memory issue, we found that if one turns on the best-fit GPU allocator, you are able to run VGG with a batch size of 64. I made a quick change if you would like to build and try from source:

git clone https://github.com/Yangqing/tensorflow.git
cd tensorflow
git checkout bfc

There will be more fixes from @vrv to enable it more easily (such as at session creation time) down the road.

@Yangqing The shuffle code is here (note that this does not do the RS mirror operation):
https://github.com/NervanaSystems/neon/blob/master/neon/backends/float_ew.py#L1481

It uses magic numbers for fast integer division. Here's the code that sets up the kernel params:
https://github.com/NervanaSystems/neon/blob/master/neon/backends/layer_gpu.py#L504

The code is adapted from here (the diagram will be helpful):
http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

It's on my list of things to do to generalize it and make it available as a flexible backend operation. But I haven't gotten to it. Theano may also have some good dimshuffle code you can borrow.

Also, most of the code in that float_ew file is devoted to automatically generating extremely efficient compound elementwise/reduction/broadcast/transpose/take operations. It allows you to write complex numpy expressions and have them compile to a single CUDA kernel. It even does common sub-expression removal, but it sounds like you already have that. This all works off of little optrees that exist in layer code. But I've been meaning to find a way to collect the full program DAG in a clean way. It seems like you guys solved that, and that's why I'm interested in TensorFlow. There's so much burden you can shift from the programmer and have automatically optimized via graph traversals.

+1 for mxnet. Dynamic GPU memory allocation does have a big impact on performance. A simple memory allocator can significantly reduce the overhead. A smarter allocator which reuses blocks with best-fit can almost eliminate the overhead completely.
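As a rough illustration of what a caching best-fit allocator buys over hitting the driver on every allocation, here is a toy Python sketch (hypothetical raw_alloc backend; no splitting, coalescing or alignment handling, unlike a real allocator such as TensorFlow's BFC):

    import bisect

    class BestFitPool:
        """Toy caching allocator: free() returns blocks to a pool and alloc()
        reuses the smallest cached block that fits, only falling back to the
        slow raw allocator on a miss."""
        def __init__(self, raw_alloc):
            self.raw_alloc = raw_alloc
            self.free_blocks = []                 # sorted list of (size, handle)

        def alloc(self, size):
            i = bisect.bisect_left(self.free_blocks, (size,))
            if i < len(self.free_blocks):         # best fit: smallest block >= size
                return self.free_blocks.pop(i)[1]
            return self.raw_alloc(size)           # slow path (think cudaMalloc)

        def free(self, size, handle):
            bisect.insort(self.free_blocks, (size, handle))

    pool = BestFitPool(raw_alloc=lambda size: bytearray(size))
    buf = pool.alloc(4 << 20)    # miss: goes to the raw allocator
    pool.free(4 << 20, buf)
    buf2 = pool.alloc(3 << 20)   # hit: reuses the cached 4 MB block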

@soumith I just pushed tensorflow/tensorflow@1d76583, which should allow you to use our best-fit-with-coalescing allocator via the ConfigProto.

Example usage here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/alexnet/alexnet_benchmark.py#L201

We were able to get some of the larger batch sizes working with the BFC Allocator, so probably worth a try.

(We plan to make the BFC allocator the default soon, but it's not fully ready yet to be the default).
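If I'm reading the linked alexnet_benchmark.py example right, opting in looks roughly like this with the 0.6-era Python API (a sketch; the exact plumbing may change once BFC becomes the default):

    import tensorflow as tf

    # Ask for the best-fit-with-coalescing allocator via the session config.
    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'

    with tf.Session(config=config) as sess:
        # build and run the model as usual; here just a trivial op
        print(sess.run(tf.constant('BFC allocator enabled')))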

Thanks a lot @soumith for the numbers, super useful!

The creators of cuDNN [1] may help with the performance optimization. @bryancatanzaro

[1] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759, 2014.

hjk41 wrote:

+1 for mxnet. Dynamic GPU memory allocation does have a big impact on performance. A simple memory allocator can significantly reduce the overhead. A smarter allocator which reuses blocks with best-fit can almost eliminate the overhead completely.

Hi mxnet guys, the title of this thread is 'Benchmark TensorFlow' ;-) I think you could create a new issue to request mxnet, using https://github.com/soumith/convnet-benchmarks/issues/new

[edit: looks like Soumith has created an issue for mxnet here: #68]

I was curious about the performance of TensorFlow using CUDA 7.5 and cuDNN 7.0. I modified the build to use them and rebuilt from source. I then ran @soumith's benchmark scripts for AlexNet and Overfeat.
My PC is getting old (Intel Core 2 quad), 16 GB RAM, NVIDIA Titan X, Ubuntu 15.04 x86_64

Alexnet: forward/backward 290ms, forward 78ms. (1.12x improvement for f/b)
Overfeat: forward/backward 1040ms, forward 264ms (1.04x improvement for f/b).

So not much of a speedup by only swapping the CUDA libraries.

@soumith, one thing I did notice is that your benchmarks for Caffe are quite different from what I got using CUDA 7.5. I think the benchmarks you used are with CUDA 7.0, right? When I ran your "run_imagenet.sh" script on my setup, I got much better results.
Alexnet: forward/backward 171ms, forward 41ms ( 1.89x improvement for F/B)
Overfeat: forward/backward 601ms, forward 133ms ( 1.37x improvement)
Googlenet: F/B 624ms, F 174ms ( 3.1x improvement)

It's not clear to me if CUDA 7.5 is supported for Caffe, but at https://github.com/BVLC/caffe/wiki/Installation they provide 7.0 and 7.5 Docker images. However, in the main installation instructions they say to use 7.0.

I attached the log files for his benchmarks running on my PC.

Tensorflow:

tensorflow_alexnet.txt
tensorflow_overfeat.txt

Caffe:

output_alexnet.txt
output_googlenet.txt
output_overfeat.txt
output_vgg_a.txt

@soumith Thanks for an invaluable community service!

Nice!

@soumith Thank you for benchmarking!!

@scott-gray Sorry to spam @soumith's TF discussions, but when I last played with integer division via magic-number mul-and-shift on GPU, the performance I got (on a K40 though) was about the same as straightforward division by an unsigned int32; the compiler seemed to perform strength reductions in this case. However, there was lower register usage, so using this technique in a kernel that actually does other things (like transposition) would probably help. This was at the SASS level on CUDA 6.5, though.

facebookarchive/fbcuda@d5c8b38

The compiler does a good job when the constant is known in advance. Was that the case for you?

Nice! Thank you @soumith

@wickedfoo The advantage of calculating the magic numbers manually is that the divisor is typically parameterized, so the compiler can't compute magic numbers ahead of time. It then falls back to using the floating-point rcp operator and doing a bunch of corrections to make up for the potentially shorter mantissa of float (23 vs 32 bits).

To do an integer division and modulus with magic numbers reduces to just this code:

// j   = jrst / RST
// rst = jrst % RST
int j   = jrst * magic_RST; j >>= shift_RST;
int rst = jrst - j * RST;

If you know all those numbers fit in 16 bits, you can use vmad from PTX or SASS. That looks like this:

VMAD.U16.U16 j, jrst, magic_RST, RZ;
SHR.U32      j, j, shift_RST;
VMAD.U16.U16 rst, -j, RST, jrst;

Otherwise your multiplications are going to expand out to 3 XMADs each, regardless of the datatype used. It would be nice if the compiler were a little smarter about multiplication by using the minimal number of instructions for the given data types.
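For reference, the magic/shift pairs themselves are typically precomputed on the host. Here is a Python sketch of the standard unsigned-division construction (Hacker's Delight style, similar in spirit to what Neon's backend does; details may differ from any particular implementation):

    def magic32(nmax, d):
        """Find (magic, shift) so that n // d == (n * magic) >> shift for 0 <= n <= nmax."""
        nc = ((nmax + 1) // d) * d - 1
        for p in range(0, 2 * nmax.bit_length() + 1):
            if 2 ** p > nc * (d - 1 - (2 ** p - 1) % d):
                magic = (2 ** p + d - 1 - (2 ** p - 1) % d) // d
                return magic, p   # the kernel also needs magic to fit its mul width
        raise ValueError("no magic number found")

    # quick self-check against Python's integer division
    RST = 75                                  # e.g. C=3 filters of 5x5 -> RST = 75
    magic_RST, shift_RST = magic32(1 << 20, RST)
    for jrst in (0, 1, RST - 1, RST, 12345, 1 << 20):
        j = (jrst * magic_RST) >> shift_RST   # j   = jrst / RST
        rst = jrst - j * RST                  # rst = jrst % RST
        assert (j, rst) == divmod(jrst, RST)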

For larger values that might require 64 bit math, I use something like this:

      MOV  magicPQ,    param_magic_PQ;
      IADD negPQ, RZ, -param_grid_PQ;

      ISETP.NE.AND P1, PT, magicPQ, 1, PT;

      // m = blkMPQ / PQ
  @P1 XMAD     div1, blkMPQ,    magicPQ,    RZ;
  @P1 XMAD     div2, blkMPQ,    magicPQ.H1, RZ;
  @P1 XMAD     div3, blkMPQ.H1, magicPQ.H1, RZ;
  @P1 XMAD.CHI div1, blkMPQ.H1, magicPQ,    div1;
  @P1 IADD3.RS m, div1, div2, div3;
  @P1 SHR.U32  m, m,      param_shift_PQ;
 @!P1 SHR.U32  m, blkMPQ, param_shift_PQ;

      // pq = blkMPQ % PQ
      XMAD pq, negPQ, m, blkMPQ;
      XMAD.PSL pq, negPQ.H1, m, pq;

Integer division is essential for these multi-dimensional tensors where you can't fit everything in just 3 block coordinates. For more advanced uses, you can leverage it to pack all your coordinates into a single blockIdx.x value, then completely remap the order in which the indexes are scheduled. I'm able to achieve 95% L2 hit rates using this in my winograd kernels. This is essential for good performance as the small 32x32 batched gemm tile is pretty high bandwidth.

Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.

@scott-gray TensorFlow already uses fast integer division using the code in http://github.com/tensorflow/tensorflow/blob/master/third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorIntDiv.h. One of the issues is that a lot of the TensorFlow kernels use 64-bit integers to index tensors, which ends up slowing things down on the GPU. This is being fixed.

karenyyng wrote:

Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.

Ideally, they are all learning the exact same model, so the outputs should be identical (to within the bounds of rounding accuracy). A correctness check is not a bad idea though.
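A minimal version of such a correctness check, assuming you can dump the outputs of two implementations (run on identical inputs and weights) as NumPy arrays:

    import numpy as np

    def check_outputs(out_a, out_b, rtol=1e-3, atol=1e-4):
        # fp32 convolutions accumulate rounding differently depending on the
        # algorithm and summation order, so compare with a tolerance, not bit-exactly.
        assert out_a.shape == out_b.shape
        print('max abs diff:', np.abs(out_a - out_b).max())
        return np.allclose(out_a, out_b, rtol=rtol, atol=atol)

    # e.g. with arrays dumped from two frameworks (hypothetical file names):
    # assert check_outputs(np.load('fprop_torch.npy'), np.load('fprop_tensorflow.npy'))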

@vrv thanks a lot, trying out the BFC allocator now for vgg and googlenet models.

@ces-bertino the numbers I have with Caffe are with Caffe's native kernels (that's why the entry is marked as "Caffe (native)"). I presume you have CuDNN, and hence see the speedups. To compare your numbers, look at the entry marked CuDNN.

@vrv updated the table for VGG.
GoogLeNet still goes OOM at batch size 128, but if my memory is right, it's really tight to fit GoogLeNet at batch size 128 in 12 GB, and one needs in-place ops for sure.

@Yangqing, @benoitsteiner do you have a sense for how performance for these benchmarks depends on nvcc vs gpucc? Are the 10%-50% numbers in http://llvm.org/devmtg/2015-10/slides/Wu-OptimizingLLVMforGPGPU.pdf for ic1/ic2 applicable here?

@ajtulloch gpucc does hide these latency issues vs nvcc as it seems to do a much better job at optimization. Using gpucc brings TensorFlow pretty close to the cuDNN[R2] numbers for AlexNet.
We are working on bridging that gap for nvcc by addressing a number of specific issues that @benoitsteiner and @Yangqing mentioned earlier.

@ajtulloch The 2 main reasons why gpucc can generate faster code than nvcc are:

  • The fact that gpucc can replace 64-bit integer divisions with 32-bit divisions if the values stored in the 64-bit integers can actually fit in 32 bits. As we update the TensorFlow convolution kernels to use 32-bit indices, the performance of the code generated by nvcc will start to approach that of the code generated by gpucc.
  • The fact that clang supports C++11 constant expressions much better than nvcc. Constant expressions allow us to generate much more efficient CUDA kernels. Unfortunately, for the time being we have to disable this feature since the corresponding code doesn't compile with nvcc. I am rewriting the corresponding code to make it compatible with nvcc 7.5, and hopefully with nvcc 7.0 as well.

@rajatmonga, @benoitsteiner that makes sense, thanks for that.

@benoitsteiner I'm curious how you guys are using integer division in your implementation. The only places I find a need to use it are in custom kernels where I'm unpacking multiple coordinates from a compound index.

On a related note, I should mention that I also have another simple technique I developed for when you don't know ahead of time the value of the divisor. It looks something like this:

// rcpRST = 1 / RST
I2F.F32.S32 rcpRST, RST;
MUFU.RCP rcpRST, rcpRST;

// c = crst / RST
I2F.F32.S32 crst, crst;
FMUL c, crst, rcpRST;
FFMA c, c, 5.9604644775390625e-08, c;
F2I.S32.F32.TRUNC c, c;

// rst = crst % RST
VMAD.U16.U16 rst, -c, RST, crst;

For most values the floating point reciprocal gets you the correct value. It's just when the numerator and denominator are very close that you need to correct for the missing precision in float32. This is a lot less code than the compiler would generate and is accurate for the range of values I need it for.
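The same trick written out in Python, emulating float32 with NumPy (a sketch only: MUFU.RCP is a lower-precision approximation than a correctly rounded divide, and as noted above the result is only guaranteed for a limited range of numerators):

    import numpy as np

    def fast_divmod(crst, RST):
        rcp = np.float32(1.0) / np.float32(RST)          # reciprocal (MUFU.RCP analogue)
        c = np.float32(crst) * rcp
        c = c + c * np.float32(5.9604644775390625e-08)   # bump by ~2**-24 to fix exact multiples
        c = int(c)                                       # truncate (F2I ... .TRUNC)
        return c, crst - c * RST

    # spot-check against exact integer divmod over a modest range
    bad = [n for n in range(1 << 16) if fast_divmod(n, 75) != divmod(n, 75)]
    print('mismatches in [0, 2**16):', len(bad))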

are in custom kernels where I'm unpacking multiple coordinates from a compound index

Somewhat related: a question I've been wondering about and never quite got around to measuring: if we have two 8-bit integers, is it faster to store them in separate registers/variables, or to pack them into one register/variable using bit-shifting? (Edit: I suppose this is a bit vague, since it entirely depends on how they're being used... but the trade-off I'm thinking about is: packing multiple values into a few registers will reduce register pressure, but maybe the increase in processing time from all the bit-shifting offsets any benefit?) (Edit 2: I suppose what I mean is, are there any best practices/guidelines as far as this goes?)

If you have enough registers, do not pack the 8-bit numbers and use one register per element. Now, how do we define "enough registers"? Well, if the occupancy you get allows you to have enough warp parallelism (together with enough instruction level parallelism) to cover the latencies, you are good. In general, unless you have a clear use case, do not pack.

That would completely depend on the context in which you are using them. If you're short on register space, packing them might avoid some register spilling. Otherwise it's probably better to keep them separate. I'd also take a look at the video instructions like VMAD, VADD, VABSDIFF, etc. These can operate directly on packed 8-bit values, but in this mode these instructions are unfortunately only half throughput. Maybe this isn't a big deal for your application, but if you wanted to write a super efficient 8-bit gemm core, they're not ideal. These instructions are full throughput with packed 16-bit values, and that is very interesting... at least until Pascal rolls out with native fp16 support (or if you get hold of an sm_53 X1).

Looks like @jdemouth beat me to it.

Thanks! :-)

@scott-gray We use integer division in order to extract the individual coordinates of a tensor coefficient from its compound index. We often use compound indices for 2 reasons:

  • they are independent from the rank and the shape of the tensor. This simplifies the fusion of primitive tensor operations. For example, if you reshape a 4D tensor into a 3D tensor all the coordinates need to be adjusted, but the compound indices remain the same.
  • they save registers compared to using individual coordinates. This often makes a significant difference on CPUs which don't have nearly as many registers as GPUs.

@benoitsteiner Ok, that makes sense now. For basic elementwise operations our backend just automatically reshapes all tensors involved in the kernel to the most efficient 2d shape. For broadcast/reduction/take/transpose type operations, it only currently supports those in 2d and requires the user to reshape things prior to performing those ops. This covers 99% of the use cases we've encountered but it sometimes does place a little extra burden on the user. On the other hand it is extremely fast. Sounds like you guys are shooting for much more general ndarray support in which case what you're doing sounds ideal.

Disclaimer: I am totally new to tensorflow and cudnn, so I may not know what I am doing, but I'm very keen :)

So I built from source, then realised that I already had R3 installed; I did what any other sensible person would do and replaced all R2 references with R3, and all seems well as far as running the included models goes.

@soumith @Yangqing, am I setting myself up for trouble here? One word will suffice :)

@milijan You should be running fine. R3 seems to be binary compatible in the sense that most of the functions in R2 still exist in R3. I think R4 may break such a hack because it will deprecate a few functions.

In case you are wondering, the reason you are not seeing any speedup by going to R3 may be as follows: in TensorFlow we hard-code the cuDNN algorithm to NO_WORKSPACE, so some faster convolution paths are not being selected for now. Upcoming changes should further speed things up.

@Yangqing thanks! 👍

A question about the GoogleNet batch size.

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128)

I can use up to 640 images per batch, using the graph from the tensorflow android example: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android

Why can't TensorFlow handle 32 images in this benchmark?

My setup:

  1. up to date tensorflow (9c3043ff3bf31a6a81810b4ce9e87ef936f1f529), compiled from scratch
  2. K80 GPU with 12 GB memory

Here is the code to load the inception graph:

import tensorflow as tf

INPUT_SIZE = 224
OUTPUT_SIZE = 1024
GRAPH_NAME = 'inception'  # any import prefix works here

# input should be: BS x INPUT_SIZE x INPUT_SIZE x 3 tensor
# output: BS x OUTPUT_SIZE
def inferences(images):
    graph_def = tf.GraphDef()
    # the .pb file is a binary protobuf, so read it in binary mode
    with open('./tensorflow_inception_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    for n in graph_def.node:
        # clear the device so the caller controls placement
        n.device = ''
    tf.import_graph_def(graph_def, input_map={'input:0': images}, name=GRAPH_NAME)
    graph = tf.get_default_graph()
    output = graph.get_tensor_by_name(GRAPH_NAME + '/avgpool0:0')
    return tf.squeeze(output)

Given the big difference between 640 and 32, there must be something wrong, either on my side or in this benchmark. Because TensorFlow pre-allocates all memory, I don't know exactly how much memory is consumed.

@soumith @Yangqing Please help!

@raingo: when training we keep the activations for the lower layers to compute the gradients, so a lot of intermediate memory is used during each training step. When doing inference, you only need the activations around to compute the next operation(s), and then they can be freed, so a lot less intermediate state is needed.

Also, based on the comment in #66 (comment), it sounds like GoogleNet training with TF might now work for up to batch 64, but not batch 128. (I'd be surprised if batch 32 doesn't work at HEAD, for sure.)

@vrv Got it. Thanks!

@soumith They made some changes to TensorFlow ("Improve performance of Alexnet"). Can you update the benchmark for AlexNet?

@soumith

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128) [...] if my memory is right, it's really tight on space to do a batch size of 128 Googlenet in 12GB, and one needs in-place ops for sure.

For comparison, here are my measurements of approximate peak memory usage with Torch/cuDNNv3 on Titan-X:

AlexNet (128): 3 GB
OverFeat (128): 5 GB
VGG Model-A (128): OOM
GoogLeNet (128): 9 GB

VGG Model-A-11 (64): 8 GB
VGG Model-B-13 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-D-16 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-E-19 (64): 12 GB (I think this may fall back on slower algos due to tight memory)

VGG Model-A-11 (96): 11 GB

@soumith Since its release I've seen pretty dramatic improvements in TensorFlow's memory management and performance. I think it may be time to benchmark 0.6.0.

@alexatknit will do. I will take some time one of these days to do MXNet, Chainer and TF 0.6. I have been a bit busy lately wrapping up research.

I am looking forward to the updated comparison; have you found time to look into it?

Numbers for TensorFlow trunk as of 1 hour ago (post-0.6 release):

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| TensorFlow 0.5 | 326 | 96 | 230 |
| TensorFlow 0.6+ | 292 | 70 | 222 |

Overfeat [fast] - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow 0.5 | 1084 | 316 | 768 |
| TensorFlow 0.6+ | 856 | 204 | 652 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow 0.5 | 1840 | 545 | 1295 |
| TensorFlow 0.6+ | 1656 | 347 | 1309 |

GoogleNet V1 - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 431 | 117 | 313 |
| TensorFlow 0.5 | OOM | OOM | OOM |
| TensorFlow 0.6+ | 1237 | 246 | 991 |

There you go.
The new logs are all checked in.

@soumith Thanks for running the numbers again. I know you have been asked to do this a number of times lately and it takes you away from your research. These benchmarks have been greatly useful for everyone.

After your run we realized we seem to have regressed in performance since the 0.6.0 release (mostly from our switch over to the public Eigen branch), and over the last few days @zheng-xq and @benoitsteiner, along with others, have made improvements to win back the performance. When running the benchmarks again at commit d1b8333, we get the following numbers:

| Model | Total (ms) | Forward (ms) | Backward (ms) |
| --- | --- | --- | --- |
| AlexNet | 229 | 69 | 160 |
| Overfeat [fast] | 839 | 203 | 636 |
| OxfordNet | 1216 | 329 | 887 |
| GoogleNet V1 (Input 128x3x224x224) | 815 | 234 | 581 |
  • This is measured on an unsuperclocked Titan-X with the default power-limit 250W.
  • For consistency, between each run, we wait for a few minutes for GPU to cool down to room temperature.

These results are also in line with what we see at 0.6.0 release.

We are also looking into setting up performance benchmarks with the builds so we don't hit such performance regressions.

Again, Thanks for all your updates.

Does anyone have experience with and/or comparisons against DL4J (http://deeplearning4j.org)?

@rajatmonga just got back from vacay. It's cool that you guys are setting up contbuilds for perf regressions.

However, I don't get the numbers that you seem to be getting on TensorFlow as of yesterday (a27d844e05447e65aa279ae5269a2d75590f46f6). The numbers are slightly better, but not quite the improvement that you are seeing.

Look here for the new numbers: 1f09e1e

@soumith Thanks for running the benchmarks again. It is possible there are some memory-related regressions that are hurting performance again. What you have right now is good; let's not worry about this.

We are working on getting cuDNN R4 fully supported and will address the remaining performance issues in that context. We may ping this thread once we have a full release with R4, and it will be worthwhile rerunning the benchmarks, likely for many of the libraries.

Also, let me know if we can help you with this project in any way - it is very useful to the community, but I am sure it takes a lot of your time as well. Thanks for keeping this going!

Madder wrote:

Has anyone thought of running these benchmarks periodically as part of tensorflow's CI for instance?

Yes, that is on our list of tasks and is quite important to make sure we don't have performance regressions. We haven't been able to get to it yet.

TF 0.7.0 released!
Looking forward to the updated benchmarks.

πŸ‘ +1:

Great results 👍 👍 👍

Looking forward to the results with cuDNN v4

+1


As requested, TF 0.7 + CuDNN R4 has been benchmarked. CuDNN R4 + Torch has also been benchmarked as a baseline.

Among Nervana's Neon, Torch + CuDNN4 and TensorFlow + CuDNN4 (Caffe + CuDNN is likely in the same ballpark as Torch), TensorFlow (at commit tensorflow/tensorflow@1d4f00d) still lags behind the others by 2x to 3x in performance on AlexNet, VGG and GoogLeNet. It is within 1.5x on Overfeat.

For full details, see the main README.md: https://github.com/soumith/convnet-benchmarks/blob/master/README.md and the raw logs are located here: 2888b23