soumith / convnet-benchmarks

Easy benchmarking of all publicly accessible implementations of convnets

Benchmark TensorFlow

soumith opened this issue · comments

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the ImageNet winners.
When I saw issues with the numbers, memory, etc., I emailed @Yangqing to confirm that what I'm seeing is expected.

With that disclaimer out of the way, here are some things that you should know about TensorFlow (as of the pip version that I installed today):

  • in-place ReLU seems non-existent in practice.
    • Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
  • Supports CuDNN R2. No R3 support yet; Yangqing says the next version they are going to support is likely R4.

Coming to the benchmarks:

  • GoogLeNet with batch size 128 goes out of memory. The largest batch size I could fit is 16 (tried 16, 32, 64, 128).
  • VGG with batch size 64 goes out of memory (edit: the VGG memory issue was solved by using the BFC allocator updated by GOOG). The largest batch size I could fit is 32 (tried 32, 64).
  • I've also computed Torch7+CuDNN-R2 baselines for these batch sizes.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| TensorFlow | 326 | 96 | 230 |

Overfeat [fast] - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow | 1084 | 316 | 768 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow | 1840 | 545 | 1295 |

GoogleNet V1 - Input 16x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R2 (Torch) | 564 | 174 | 390 |
| TensorFlow | 590 | 54 | 536 |

Note that at a batch size of 16, GoogLeNet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison that isn't practically very interesting or encouraging.

There you go.

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

The benchmark scripts and raw outputs are located here: https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow

The lack of in-place operations is rather surprising. Once you have the full DAG, it should be fairly easy to apply a liveness algorithm to it to optimize tensor allocations. For an example see this: http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf (just replace "register" with "tensor").
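To make the liveness idea concrete, here is a minimal, framework-agnostic Python sketch (hypothetical data structures; this is not TensorFlow's or Torch's actual allocator): given a topologically sorted op list, compute each tensor's last use, then greedily hand buffers freed by dead tensors to later outputs.

    def plan_buffers(ops):
        """ops: topologically sorted list of (name, input_tensors, output_tensor, output_nbytes)."""
        last_use = {}
        for i, (_, inputs, _, _) in enumerate(ops):
            for t in inputs:
                last_use[t] = i

        size_of = {}     # tensor -> size in bytes
        assignment = {}  # tensor -> buffer id
        free = []        # (nbytes, buffer_id) pairs currently unused
        next_id = 0

        for i, (_, inputs, out, nbytes) in enumerate(ops):
            size_of[out] = nbytes
            # greedily reuse any free buffer that is large enough
            fit = next((b for b in free if b[0] >= nbytes), None)
            if fit is not None:
                free.remove(fit)
                assignment[out] = fit[1]
            else:
                assignment[out] = next_id
                next_id += 1
            # tensors whose last consumer is this op are now dead; recycle their buffers
            for t in inputs:
                if last_use.get(t) == i and t in assignment:
                    free.append((size_of[t], assignment[t]))
        return assignment

    ops = [('conv1', ['data'], 'act1', 4 << 20),
           ('relu1', ['act1'], 'act2', 4 << 20),
           ('conv2', ['act2'], 'act3', 4 << 20)]
    print(plan_buffers(ops))  # {'act1': 0, 'act2': 1, 'act3': 0} -- act3 reuses act1's buffer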

I'm kind of curious if there's any support for automatically compounding operations together, or for leveraging kernels that have some compounding built in (like the alpha/beta params of gemm). I'm pretty close to maximizing the amount of compounding that's possible in my benchmark networks. And because I write all my own kernels, I can further compound things that aren't possible with closed-source libraries like cuDNN. For example, I'm now able to compute the mean along the PQN dimension inside the conv and gemm kernels at no cost. This cuts down the bandwidth required by batch norm in fprop by a third.

Though on the whole I think TensorFlow seems like a great platform to build on. I'd say there's a good chance my kernels will make their way there sooner rather than later. You can find new benchmarks of my latest winograd kernels in the updated paper here: http://arxiv.org/abs/1509.09308

What I'll be working on next is basically going to be taking a lot of what I learned implementing winograd and refreshing all of my conv/pooling/gemm kernels to support very small minibatches at near full utilization. This should have a big impact on the level at which you can scale these networks and the speed at which they converge. Here's a great paper exploring this: http://arxiv.org/abs/1509.04210

Hi, I strongly recommend adding mxnet (https://github.com/dmlc/mxnet) to the comparison; in my opinion it may be the fastest DL library :)

+1 for benchmarking mxnet, the fastest now.

+1 for benchmarking mxnet

I would also love to see a comparison with Theano http://deeplearning.net/software/theano/ as it is another widely adopted deep learning library.

Thanks for benchmarking!

+1 would love to see tensorflow benchmarked against mxnet, Theano, Autograd for Torch, and Caffe.

Thanks @soumith! Yes, our only launch criterion for convnets was 'GoogLeNet within distance from CuDNN[R2]', and we've punted on a lot of performance work, including upgrading CuDNN, until after the initial release. Expect a lot of movement on that front in the coming weeks.

@aaronwro @fvisin it's already benchmarked against Torch, Theano, Caffe. Look at the readme on the main page ( https://github.com/soumith/convnet-benchmarks/blob/master/README.md ).
I definitely need to pull my socks up and benchmark MXNet and Chainer.

@vincentvanhoucke thanks for your response. I assumed that you'll fix it over the next weeks / months :)

@scott-gray let us know if you need help with compounding / graph rewriting. The graph representation is meant to make these kinds of operations possible, and the common subexpression elimination that TF currently uses is also meant as a demonstration of that. I suspect we might need to do a bit more to provide good APIs to make it easier to bake in compound kernels.

there seems to be some misinterpretation by random people in social media that because I work for Facebook, I'm attacking TensorFlow. That seems super weird, because I love the vision of TensorFlow, and there's no competition (one can write a XXX frontend for a TensorFlow backend).

My benchmarks have always been independently run and completely neutral. I've been running them forever now; it's sad that people misinterpret the slightest of things.
cc: @vincentvanhoucke

I will defend Soumith on this one: he has indeed been running these benchmarks for quite some time, and with complete neutrality.

@soumith Excellent, thank you!!

@soumith no good deed goes unpunished ;) Please don't let this deter you from providing this valuable service to the community!

@soumith , I am sorry that some people interpreted things that way. I've always appreciated your benchmark, which creates a great atmosphere for us to look at bottlenecks and push forward the field as a whole community. We all owe you a big debt of gratitude.

As always, that's super interesting. Thanks for pushing all of us toward more performance.

For memory optimizations, what we have found is that in-place optimization does not matter that much if the allocator is smart enough to do a static allocation before running the graph (as opposed to relying on a dynamic allocator). We have detailed what can be done here:

https://mxnet.readthedocs.org/en/latest/developer-guide/note_memory.html

I assume this applies to computation-graph frameworks such as TF, Caffe2 and CGT.
@vincentvanhoucke @Yangqing

The general idea is to share memory not only between tensors of the same shape (i.e. in-place), but also between tensors of different shapes and sizes.

@soumith Thanks for running the benchmarks! As @vincentvanhoucke noted in this thread, our goal was to get an early release out so users can start playing with it and provide feedback on what they care about. We are committed to making TensorFlow fast and are actively working on the performance issues you highlight here.

@soumith You're doing a good deed! Haters gonna hate.

I'm a little confused by the numbers. 1300 samples/sec seems too fast even for AlexNet on a single Titan X. Is this standard training, i.e. IO + forward + backward + update, or just forward + backward?

Nice work.

@piiswrong I will help @soumith make the benchmark script.

Anyway, we have opened up everything from the beginning. The main purpose is to learn from each other, not to advertise boring numbers.

I will also add my support to Soumith. He has been running these benchmarks for some time with complete transparency and neutrality.

@koraykv +1, thanks Soumith!

Someone on reddit suggested that I build TensorFlow from source to fix the speed issues. That did not help; it produces the same numbers as the pip version on my AlexNet script:

https://gist.github.com/soumith/11acc2f0dbc5212ea372

FWIW, Yangqing's fix to avoid CPU-GPU transfers improved results across the board by ~20%. (I've updated the tables above). The memory issues are unchanged.

+1 for mxnet! Thanks.

+1 for mxnet.

@soumith I have a naive question: is the TensorFlow result based on C++ code or on cuDNN v2? I would guess that if you run on a Titan X, TensorFlow will rely on some CUDA library?

@gujunli it's based on CuDNN V2.

@soumith thanks for running and maintaining these benchmarks; they're always thorough and informative!

@soumith Then I don't understand why TensorFlow with cuDNN v2 ends up being so slow. Can you share some of your understanding? I would guess TF still calls cuDNN v2 for the conv/pool/relu/FC layers. Remember from your earlier AlexNet results: cuDNN v2 is 231 = 70 + 161 and Caffe (native) ConvolutionLayer is 324 = 121 + 203, yet TensorFlow is 326 = 96 + 230.

Running the network under nvvp (the NVIDIA Visual Profiler) should be pretty informative. A well-tuned network timeline should just be a solid block of kernel calls with no gaps.

@scott-gray so you think TF scheduling may not be efficient? I need to read the TF whitepaper to understand how it works. Does anyone understand it?

@gujunli I'm just saying if they're just using stock cuDNNv2 then the only reason it would be slower is if there were gaps in the timeline. Seeing where those gaps occur and any extra host/device memcpy traffic would give you a clearer picture of what's going wrong.

@soumith Thanks for this and all the other previous benchmark you took the time to create.

+1 for MxNet

+1 for mxnet! Thank you so much!!!

@gujunli @scott-gray To provide some historical perspective: this is mostly due to legacy choices. Historically, Google Brain has been using the NHWC storage order and a slightly different padding scheme ("SAME/VALID" instead of an explicit padding number). CuDNN, as well as Caffe, uses NCHW order. Note that CuDNN supports NHWC interface-wise, but some underlying paths are not implemented, like NHWC convolution backward.

As a result, when calling cuDNN, there is some code that generates padded and order-switched intermediate tensors. That code was written with Eigen and did not interact very well with nvcc, causing nontrivial overhead (you can observe this by running the benchmark in an nvvp session as Scott suggested).

We have people looking into this, and the performance should be brought up to cuDNN level.

Gah, everyone's using different tensor layouts still. You all need to turn from the dark side and see the speed benefits of using CHWN. Though NHWC is probably better than NCHW at least. You want that inner dimension to be a nice even number to facilitate cleaner aligned memory access, leading to less over-fetch. CHWN gets you better contiguous memory access overall. In recurrent networks with model parallelism, having N as the outer dim definitely helps, but most distributed convnets are data parallel, where it doesn't matter.

I have some very fast shared memory dimshuffle code if you want it. I use it to make this operation on the filters:

# C <=> K and mirror R,S
F = np.transpose(F[:,::-1,::-1,:], (3,1,2,0))

Turns out a kernel for fprop_conv can work with very little change (or no change if padding and striding are symmetric) to be a kernel for bprop_conv. There's almost no overhead in the dimshuffle since the filters are so small and you completely avoid any atomic adds.

Krizhevsky first demonstrated the benefits of using the CHWN layout in cuda-convnet. In addition to being advantageous for convolutional kernels, it's very beneficial for models like GoogLeNet where inception modules concatenate activations along feature-map depth. Using CHWN allows you to write directly into an output buffer in the layout that the subsequent layer will consume: (C1 + C2 + C3)HWN.
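A small NumPy sketch of that concatenation point, with made-up shapes (illustrative only): in CHWN the channel axis is outermost, so each inception branch can write its slice of a pre-allocated output buffer directly, whereas with N outermost a separate concat copy is needed.

    import numpy as np

    H = W = 28; N = 128                   # spatial size and batch (illustrative)
    C1, C2, C3 = 64, 96, 32               # feature maps of three inception branches

    # CHWN: the concatenated output is one buffer; each branch writes into its
    # own contiguous slice along C, so no extra copy is needed afterwards.
    out = np.empty((C1 + C2 + C3, H, W, N), dtype=np.float32)
    branch1, branch2, branch3 = out[:C1], out[C1:C1 + C2], out[C1 + C2:]
    branch1[...] = 0.1                    # stand-ins for each branch's conv output
    branch2[...] = 0.2
    branch3[...] = 0.3

    # NCHW: the concat axis is not outermost, so the branch outputs end up
    # interleaved per image and an explicit concatenation pass is required.
    nchw = [np.zeros((N, c, H, W), dtype=np.float32) for c in (C1, C2, C3)]
    merged = np.concatenate(nchw, axis=1)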

Thanks @scott-gray - having the dimshuffle kernels to improve performance will be great.

One potential issue with CHWN is that at inference time N is often small, so there are two different sets of optimizations to be carried out for large N and small N. NCHW/NHWC usually makes things a bit more batch-agnostic, but that's not always true, of course.

@soumith Regarding the memory issue, we found that if one turns on the best-fit GPU allocator, you are able to run VGG with a batch size of 64. I made a quick change if you would like to build and try from source:

git clone https://github.com/Yangqing/tensorflow.git
cd tensorflow
git checkout bfc

There will be more fixes from @vrv to enable it more easily (such as at session creation time) down the road.

@Yangqing The shuffle code is here (note that this does not do the RS mirror operation):
https://github.com/NervanaSystems/neon/blob/master/neon/backends/float_ew.py#L1481

It uses magic numbers for fast integer division. Here's the code that sets up the kernel params:
https://github.com/NervanaSystems/neon/blob/master/neon/backends/layer_gpu.py#L504

The code is adapted from here (the diagram will be helpful):
http://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/

It's on my list of things to do to generalize it and make it available as a flexible backend operation. But I haven't gotten to it. Theano may also have some good dimshuffle code you can borrow.

Also, most of the code in that float_ew file is devoted to automatically generating extremely efficient compound elementwise/reduction/broadcast/transpose/take operations. It allows you to write complex numpy expressions and have them compile to a single CUDA kernel. It even does common sub-expression removal, but it sounds like you already have that. This all works off of little optrees that exist in layer code. But I've been meaning to find a way to collect the full program DAG in a clean way. It seems like you guys solved that, and that's why I'm interested in TensorFlow. There's so much burden you can shift from the programmer and have automatically optimized via graph traversals.

+1 for mxnet. Dynamic GPU memory allocation does have a big impact on performance. A simple memory allocator can significantly reduce the overhead. A smarter allocator which reuses blocks with best-fit can almost eliminate the overhead completely.
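As a rough illustration of what a caching best-fit allocator buys over hitting the driver on every allocation, here is a toy Python sketch (hypothetical raw_alloc backend; no splitting, coalescing or alignment handling, unlike a real allocator such as TensorFlow's BFC):

    import bisect

    class BestFitPool:
        """Toy caching allocator: free() returns blocks to a pool and alloc()
        reuses the smallest cached block that fits, only falling back to the
        slow raw allocator on a miss."""
        def __init__(self, raw_alloc):
            self.raw_alloc = raw_alloc
            self.free_blocks = []                 # sorted list of (size, handle)

        def alloc(self, size):
            i = bisect.bisect_left(self.free_blocks, (size,))
            if i < len(self.free_blocks):         # best fit: smallest block >= size
                return self.free_blocks.pop(i)[1]
            return self.raw_alloc(size)           # slow path (think cudaMalloc)

        def free(self, size, handle):
            bisect.insort(self.free_blocks, (size, handle))

    pool = BestFitPool(raw_alloc=lambda size: bytearray(size))
    buf = pool.alloc(4 << 20)    # miss: goes to the raw allocator
    pool.free(4 << 20, buf)
    buf2 = pool.alloc(3 << 20)   # hit: reuses the cached 4 MB block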

@soumith I just pushed tensorflow/tensorflow@1d76583, which should allow you to use our best-fit-with-coalescing allocator via the ConfigProto.

Example usage here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/alexnet/alexnet_benchmark.py#L201

We were able to get some of the larger batch sizes working with the BFC Allocator, so probably worth a try.

(We plan to make the BFC allocator the default soon, but it's not fully ready yet to be the default).
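If I'm reading the linked alexnet_benchmark.py example right, opting in looks roughly like this with the 0.6-era Python API (a sketch; the exact plumbing may change once BFC becomes the default):

    import tensorflow as tf

    # Ask for the best-fit-with-coalescing allocator via the session config.
    config = tf.ConfigProto()
    config.gpu_options.allocator_type = 'BFC'

    with tf.Session(config=config) as sess:
        # build and run the model as usual; here just a trivial op
        print(sess.run(tf.constant('BFC allocator enabled')))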

Thanks a lot @soumith for the numbers, super useful!

The creators of cuDNN [1] may help with the performance optimization. @bryancatanzaro

[1] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer. cuDNN: Efficient Primitives for Deep Learning. arXiv preprint arXiv:1410.0759, 2014.

hjk41 wrote:

+1 for mxnet. Dynamic GPU memory allocation does have a big impact on performance. A simple memory allocator can significantly reduce the overhead. A smarter allocator which reuses blocks with best-fit can almost eliminate the overhead completely.

Hi mxnet guys, the title of this thread is 'Benchmark TensorFlow' ;-) I think you could create a new issue to request mxnet, using https://github.com/soumith/convnet-benchmarks/issues/new

[edit: looks like Soumith has created an issue for mxnet here: #68]

I was curious about the performance of TensorFlow using CUDA 7.5 and cuDNN 7.0. I modified the build to use them and rebuilt from source. I then ran @soumith's benchmark scripts for AlexNet and Overfeat.
My PC is getting old (Intel Core 2 quad), 16 GB RAM, NVIDIA Titan X, Ubuntu 15.04 x86_64

Alexnet: forward/backward 290ms, forward 78ms. (1.12x improvement for f/b)
Overfeat: forward/backward 1040ms, forward 264ms (1.04x improvement for f/b).

So not much of a speedup by only swapping the CUDA libraries.

@soumith, one thing I did notice is that your benchmarks for Caffe are quite different from what I got using CUDA 7.5. I think the benchmarks you used are with CUDA 7.0, right? When I ran your "run_imagenet.sh" script on my setup, I got much better results.
Alexnet: forward/backward 171ms, forward 41ms ( 1.89x improvement for F/B)
Overfeat: forward/backward 601ms, forward 133ms ( 1.37x improvement)
Googlenet: F/B 624ms, F 174ms ( 3.1x improvement)

It's not clear to me if CUDA 7.5 is supported for Caffe, but at https://github.com/BVLC/caffe/wiki/Installation they provide 7.0 and 7.5 Docker images. However, in the main installation instructions they say to use 7.0.

I attached the log files for his benchmarks running on my PC.

Tensorflow:

tensorflow_alexnet.txt
tensorflow_overfeat.txt

Caffe:

output_alexnet.txt
output_googlenet.txt
output_overfeat.txt
output_vgg_a.txt

@soumith Thanks for an invaluable community service!

Nice!

@soumith Thank you for benchmarking!!

@scott-gray Sorry to spam @soumith's TF discussions, but when I last played with integer division via magic-number mul-and-shift on GPU, the performance I got (on a K40 though) was about the same as straightforward division by an unsigned int32; the compiler seemed to perform strength reductions in this case. However, there was lower register usage, so using this technique in a kernel that actually does other things (like transposition) would probably help. This was at the SASS level on CUDA 6.5, though.

facebookarchive/fbcuda@d5c8b38

The compiler does a good job when the constant is known in advance. Was that the case for you?

Nice! Thank you @soumith

@wickedfoo The advantage of calculating the magic numbers manually is that the divisor is typically parameterized, so the compiler can't compute magic numbers ahead of time. It then falls back to using the floating-point rcp operator and doing a bunch of corrections to make up for the potentially shorter mantissa of float (23 vs 32 bits).

To do an integer division and modulus with magic numbers reduces to just this code:

// j   = jrst / RST
// rst = jrst % RST
int j   = jrst * magic_RST; j >>= shift_RST;
int rst = jrst - j * RST;

If you know all those numbers fit in 16 bits, you can use vmad from PTX or SASS. That looks like this:

VMAD.U16.U16 j, jrst, magic_RST, RZ;
SHR.U32      j, j, shift_RST;
VMAD.U16.U16 rst, -j, RST, jrst;

Otherwise your multiplications are going to expand out to 3 XMADs each, regardless of the datatype used. It would be nice if the compiler were a little smarter about multiplication by using the minimal number of instructions for the given data types.
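For reference, the magic/shift pairs themselves are typically precomputed on the host. Here is a Python sketch of the standard unsigned-division construction (Hacker's Delight style, similar in spirit to what Neon's backend does; details may differ from any particular implementation):

    def magic32(nmax, d):
        """Find (magic, shift) so that n // d == (n * magic) >> shift for 0 <= n <= nmax."""
        nc = ((nmax + 1) // d) * d - 1
        for p in range(0, 2 * nmax.bit_length() + 1):
            if 2 ** p > nc * (d - 1 - (2 ** p - 1) % d):
                magic = (2 ** p + d - 1 - (2 ** p - 1) % d) // d
                return magic, p   # the kernel also needs magic to fit its mul width
        raise ValueError("no magic number found")

    # quick self-check against Python's integer division
    RST = 75                                  # e.g. C=3 filters of 5x5 -> RST = 75
    magic_RST, shift_RST = magic32(1 << 20, RST)
    for jrst in (0, 1, RST - 1, RST, 12345, 1 << 20):
        j = (jrst * magic_RST) >> shift_RST   # j   = jrst / RST
        rst = jrst - j * RST                  # rst = jrst % RST
        assert (j, rst) == divmod(jrst, RST)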

For larger values that might require 64 bit math, I use something like this:

      MOV  magicPQ,    param_magic_PQ;
      IADD negPQ, RZ, -param_grid_PQ;

      ISETP.NE.AND P1, PT, magicPQ, 1, PT;

      // m = blkMPQ / PQ
  @P1 XMAD     div1, blkMPQ,    magicPQ,    RZ;
  @P1 XMAD     div2, blkMPQ,    magicPQ.H1, RZ;
  @P1 XMAD     div3, blkMPQ.H1, magicPQ.H1, RZ;
  @P1 XMAD.CHI div1, blkMPQ.H1, magicPQ,    div1;
  @P1 IADD3.RS m, div1, div2, div3;
  @P1 SHR.U32  m, m,      param_shift_PQ;
 @!P1 SHR.U32  m, blkMPQ, param_shift_PQ;

      // pq = blkMPQ % PQ
      XMAD pq, negPQ, m, blkMPQ;
      XMAD.PSL pq, negPQ.H1, m, pq;

Integer division is essential for these multi-dimensional tensors where you can't fit everything in just 3 block coordinates. For more advanced uses, you can leverage it to pack all your coordinates into a single blockIdx.x value, then completely remap the order in which the indexes are scheduled. I'm able to achieve 95% L2 hit rates using this in my winograd kernels. This is essential for good performance as the small 32x32 batched gemm tile is pretty high bandwidth.

Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.

@scott-gray TensorFlow already uses fast integer division using the code in http://github.com/tensorflow/tensorflow/blob/master/third_party/eigen3/unsupported/Eigen/CXX11/src/Tensor/TensorIntDiv.h. One of the issues is that a lot of the TensorFlow kernels use 64-bit integers to index tensors, which ends up slowing things down on the GPU. This is being fixed.

karenyyng wrote:

Is there any benchmark showing the predictive performance? Computing fast but inaccurate predictions does not seem useful.

Ideally, they are all learning the exact same model, so the outputs should be identical (to within the bounds of rounding accuracy). A correctness check is not a bad idea though.
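A minimal version of such a correctness check, assuming you can dump the outputs of two implementations (run on identical inputs and weights) as NumPy arrays:

    import numpy as np

    def check_outputs(out_a, out_b, rtol=1e-3, atol=1e-4):
        # fp32 convolutions accumulate rounding differently depending on the
        # algorithm and summation order, so compare with a tolerance, not bit-exactly.
        assert out_a.shape == out_b.shape
        print('max abs diff:', np.abs(out_a - out_b).max())
        return np.allclose(out_a, out_b, rtol=rtol, atol=atol)

    # e.g. with arrays dumped from two frameworks (hypothetical file names):
    # assert check_outputs(np.load('fprop_torch.npy'), np.load('fprop_tensorflow.npy'))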

@vrv thanks a lot, trying out the BFC allocator now for vgg and googlenet models.

@ces-bertino the numbers I have with Caffe are with Caffe's native kernels (that's why the entry is marked as "Caffe (native)"). I presume you have CuDNN, and hence see the speedups. To compare your numbers, look at the entry marked CuDNN.

@vrv updated the table for VGG.
GoogLeNet still goes OOM at batch size 128, but if my memory is right, it's really tight to fit GoogLeNet at batch size 128 in 12 GB, and one needs in-place ops for sure.

@Yangqing, @benoitsteiner do you have a sense for how performance for these benchmarks depends on nvcc vs gpucc? Are the 10%-50% numbers in http://llvm.org/devmtg/2015-10/slides/Wu-OptimizingLLVMforGPGPU.pdf for ic1/ic2 applicable here?

@ajtulloch gpucc does hide these latency issues vs nvcc as it seems to do a much better job at optimization. Using gpucc brings TensorFlow pretty close to the cuDNN[R2] numbers for AlexNet.
We are working on bridging that gap for nvcc by addressing a number of specific issues that @benoitsteiner and @Yangqing mentioned earlier.

@ajtulloch The 2 main reasons why gpucc can generate faster code than nvcc are:

  • The fact that gpucc can replace 64-bit integer divisions with 32-bit divisions if the values stored in the 64-bit integers can actually fit in 32 bits. As we update the TensorFlow convolution kernels to use 32-bit indices, the performance of the code generated by nvcc will start to approach that of the code generated by gpucc.
  • The fact that clang supports C++11 constant expressions much better than nvcc. Constant expressions allow us to generate much more efficient CUDA kernels. Unfortunately, for the time being we have to disable this feature since the corresponding code doesn't compile with nvcc. I am rewriting the corresponding code to make it compatible with nvcc 7.5, and hopefully with nvcc 7.0 as well.

@rajatmonga, @benoitsteiner that makes sense, thanks for that.

@benoitsteiner I'm curious how you guys are using integer division in your implementation. The only places I find a need to use it are in custom kernels where I'm unpacking multiple coordinates from a compound index.

On a related note, I should mention that I also have another simple technique I developed for when you don't know ahead of time the value of the divisor. It looks something like this:

// rcpRST = 1 / RST
I2F.F32.S32 rcpRST, RST;
MUFU.RCP rcpRST, rcpRST;

// c = crst / RST
I2F.F32.S32 crst, crst;
FMUL c, crst, rcpRST;
FFMA c, c, 5.9604644775390625e-08, c;
F2I.S32.F32.TRUNC c, c;

// rst = crst % RST
VMAD.U16.U16 rst, -c, RST, crst;

For most values the floating point reciprocal gets you the correct value. It's just when the numerator and denominator are very close that you need to correct for the missing precision in float32. This is a lot less code than the compiler would generate and is accurate for the range of values I need it for.
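The same trick written out in Python, emulating float32 with NumPy (a sketch only: MUFU.RCP is a lower-precision approximation than a correctly rounded divide, and as noted above the result is only guaranteed for a limited range of numerators):

    import numpy as np

    def fast_divmod(crst, RST):
        rcp = np.float32(1.0) / np.float32(RST)          # reciprocal (MUFU.RCP analogue)
        c = np.float32(crst) * rcp
        c = c + c * np.float32(5.9604644775390625e-08)   # bump by ~2**-24 to fix exact multiples
        c = int(c)                                       # truncate (F2I ... .TRUNC)
        return c, crst - c * RST

    # spot-check against exact integer divmod over a modest range
    bad = [n for n in range(1 << 16) if fast_divmod(n, 75) != divmod(n, 75)]
    print('mismatches in [0, 2**16):', len(bad))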

are in custom kernels where I'm unpacking multiple coordinates from a compound index

Somewhat related: a question I've been wondering about and never quite got around to measuring: if we have two 8-bit integers, is it faster to store them in separate registers/variables, or to pack them into one register/variable using bit-shifting? (Edit: I suppose this is a bit vague, since it entirely depends on how they're being used... but the trade-off I'm thinking about is: packing multiple values into a few registers will reduce register pressure, but maybe the increase in processing time from all the bit-shifting offsets any benefit?) (Edit 2: I suppose what I mean is, are there any best practices/guidelines as far as this goes?)

If you have enough registers, do not pack the 8-bit numbers and use one register per element. Now, how do we define "enough registers"? Well, if the occupancy you get allows you to have enough warp parallelism (together with enough instruction level parallelism) to cover the latencies, you are good. In general, unless you have a clear use case, do not pack.

That would completely depend on the context in which you are using them. If you're short on register space, packing them might avoid some register spilling. Otherwise it's probably better to keep them separate. I'd also take a look at the video instructions like VMAD, VADD, VABSDIFF, etc. These can operate directly on packed 8-bit values, but in this mode these instructions are unfortunately only half throughput. Maybe this isn't a big deal for your application, but if you wanted to write a super efficient 8-bit gemm core, they're not ideal. These instructions are full throughput with packed 16-bit values, and that is very interesting... at least until Pascal rolls out with native fp16 support (or if you get hold of an sm_53 X1).

Looks like @jdemouth beat me to it.

Thanks! :-)

@scott-gray We use integer division in order to extract the individual coordinates of a tensor coefficient from its compound index. We often use compound indices for 2 reasons:

  • they are independent from the rank and the shape of the tensor. This simplifies the fusion of primitive tensor operations. For example, if you reshape a 4D tensor into a 3D tensor all the coordinates need to be adjusted, but the compound indices remain the same.
  • they save registers compared to using individual coordinates. This often makes a significant difference on CPUs which don't have nearly as many registers as GPUs.

@benoitsteiner Ok, that makes sense now. For basic elementwise operations our backend just automatically reshapes all tensors involved in the kernel to the most efficient 2d shape. For broadcast/reduction/take/transpose type operations, it only currently supports those in 2d and requires the user to reshape things prior to performing those ops. This covers 99% of the use cases we've encountered but it sometimes does place a little extra burden on the user. On the other hand it is extremely fast. Sounds like you guys are shooting for much more general ndarray support in which case what you're doing sounds ideal.

Disclaimer: I am totally new to tensorflow and cudnn, so I may not know what I am doing, but I'm very keen :)

So I built from source, then realised that I already had R3 installed; I did what any other sensible person would do and replaced all R2 references with R3, and all seems well as far as running the included models goes.

@soumith @Yangqing, am I setting myself up for trouble here? One word will suffice :)

@milijan You should be running fine. R3 seems to be binary compatible in the sense that most of the functions in R2 still exist in R3. I think R4 may break such a hack because it will deprecate a few functions.

In case you are wondering, the reason you are not seeing any speedup by going to R3 may be as follows: in TensorFlow we hard-code the cuDNN algorithm to NO_WORKSPACE, so some faster convolution paths are not being selected for now. Upcoming changes should further speed things up.

@Yangqing thanks! 👍

A question about the GoogleNet batch size.

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128)

I can use up to 640 images per batch, using the graph from the tensorflow android example: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/examples/android

Why can't TensorFlow handle 32 images in this benchmark?

My setup:

  1. up to date tensorflow (9c3043ff3bf31a6a81810b4ce9e87ef936f1f529), compiled from scratch
  2. K80 GPU with 12 GB memory

Here is the code to load the inception graph:

import tensorflow as tf

INPUT_SIZE = 224
OUTPUT_SIZE = 1024
GRAPH_NAME = 'inception'  # any import prefix works here

# input should be: BS x INPUT_SIZE x INPUT_SIZE x 3 tensor
# output: BS x OUTPUT_SIZE
def inferences(images):
    graph_def = tf.GraphDef()
    # the .pb file is a binary protobuf, so read it in binary mode
    with open('./tensorflow_inception_graph.pb', 'rb') as f:
        graph_def.ParseFromString(f.read())
    for n in graph_def.node:
        # clear the device so the caller controls placement
        n.device = ''
    tf.import_graph_def(graph_def, input_map={'input:0': images}, name=GRAPH_NAME)
    graph = tf.get_default_graph()
    output = graph.get_tensor_by_name(GRAPH_NAME + '/avgpool0:0')
    return tf.squeeze(output)

Given the big difference between 640 and 32, there must be something wrong, either on my side or in this benchmark. Because TensorFlow pre-allocates all memory, I don't know exactly how much memory is consumed.

@soumith @Yangqing Please help!

@raingo: when training we keep the activations for the lower layers to compute the gradients, so a lot of intermediate memory is used during each training step. When doing inference, you only need the activations around to compute the next operation(s), and then they can be freed, so a lot less intermediate state is needed.

Also, based on the comment in #66 (comment), it sounds like GoogleNet training with TF might now work for up to batch 64, but not batch 128. (I'd be surprised if batch 32 doesn't work at HEAD, for sure.)

@vrv Got it. Thanks!

@soumith They made some changes to TensorFlow ("Improve performance of Alexnet"). Can you update the benchmark for AlexNet?

@soumith

Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128) [...] if my memory is right, it's really tight on space to do a batch size of 128 Googlenet in 12GB, and one needs in-place ops for sure.

For comparison, here are my measurements of approximate peak memory usage with Torch/cuDNNv3 on Titan-X:

AlexNet (128): 3 GB
OverFeat (128): 5 GB
VGG Model-A (128): OOM
GoogLeNet (128): 9 GB

VGG Model-A-11 (64): 8 GB
VGG Model-B-13 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-D-16 (64): 12 GB (I think this may fall back on slower algos due to tight memory)
VGG Model-E-19 (64): 12 GB (I think this may fall back on slower algos due to tight memory)

VGG Model-A-11 (96): 11 GB

@soumith Since its release I've seen pretty dramatic improvements in TensorFlow's memory management and performance. I think it may be time to benchmark 0.6.0.

@alexatknit will do. I will take some time one of these days to do MXNet, Chainer and TF 0.6. I have been a bit busy lately wrapping up research.

I am looking forward to the updated comparison; have you found time to look into it?

Numbers for TensorFlow trunk as of 1 hour ago (post-0.6 release):

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| TensorFlow 0.5 | 326 | 96 | 230 |
| TensorFlow 0.6+ | 292 | 70 | 222 |

Overfeat [fast] - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow 0.5 | 1084 | 316 | 768 |
| TensorFlow 0.6+ | 856 | 204 | 652 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow 0.5 | 1840 | 545 | 1295 |
| TensorFlow 0.6+ | 1656 | 347 | 1309 |

GoogleNet V1 - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| --- | --- | --- | --- |
| CuDNN-R3 (Torch) | 431 | 117 | 313 |
| TensorFlow 0.5 | OOM | OOM | OOM |
| TensorFlow 0.6+ | 1237 | 246 | 991 |

There you go.
The new logs are all checked in.

@soumith Thanks for running the numbers again. I know you have been asked to do this a number of times lately and it takes you away from your research. These benchmarks have been greatly useful for everyone.

After your run we realized we seem to have regressed in performance since the 0.6.0 release (mostly from our switch over to the public Eigen branch), and over the last few days @zheng-xq and @benoitsteiner, along with others, have made improvements to win back the performance. When running the benchmarks again at commit d1b8333, we get the following numbers:

| Model | Total (ms) | Forward (ms) | Backward (ms) |
| --- | --- | --- | --- |
| AlexNet | 229 | 69 | 160 |
| Overfeat [fast] | 839 | 203 | 636 |
| OxfordNet | 1216 | 329 | 887 |
| GoogleNet V1 (Input 128x3x224x224) | 815 | 234 | 581 |
  • This is measured on an unsuperclocked Titan-X with the default power-limit 250W.
  • For consistency, between each run, we wait for a few minutes for GPU to cool down to room temperature.

These results are also in line with what we see at 0.6.0 release.

We are also looking into setting up performance benchmarks with the builds so we don't hit such performance regressions.

Again, Thanks for all your updates.

Does anyone have experience with and/or comparisons against DL4J (http://deeplearning4j.org)?

@rajatmonga just got back from vacay. It's cool that you guys are setting up contbuilds for perf regressions.

However, I don't get the numbers that you seem to be getting on TensorFlow as of yesterday (a27d844e05447e65aa279ae5269a2d75590f46f6). The numbers are slightly better, but not quite the improvement that you are seeing.

Look here for the new numbers: 1f09e1e

@soumith Thanks for running the benchmarks again. It is possible there are some memory-related regressions that are hurting performance again. What you have right now is good; let's not worry about this.

We are working on getting cuDNN R4 fully supported and will address the remaining performance issues in that context. We may ping this thread once we have a full release with R4, and it will be worthwhile rerunning the benchmarks, likely for many of the libraries.

Also, let me know if we can help you with this project in any way - it is very useful to the community, but I am sure it takes a lot of your time as well. Thanks for keeping this going!

Madder wrote:

Has anyone thought of running these benchmarks periodically as part of tensorflow's CI for instance?

Yes, that is on our list of tasks and is quite important to make sure we don't have performance regressions. We haven't been able to get to it yet.

TF 0.7.0 released!
Looking forward to the updated benchmarks.

πŸ‘ +1:

Great results 👍 👍 👍

Looking forward to the results with cuDNN v4

+1


As requested, TF 0.7 + CuDNN R4 has been benchmarked. CuDNN R4 + Torch has also been benchmarked as a baseline.

Among Nervana's Neon, Torch + CuDNN4 and TensorFlow + CuDNN4 (Caffe + CuDNN is likely in the same ballpark as Torch), TensorFlow (at commit tensorflow/tensorflow@1d4f00d) still lags behind the others by 2x to 3x in performance on AlexNet, VGG and GoogLeNet. It is within 1.5x on Overfeat.

For full details, see the main README.md: https://github.com/soumith/convnet-benchmarks/blob/master/README.md and the raw logs are located here: 2888b23