koba-jon / pytorch_cpp

Deep Learning sample programs using PyTorch in C++

Training in C++?

meet-minimalist opened this issue

Hey, you have shown some amazing work training NNs in C++.
I would like to know the reasons why you started training models in C++ instead of Python. Once the model definition has been written in PyTorch in Python and the data pipeline has been set up, all the heavy computation is done on the GPU anyway, so there shouldn't be drastic performance gains from migrating from Python to C++. Please share some of your thoughts on this.

I was interested in implementing NN programs in C++, and I wanted to improve my C++ coding ability, so I decided to write this code.
However, I have also investigated how much the speed differs between Python and C++.

I found a strange result: there are cases in which training runs faster in Python than in C++.
Here, the batch size, image size, and almost all other components are matched between Python and C++.
The NN model used for training and testing in C++:
https://github.com/koba-jon/pytorch_cpp/tree/master/Dimensionality_Reduction/AE2d
My article for details (Japanese only):
https://qiita.com/koba-jon/items/274e5e4970da72216f73

CPU: Core i7-8700, GPU: GeForce GTX 1070

|  | Python (CPU only) | C++ (CPU only) | Python (GPU) | C++ (GPU, non-deterministic cuDNN) | C++ (GPU, deterministic cuDNN) |
| --- | --- | --- | --- | --- | --- |
| Training time [time/epoch] | 1h04m49s | 1h03m00s | 5m53s | 7m42s | 17m36s |
| GPU memory [MiB] | 2 | 9 | 933 | 913 | 2941 |
| Testing speed [seconds/data] | 0.01189 | 0.01477 | 0.00102 | 0.00101 | 0.00101 |

Training in C++ is slow when the GPU is used.
I have identified that the cause of the delay is in the "forward" and "backward" parts, so it is not in the code I wrote.
The GPU runs are still faster than the CPU-only runs, so CUDA does seem to be in use.
But I guess the way PyTorch drives the GPU may differ between Python and C++.
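
For reference, the two cuDNN modes compared in the table are toggled through the global context on the C++ side. This is only an illustrative sketch (not the exact settings code used for these measurements); it mirrors `torch.backends.cudnn.deterministic` and `torch.backends.cudnn.benchmark` in Python:

```cpp
#include <torch/torch.h>

// Sketch only: switch between the "deterministic(cudnn)" and
// "non-deterministic(cudnn)" configurations compared in the table above.
void configure_cudnn(bool deterministic) {
    // Deterministic mode forces reproducible (and often slower) cuDNN algorithms.
    at::globalContext().setDeterministicCuDNN(deterministic);
    // Benchmark mode lets cuDNN auto-tune the fastest algorithm for each input
    // shape; it is typically turned off when determinism is required.
    at::globalContext().setBenchmarkCuDNN(!deterministic);
}
```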

I have heard reports that C++ is faster when training only fully connected layers.
I plan to investigate this around March.

Huge thanks for these interesting insights. I suspect the reason for the high training time in C++ could be an under-optimized data pipeline, or CPU-to-GPU and GPU-to-CPU data copies that are serialized instead of overlapped as parallel async copies. Still, further investigation might help. I was also curious about this, since many people used to train models in C++, and I wonder what on earth forced them to do that. :-P
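
To illustrate what I mean by a parallel async copy, the pattern is roughly the sketch below (placeholder names, not code from this repository; I am assuming the module has already been moved to the GPU). The source CPU tensor has to be pinned, and the copy is issued with `non_blocking` so the host thread does not wait for it:

```cpp
#include <torch/torch.h>

// Rough sketch of an asynchronous host-to-device transfer.
void consume_batch(torch::nn::Linear &net, const torch::Tensor &cpu_batch) {
    torch::Device device(torch::kCUDA);

    // Pin the host tensor, then issue a non-blocking host-to-device copy.
    // The host thread returns immediately instead of waiting for the copy,
    // so it can start preparing the next batch in the meantime.
    torch::Tensor gpu_batch =
        cpu_batch.pin_memory().to(device, /*non_blocking=*/true);

    // Work queued afterwards on the same CUDA stream still sees the data in
    // order, so the forward pass can be launched right away.
    torch::Tensor out = net->forward(gpu_batch);
    (void)out;
}
```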

This is a follow-up report.

I benchmarked again using the following three kinds of neural networks in PyTorch v1.8.0.
I measured the training speed in iterations per second.

My article for details (Japanese only): https://qiita.com/koba-jon/items/59a64c6ec38ac7286d6b
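
The exact measurement code is described in the article; roughly, it comes down to a loop like this sketch (the training step body is omitted and is where the model-specific work would go):

```cpp
#include <chrono>
#include <iostream>

// Rough illustration of measuring iterations per second; the actual
// measurement code used for the numbers below is in the linked article.
int main() {
    constexpr int iterations = 100;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        // one training iteration: forward, backward, optimizer step
    }
    // NOTE: when the GPU is used, the device must be synchronized before
    // stopping the clock, because CUDA kernel launches are asynchronous.
    auto end = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    std::cout << iterations / seconds << " iterations/s" << std::endl;
}
```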

1. Only Fully Connected Layers
   Model: AE1d

   |  | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cuDNN deterministic | GPU (GeForce GTX 1070), cuDNN non-deterministic |
   | --- | --- | --- | --- |
   | Python [iterations/s] | 86.83 | 97.69 | 97.69 |
   | C++ [iterations/s] | 312.6 | 312.6 | 312.6 |
   | Speed up (Python -> C++) | ×3.6 | ×3.2 | ×3.2 |

2. Only Convolutional Layers
   Model: Discriminator

   |  | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cuDNN deterministic | GPU (GeForce GTX 1070), cuDNN non-deterministic |
   | --- | --- | --- | --- |
   | Python [iterations/s] | 5.24 | 27.59 | 39.08 |
   | C++ [iterations/s] | 4.51 | 26.8 | 36.08 |
   | Speed up (Python -> C++) | ×0.86 | ×0.97 | ×0.92 |

3. Convolutional and Transposed Convolutional Layers
   Model: AE2d

   |  | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cuDNN deterministic | GPU (GeForce GTX 1070), cuDNN non-deterministic |
   | --- | --- | --- | --- |
   | Python [iterations/s] | 1.14 | 9.56 | 14.39 |
   | C++ [iterations/s] | 1.05 | 9.16 | 13.44 |
   | Speed up (Python -> C++) | ×0.92 | ×0.96 | ×0.93 |

As shown above, compared to before, the speed of "AE2d" in C++ has improved a lot.
However, it still couldn't beat the execution speed of Python.

Looking at the details, it seems that the convolutional and transposed convolutional layers are the problem.
However, in the case of only fully connected layers, the execution speed in C++ is much faster than in Python.
That alone may make training in C++ worthwhile.

I look forward to future improvements in the PyTorch C++ API for models 2 and 3.

Thanks a lot for such detailed experiments. One more thing I would like to share that I recently discovered: when transferring training data from RAM to GPU memory, people generally use pinned memory, a designated (page-locked) area of RAM from which copies into GPU memory are faster. I have seen this while working with TensorRT-related operations in C++, where the input tensor is allocated in the pinned memory area, and once the data is in this pinned memory, a memcpy command copies it from there to the GPU for further computation. This may give you some further boost in your C++ timings.

PS: Please create a paper or Medium article of your findings alongside the qiita.com blog, because the world needs to know about them. Keep it up.

Thank you for sharing this information.
I will follow the page below and try to improve the "dataloader" class:
https://pytorch.org/docs/stable/data.html#memory-pinning
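
As a first idea, the change might look roughly like the sketch below. This is not the actual "dataloader" code in this repository, and `to_device_via_pinned` is just a hypothetical helper: the batch is staged in pinned host memory, as in the TensorRT pattern you described, and then copied to the GPU with a non-blocking transfer.

```cpp
#include <torch/torch.h>

// Sketch only: stage a CPU batch in pinned (page-locked) memory and issue an
// asynchronous host-to-device copy from there.
torch::Tensor to_device_via_pinned(const torch::Tensor &cpu_batch,
                                   const torch::Device &device) {
    // Page-locked staging buffer; a real loader would allocate this once and
    // reuse it across batches instead of re-allocating it on every call.
    torch::Tensor staging = torch::empty(
        cpu_batch.sizes(), cpu_batch.options().pinned_memory(true));

    staging.copy_(cpu_batch);                          // pageable -> pinned
    return staging.to(device, /*non_blocking=*/true);  // pinned -> GPU (async)
}
```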

Please look forward to a follow-up report.