zhihu / cuBERT

Fast implementation of BERT inference directly on NVIDIA GPUs (CUDA, cuBLAS) and Intel MKL

Reproduce running time in README.

Mansterteddy opened this issue · comments

Hi, thanks for sharing this awesome project!

I have run into a problem: when I try to use mklBERT on 32 × Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz, I cannot reproduce the running time reported in the README (2281ms vs 984.9ms). Can you give me some advice on reproducing it?

Yes, we have also found that CPU performance varies between Intel Broadwell and Skylake on MKL 2019.0.1.20181227: on that MKL version, Skylake is slower than Broadwell. After updating to 2019.0.3.20190220 (already the default in our master branch, see cmake/mkl.cmake), Skylake is much faster. We tested on an Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz.

MKL is the key factor for CPU performance, because BERT depends heavily on GEMM (matrix multiplication). We have also found that KMP_BLOCKTIME and KMP_AFFINITY have a small impact. It is fine to start a benchmark without them at first.
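For reference, the Intel OpenMP variables mentioned above are set in the environment before launching the benchmark. The specific values below are illustrative assumptions, not settings taken from this thread:

```shell
# Illustrative Intel OpenMP settings; the exact values are assumptions,
# tune them for your own machine.
export KMP_BLOCKTIME=0                             # threads yield immediately after a parallel region
export KMP_AFFINITY=granularity=fine,compact,1,0   # pin OpenMP threads to physical cores
export OMP_NUM_THREADS=28                          # match your physical core count
```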

Below is a log from our broadwell CPU:

./tfBERT_benchmark 
1.12.0
=== warm_up ===
TF: 4465ms
TF: 1523ms
TF: 1523ms
TF: 1566ms
TF: 1545ms
TF: 1534ms
TF: 1532ms
TF: 1545ms
TF: 1522ms
TF: 1528ms
=== benchmark ===
TF: 1516ms
TF: 1521ms
TF: 1523ms
TF: 1537ms
TF: 1544ms
TF: 1543ms
TF: 1547ms
TF: 1528ms
TF: 1531ms
TF: 1557ms

./cuBERT_benchmark 
model loaded from: bert_frozen_seq32.pb
Found CPU CUBERT_NUM_CPU_MODELS: 1
device setup: 0. Took 1401 milliseconds.
=== warm_up ===
cuBERT: 1116ms
cuBERT: 1212ms
cuBERT: 1056ms
cuBERT: 992ms
cuBERT: 938ms
cuBERT: 906ms
cuBERT: 907ms
cuBERT: 917ms
cuBERT: 1025ms
cuBERT: 906ms
=== benchmark ===
cuBERT: 909ms
cuBERT: 906ms
cuBERT: 925ms
cuBERT: 940ms
cuBERT: 905ms
cuBERT: 928ms
cuBERT: 905ms
cuBERT: 969ms
cuBERT: 1150ms
cuBERT: 963ms

If possible, please also share your performance numbers and detailed machine environment.

Thanks for your kind reply!

I tested mklBERT on a new machine (8 × Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz), but I still cannot reproduce your running time. Here is my log:

model loaded from: bert_frozen_seq32.pb
Found CPU CUBERT_NUM_CPU_MODELS: 1  
device setup: 0. Took 998 milliseconds.
=== warm_up ===
cuBERT: 3886ms
cuBERT: 3571ms
cuBERT: 3576ms
cuBERT: 3574ms
cuBERT: 3565ms
cuBERT: 3574ms
cuBERT: 3573ms
cuBERT: 3576ms
cuBERT: 3575ms
cuBERT: 3571ms
=== benchmark ===
cuBERT: 3578ms
cuBERT: 3580ms
cuBERT: 3571ms
cuBERT: 3571ms
cuBERT: 3577ms
cuBERT: 3570ms
cuBERT: 3571ms
cuBERT: 3570ms
cuBERT: 3573ms
cuBERT: 3566ms

It seems that this performance is worse than your TF benchmark. Should I do anything before compiling mklBERT?

Your result seems reasonable, because your CPU only has 8 cores, while our result was obtained on 28 cores. Both TensorFlow and our code use multi-threading internally; you can check your CPU usage while the benchmark is running.

Also, it is recommended to run ./tfBERT_benchmark for comparison. tfBERT_benchmark depends on the TensorFlow C API; you can first install it from https://www.tensorflow.org/install/lang_c and then rebuild our project.
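As a sketch, installing the prebuilt TensorFlow C library might look like the following. The URL pattern follows the TensorFlow install page linked above, and 1.12.0 is assumed here because that is the version printed at the top of the tfBERT_benchmark log:

```shell
# Download and install the prebuilt TensorFlow C library (CPU, Linux x86_64).
# 1.12.0 is an assumption matching the version in the benchmark log above.
TF_VERSION=1.12.0
curl -LO "https://storage.googleapis.com/tensorflow/libtensorflow/libtensorflow-cpu-linux-x86_64-${TF_VERSION}.tar.gz"
sudo tar -C /usr/local -xzf "libtensorflow-cpu-linux-x86_64-${TF_VERSION}.tar.gz"
sudo ldconfig   # refresh the linker cache so libtensorflow.so is found
```

After this, rebuilding the project should pick up the library from /usr/local.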

Thanks for your advice! I will post tfBERT_benchmark results later. By the way, what does "multi-threading internally" mean?

It means the computation for each request/call is spread across N cores.
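If you want to control that per-call core count explicitly, MKL and OpenMP both respect standard environment variables. A minimal sketch, assuming you want to cap each call at 8 threads to match the 8-core machine discussed above:

```shell
# Cap the number of threads MKL/OpenMP use per call
# (8 is an example matching the 8-core machine in this thread)
export MKL_NUM_THREADS=8
export OMP_NUM_THREADS=8
```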