CuBERT not utilizing all threads with multi-cpu

volker42maru opened this issue

Hi there,

I was running cuBERT_benchmark.py and noticed that CuBERT does not utilize all threads on a machine with multiple CPUs, even with MKL_NUM_THREADS and OMP_NUM_THREADS set. In my case only CPU #1 is fully utilized, while CPU #2 is almost idle (see attached image). Is there a reason for this behaviour?


For comparison, I ran TF-BERT, and it utilizes all threads of both CPUs.

Also, I am trying to use CuBERT in another application that uses multiprocessing as well. Is it possible that Python's multiprocessing is interfering with CuBERT's multi-threading? In that application CuBERT runs slower than TF-BERT and uses only some of the threads, in an irregular pattern, while in the benchmark it is faster.

Thanks for your help

CuBERT seems to be running on all threads of all CPUs now. It seems the KMP flag was the issue. However, the benchmark is actually slower when all threads are utilized.

Anyway, it seems I have to experiment a bit with the flags to get it running properly. CuBERT is still slower than in the benchmark, though, when I use it in my other application with multiprocessing.

What CPU do you use? Do you run cuBERT inside docker with limited CPU quota? Does the caller have many threads and call cuBERT concurrently?

Could you provide the running time of benchmark_tf.cpp and benchmark_cu.cpp?

I am running both TF-BERT and CuBERT in Python at the moment, because my server is also implemented in Python. I included TF-BERT in the Python benchmark script by loading the frozen graph into a TF session.

Here are the results for seq_len=32, bsz=128:

=== benchmark TF Version ===
TF-BERT: 1849.33740234375 ms
TF-BERT: 1877.760986328125 ms
TF-BERT: 1908.39501953125 ms
TF-BERT: 1888.963134765625 ms
TF-BERT: 1909.74169921875 ms
TF-BERT: 1910.92138671875 ms
TF-BERT: 1872.8515625 ms
TF-BERT: 1885.251708984375 ms
TF-BERT: 1908.847900390625 ms
TF-BERT: 1900.97265625 ms

=== benchmark CuBERT Version ===
cuBERT: 1507.418701171875 ms
cuBERT: 1759.540283203125 ms
cuBERT: 1363.7158203125 ms
cuBERT: 1274.797119140625 ms
cuBERT: 1331.03173828125 ms
cuBERT: 1634.052001953125 ms
cuBERT: 1533.854736328125 ms
cuBERT: 1359.9267578125 ms
cuBERT: 1317.154296875 ms
cuBERT: 1329.90966796875 ms

So in this case, CuBERT is indeed faster than the TF version. I am running the test on 2 × Intel® Xeon® Processor E5-2637 v4 (16 threads in total).
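To put a number on the difference, here is a small sketch that summarizes the two runs above. The timings are copied from the output and rounded to two decimals; the variable and function names are only illustrative and not part of the benchmark script.

```python
from statistics import mean, stdev

# Timings in ms, copied from the benchmark output above (rounded to 2 decimals).
tf_bert_ms = [1849.34, 1877.76, 1908.40, 1888.96, 1909.74,
              1910.92, 1872.85, 1885.25, 1908.85, 1900.97]
cubert_ms = [1507.42, 1759.54, 1363.72, 1274.80, 1331.03,
             1634.05, 1533.85, 1359.93, 1317.15, 1329.91]


def summarize(name, samples):
    """Print one summary line and return (mean, sample standard deviation)."""
    m, s = mean(samples), stdev(samples)
    print(f"{name}: mean={m:.1f} ms, stdev={s:.1f} ms")
    return m, s


tf_mean, tf_std = summarize("TF-BERT", tf_bert_ms)
cu_mean, cu_std = summarize("cuBERT", cubert_ms)
print(f"speedup (TF-BERT / cuBERT): {tf_mean / cu_mean:.2f}x")
```

On these ten runs CuBERT comes out roughly 1.3x faster on average, but its timings vary much more from run to run (about 1275 ms to 1760 ms) than those of TF-BERT.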

The problem I have right now seems to be related to thread scheduling. When I set KMP_AFFINITY=compact in my Python server (I am running my BERT worker in a separate Python process), inference gets terribly slow and CuBERT seems to utilize only 1 thread (out of 16 available).

When I set KMP_AFFINITY=none, CuBERT actually utilizes all available threads, but in this case it is still slower than TF-BERT (the thread scheduling strategy probably affects performance significantly).
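One way to narrow this down might be to log, from inside the worker process, which CPUs it is allowed to run on and which threading variables it actually sees. The sketch below is only a diagnostic idea: `thread_report` is a made-up helper name, and whether the affinity mask explains the slowdown is an assumption. Note that `os.sched_getaffinity` is Linux-only and reports the mask of the calling thread.

```python
import os

# Threading-related variables mentioned in this thread.
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "KMP_AFFINITY", "KMP_BLOCKTIME")


def thread_report():
    """Collect the CPUs the caller may run on and the threading variables it sees."""
    # os.sched_getaffinity exists on Linux only; elsewhere fall back to all CPUs.
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = list(range(os.cpu_count() or 1))
    env = {name: os.environ.get(name) for name in THREAD_VARS}
    return {"pid": os.getpid(), "allowed_cpus": allowed, "env": env}


report = thread_report()
print(report)
```

Calling this once in the benchmark script and once in the worker, before and after the first inference, would show whether the two processes really run with the same settings.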

I am using your suggested flags: KMP_BLOCKTIME=0 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 MKL_NUM_THREADS=16
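For reference, this is roughly how the flags could be applied inside the worker process itself. It is a sketch under assumptions: `build_thread_env` and `bert_worker_init` are hypothetical names, and it relies on the OpenMP/MKL runtime reading these variables when it initializes, so they are set before the model library is loaded.

```python
import os


def build_thread_env(num_threads):
    """Return the flags quoted above as a dict of environment variables."""
    return {
        "KMP_BLOCKTIME": "0",
        "KMP_AFFINITY": "granularity=fine,verbose,compact,1,0",
        "MKL_NUM_THREADS": str(num_threads),
    }


def bert_worker_init(num_threads=16):
    """Hypothetical worker entry point: set the flags first, load the model after."""
    env = build_thread_env(num_threads)
    os.environ.update(env)
    # Placeholder: import and load CuBERT only after this point, so that the
    # runtime sees the variables when it starts up.
    return {name: os.environ[name] for name in env}


applied = bert_worker_init(16)
print(applied)
```

In a multiprocessing setup, `bert_worker_init` would be the first thing the target function of the worker process calls. If the parent process has already initialized the threading runtime and the worker is forked from it, these settings may not take effect; that part is untested.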

I would really appreciate your input

> Do you run cuBERT inside docker with limited CPU quota?

I am not running inside a docker container. I use the same conda environment and the same CPU server for benchmarking/inference server.

> Does the caller have many threads and call cuBERT concurrently?

At the moment I am testing without concurrency