nnp_convolution_output much slower than nnp_convolution_inference
jiecaoyu opened this issue
Hi, I am testing NNPACK on a Raspberry Pi 3 B+ with a 4-core Cortex-A53 ARM CPU, and found `nnp_convolution_output` to be much slower than `nnp_convolution_inference` (around 4~5x slower). Could you give some insight into why `nnp_convolution_output` is so slow? Thanks!

I tried `nnp_convolution_output` and got:
```
$ ./bin/convolution-benchmark -ic 256 -oc 256 -is 32 32 -ks 3 3 -m output -i 50 -a wt8x8 -ts compute
Batch size: 1
Input channels: 256
Output channels: 256
Input: 32x32 with implicit padding 0
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 4
Iterations: 50
Time: 443.646 ms
Input transform: 3.923 ms (0.9%) [0.7 GB/s]
Kernel transform: 44.700 ms (10.1%) [0.4 GB/s]
Output transform: 7.864 ms (1.8%) [0.3 GB/s]
Block multiplication: 386.999 ms (87.2%) [0.5 GFLOPS]
Overhead: 0.160 ms (0.0%)
```
Then I tried `nnp_convolution_inference` and got:
```
$ ./bin/convolution-benchmark -ic 256 -oc 256 -is 32 32 -ks 3 3 -m inference -i 50 -a wt8x8 -ts compute
Batch size: 1
Input channels: 256
Output channels: 256
Input: 32x32 with implicit padding 0
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 4
Iterations: 50
Time: 89.602 ms
Input transform: 2.885 ms (3.2%) [0.9 GB/s]
Kernel transform: 33.768 ms (37.7%) [0.6 GB/s]
Output transform: 3.966 ms (4.4%) [0.6 GB/s]
Block multiplication: 48.942 ms (54.6%) [4.3 GFLOPS]
Overhead: 0.042 ms (0.0%)
```
As you can see, the main difference is in the block multiplication. So I made a change to `nnp_convolution_output` to let it use the same GEMM kernel as `nnp_convolution_inference`:
```diff
diff --git a/src/convolution-output.c b/src/convolution-output.c
index 1522cfb..d772c95 100644
--- a/src/convolution-output.c
+++ b/src/convolution-output.c
@@ -386,8 +386,8 @@ static enum nnp_status compute_fast_convolution_output(
 			matrix_multiplication_context.full_gemm = nnp_hwinfo.cxgemm.cX_conjb_upto_mr_x_nr;
 		}
 	} else {
-		matrix_multiplication_context.fast_gemm = nnp_hwinfo.sxgemm.only_mr_x_nr;
-		matrix_multiplication_context.full_gemm = nnp_hwinfo.sxgemm.upto_mr_x_nr;
+		matrix_multiplication_context.fast_gemm = nnp_hwinfo.hxgemm.only_mr_x_nr;
+		matrix_multiplication_context.full_gemm = nnp_hwinfo.hxgemm.upto_mr_x_nr;
 	}
 	pthreadpool_compute_2d_tiled(threadpool,
 		(pthreadpool_function_2d_tiled_t) compute_matrix_multiplication,
```
Then the computation performance improved:
```
$ ./bin/convolution-benchmark -ic 256 -oc 256 -is 32 32 -ks 3 3 -m output -i 50 -a wt8x8 -ts compute
Batch size: 1
Input channels: 256
Output channels: 256
Input: 32x32 with implicit padding 0
Kernel: 3x3
Subsampling: 1x1
Algorithm: WT8x8
Threads: 4
Iterations: 50
Time: 297.317 ms
Input transform: 4.166 ms (1.4%) [0.6 GB/s]
Kernel transform: 46.298 ms (15.6%) [0.4 GB/s]
Output transform: 8.054 ms (2.7%) [0.3 GB/s]
Block multiplication: 238.641 ms (80.3%) [0.9 GFLOPS]
Overhead: 0.159 ms (0.1%)
```
but it is still much slower than `nnp_convolution_inference`.

I think this is because `nnp_convolution_output` is designed for processing larger batch sizes. I listed some results here:
```
./bin/convolution-benchmark -b <Batch-Size> -ic 256 -oc 256 -is 32 32 -ks 3 3 -m output -i 50 -a wt8x8 -ts compute
```
| Batch Size | Time (ms) | Estimated Relative Time to Inference |
|---|---|---|
| 1 | 297.317 | 3.32x |
| 2 | 320.964 | 1.79x |
| 4 | 493.790 | 1.37x |
| 8 | 669.373 | 0.93x |
| 16 | 1127.133 | 0.79x |
When the batch size is 8 or larger, `nnp_convolution_output` gets a win.
However, I think it might still be necessary to change the GEMM kernel used in `nnp_convolution_output` to properly support the ARM backend.