google / XNNPACK

High-efficiency floating-point neural network inference operators for mobile, server, and Web

Performance drops on deeper convolution layers? (Pixel 4)

cakeng opened this issue

Hello.

We're testing the convolution performance of XNNPACK on a Google Pixel 4 (Android 11, CPU only, 4 threads).

We've found that XNNPACK's throughput drops quite significantly in the deeper convolution layers of VGG16.
On the layers with fewer than 512 output channels, throughput is about 110-120 GFLOP/s,
but in the last 6 convolution layers, which have 512 output channels, it drops to about 60 GFLOP/s.
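
(For reference, the figures above are GFLOP/s computed from the standard 2 * OH * OW * Cout * Cin * KH * KW multiply-add count of a convolution, divided by the measured time per run. A minimal sketch of that calculation follows; the helper names are ours for illustration, not part of XNNPACK.)

// Sketch only: turning layer dimensions and a measured per-run time into GFLOP/s.
// convGflops / gflopsPerSecond are illustrative names, not XNNPACK APIs.
#include <cstddef>

double convGflops(size_t outH, size_t outW, size_t outC,
                  size_t inC, size_t kH, size_t kW)
{
    // 2 FLOPs (multiply + add) per kernel tap, per input channel, per output element.
    return 2.0 * outH * outW * outC * inC * kH * kW * 1e-9;
}

double gflopsPerSecond(double gflops, double microsecondsPerRun)
{
    return gflops / (microsecondsPerRun * 1e-6); // microseconds -> seconds
}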

Is this behavior normal? We cross-compiled XNNPACK from source with the Android NDK and call it from our test code as follows.

#include <xnnpack.h>
#include <pthreadpool.h>
#include <chrono>
#include <cstdlib>
#include <iostream>

// Returns the average time per xnn_run_operator call, in microseconds.
double xnnpackTest(float *xnnpackOutput, int runNum)
{
    pthreadpool_t threadpool = pthreadpool_create(threadNum);

    // Our test tensors are stored in NCHW; XNNPACK's f32 convolution expects
    // NHWC activations and OHWI filters, so convert first.
    float *input = NCHWtoNHWC(testInputTensor, 1, testChannels, testHeight, testWidth);
    float *filter = NCHWtoNHWC(testFilterTensor, testBlocks, testChannels, testFilHeight, testFilWidth);

    xnn_status status;
    if (xnn_initialize(nullptr /* allocator */) != xnn_status_success)
    {
        std::cerr << "failed to initialize XNNPACK" << std::endl;
    }
    xnn_operator_t op0 = nullptr;
    status = xnn_create_convolution2d_nhwc_f32(
        padding /* top padding */, padding /* right padding */,
        padding /* bottom padding */, padding /* left padding */,
        testFilHeight /* kernel height */, testFilWidth /* kernel width */,
        stride /* subsampling height */, stride /* subsampling width */,
        dilation /* dilation height */, dilation /* dilation width */,
        1 /* groups */,
        testChannels /* input channels per group */,
        testBlocks /* output channels per group */,
        testChannels /* input pixel stride */,
        testBlocks /* output pixel stride */,
        filter, testBiasTensor,
        -(__builtin_inff()) /* output min */, __builtin_inff() /* output max */,
        0 /* flags */,
        &op0);
    if (status != xnn_status_success)
    {
        std::cerr << "failed to create operation #0 - status: " << status << std::endl;
    }
    status = xnn_setup_convolution2d_nhwc_f32(
        op0,
        1 /* batch size */, testHeight /* input height */, testWidth /* input width */,
        input /* input */, xnnpackOutput /* output */,
        threadpool /* threadpool */);
    if (status != xnn_status_success)
    {
        std::cerr << "failed to setup operation #0" << std::endl;
    }
    // Warm-up runs (not timed).
    for (int i = 0; i < runNum / 10; i++)
    {
        status = xnn_run_operator(op0, threadpool);
    }
    // Timed runs.
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < runNum; i++)
    {
        status = xnn_run_operator(op0, threadpool);
    }
    auto t2 = std::chrono::high_resolution_clock::now();

    xnn_delete_operator(op0);
    pthreadpool_destroy(threadpool);
    free(input);
    free(filter);
    return (double)std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count() / (double)runNum;
}

Is there anything wrong with our test code? The computation results were correct.
(We passed 1024 for runNum in our tests.)
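
For concreteness, a hypothetical driver for one of the 512-channel layers would look roughly like this. The shape is assumed (a conv5-style layer: 512 -> 512 channels, 14x14 input, 3x3 kernel, stride 1, padding 1, dilation 1), the globals are the ones referenced in the function above, and convGflops / gflopsPerSecond are the helpers from the earlier sketch:

// Hypothetical driver: the shape and globals here are assumptions for
// illustration, not copied verbatim from our actual harness.
testChannels  = 512;  testBlocks   = 512;   // 512 -> 512 channels (conv5-style)
testHeight    = 14;   testWidth    = 14;    // 14x14 input feature map
testFilHeight = 3;    testFilWidth = 3;     // 3x3 kernel
padding = 1; stride = 1; dilation = 1; threadNum = 4;

std::vector<float> output(testHeight * testWidth * testBlocks); // same spatial size (stride 1, pad 1)
double usPerRun = xnnpackTest(output.data(), 1024 /* runNum */);
double gflops = convGflops(testHeight, testWidth, testBlocks,
                           testChannels, testFilHeight, testFilWidth);
std::cout << gflopsPerSecond(gflops, usPerRun) << " GFLOP/s" << std::endl;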

By the way, does "igemm" in the kernel names refer to Indirect GEMM? I've found that the Implicit GEMM method used in cuDNN shares the same abbreviation.
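
(To clarify what we mean by Indirect GEMM: our understanding is the indirection-buffer approach, where the micro-kernel reads input pixels through a table of pointers into the NHWC input instead of a packed im2col matrix. Below is a rough, hypothetical sketch of building such a buffer, assuming stride 1 and dilation 1; it is only our illustration, not XNNPACK's actual implementation.)

// Rough illustration of an indirection buffer (not XNNPACK's actual code):
// one pointer per (output pixel, kernel tap) pair, so a GEMM micro-kernel can
// gather input rows through pointers instead of copying them into an im2col buffer.
#include <cstddef>
#include <vector>

std::vector<const float*> buildIndirectionBuffer(
    const float* input,                  // NHWC input, H * W * C floats
    size_t H, size_t W, size_t C,        // input height / width / channels
    size_t KH, size_t KW,                // kernel height / width
    size_t pad, const float* zeroRow)    // zeroRow: C zeros used for padded taps
{
    const size_t OH = H + 2 * pad - KH + 1;  // stride 1, dilation 1 assumed
    const size_t OW = W + 2 * pad - KW + 1;
    std::vector<const float*> indirection;
    indirection.reserve(OH * OW * KH * KW);
    for (size_t oy = 0; oy < OH; oy++)
        for (size_t ox = 0; ox < OW; ox++)
            for (size_t ky = 0; ky < KH; ky++)
                for (size_t kx = 0; kx < KW; kx++)
                {
                    const ptrdiff_t iy = (ptrdiff_t)(oy + ky) - (ptrdiff_t)pad;
                    const ptrdiff_t ix = (ptrdiff_t)(ox + kx) - (ptrdiff_t)pad;
                    const bool outside = iy < 0 || iy >= (ptrdiff_t)H ||
                                         ix < 0 || ix >= (ptrdiff_t)W;
                    // Padded taps all point at one shared row of zeros.
                    indirection.push_back(outside ? zeroRow
                                                  : input + (iy * W + ix) * C);
                }
    return indirection;
}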

Thank you.