atranitell / OpenCL-Benchmarks

Benchmarks for evaluating different convolution operators running on various devices.

Device Attributes

  • Device: Mali T860
  • OpenCL Version: v1.2
  • Max clock frequency: 200 MHz
  • Max compute units: 4
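
These attributes can be read back from the OpenCL runtime. A minimal host-side sketch, assuming a single GPU platform/device and omitting error checks:

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
  cl_platform_id platform;
  cl_device_id device;
  clGetPlatformIDs(1, &platform, NULL);
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

  char name[128];
  cl_uint freq_mhz, units;
  clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
  clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(freq_mhz), &freq_mhz, NULL);
  clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);

  /* On the device used here this reports 200 MHz and 4 compute units. */
  printf("%s: %u MHz, %u compute units\n", name, freq_mhz, units);
  return 0;
}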

Convolution Ops Analysis

  • inputs: [1, 512, 13, 13]
  • weights: [1024, 512, 3, 3], stride 1, padding 1
  • bias: [1024]
  • outputs: [1, 1024, 13, 13]
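
For scale, this layer performs out_c * in_c * k_h * k_w * out_h * out_w = 1024 * 512 * 3 * 3 * 13 * 13 ≈ 797 million multiply-accumulates (about 1.6 GFLOPs of work) per forward pass.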

Setting: normal

__kernel void conv2d(__global const float* in,
                     __global float* out,
                     __global float* filter,
                     __global float* bias,
                     const int in_c,
                     const int in_h,
                     const int in_w,
                     const int out_c,
                     const int out_h,
                     const int out_w,
                     const int filter_h,
                     const int filter_w,
                     const int stride_h,
                     const int stride_w,
                     const int pad_h,
                     const int pad_w,
                     const int dilation_h,
                     const int dilation_w) {
  // One work-item per output element: x -> output column, y -> output row, z -> output channel.
  int ow = get_global_id(0);
  int oh = get_global_id(1);
  int oc = get_global_id(2);

  if (ow >= out_w || oh >= out_h || oc >= out_c) return;

  // Top-left input coordinate covered by this output element, plus the flattened
  // base offsets of this output channel's filter and of the output element itself.
  int ih = oh * stride_h - pad_h;
  int iw = ow * stride_w - pad_w;
  int p_fc = oc * in_c * filter_h * filter_w;
  int p_out = oc * out_h * out_w + oh * out_w + ow;

  float reg = 0;
  for (int ic = 0; ic < in_c; ++ic) {
    int p_in = ic * in_h * in_w + ih * in_w + iw;
    int p_f = p_fc + ic * filter_h * filter_w;
    for (int fh = 0; fh < filter_h; ++fh) {
      for (int fw = 0; fw < filter_w; ++fw) {
        int ix = iw + fw * dilation_w;
        int iy = ih + fh * dilation_h;
        // Skip taps that fall in the zero padding outside the input.
        if (ix >= 0 && ix < in_w && iy >= 0 && iy < in_h) {
          reg += filter[p_f + fh * filter_w + fw] * in[p_in + fw * dilation_w];
        }
      }
      p_in += in_w * dilation_h;  // advance one (dilated) input row
    }
  }

  out[p_out] = reg + bias[oc];
}
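The kernel assigns one work-item per output element, so the global work size is simply (out_w, out_h, out_c). A hypothetical launch-and-timing helper follows (the repo's actual host code may differ; kernel arguments are assumed to be set already, and the Queue / Submit / Run columns presumably correspond to the deltas between the event profiling timestamps):

#include <stdio.h>
#include <CL/cl.h>

// Hypothetical helper: enqueue conv2d over the full output and read back the
// event profiling timestamps. The queue must be created with CL_QUEUE_PROFILING_ENABLE.
static void run_and_time(cl_command_queue queue, cl_kernel conv2d) {
  size_t global[3] = {13, 13, 1024};   /* out_w, out_h, out_c */
  cl_event evt;
  clEnqueueNDRangeKernel(queue, conv2d, 3, NULL, global, NULL, 0, NULL, &evt);
  clWaitForEvents(1, &evt);

  cl_ulong queued, submit, start, end;
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(submit), &submit, NULL);
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,  sizeof(start),  &start,  NULL);
  clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,    sizeof(end),    &end,    NULL);
  printf("queue %.3f  submit %.3f  run %.3f (ms)\n",
         (submit - queued) * 1e-6, (start - submit) * 1e-6, (end - start) * 1e-6);
  clReleaseEvent(evt);
}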
Setting   Queue (ms)   Submit (ms)   Run (ms)   Total (ms)
normal    0.146        1.193         913.301    913.750

Setting: predefined macros

We can pass these shape parameters to the kernel as predefined macros so the compiler can specialize and unroll the convolution loops at build time; the settings below add them incrementally (a build-option sketch follows the list).

  • s1: filter
#define FILTER_H 3
#define FILTER_W 3
  • s2: filter + dilation
#define DILATION_H 1
#define DILATION_W 1
  • s3: filter + dilation + pad + stride
#define PAD_H 1
#define PAD_W 1
#define STRIDE_H 1
#define STRIDE_W 1
  • s4: filter + dilation + pad + stride + in
#define IN_C 512
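One way to inject these macros is through the program build options. A minimal sketch, assuming the s4 setting and the usual clBuildProgram flow (the repo's actual build code may differ):

#include <CL/cl.h>

// Hypothetical s4 build: bake filter size, dilation, padding, stride and the
// input-channel count into the kernel as compile-time constants.
static cl_int build_s4(cl_program program, cl_device_id device) {
  const char* options =
      "-DFILTER_H=3 -DFILTER_W=3 "
      "-DDILATION_H=1 -DDILATION_W=1 "
      "-DPAD_H=1 -DPAD_W=1 -DSTRIDE_H=1 -DSTRIDE_W=1 "
      "-DIN_C=512";
  return clBuildProgram(program, 1, &device, options, NULL, NULL);
}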
Setting     Queue (ms)   Submit (ms)   Run (ms)   Total (ms)
macros s1   0.166        0.579         961.758    962.171
macros s2   0.100        0.510         747.560    747.969
macros s3   0.131        0.862         747.498    748.229
macros s4   0.043        0.357         746.977    747.291

Setting: unrolling

With the filter size fixed, the inner for loops can be unrolled into a straight-line sequence of statements (a fragment of the s1 variant follows the list below).

  • s1: unroll and remove unnecessary ops
iw + 0 * DILATION_W --> iw
iw + 1 * DILATION_W --> iw + DILATION_W
  • s2: unroll but keep the original index structure
iw + 0 * DILATION_W + 0
iw + 1 * DILATION_W + 1
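Roughly, the unrolled body of the s1 variant looks like this (kernel fragment only; p_f, ih, iw, ic, reg and the macros are as in the normal kernel above, a 3x3 filter is assumed, and row0/row1/row2 are hypothetical names for the per-row input offsets):

// s1 sketch: the fh/fw loops replaced by nine straight-line multiply-accumulates,
// with the redundant "0 *" terms removed.
int iy0 = ih;                   int ix0 = iw;
int iy1 = ih + DILATION_H;      int ix1 = iw + DILATION_W;
int iy2 = ih + 2 * DILATION_H;  int ix2 = iw + 2 * DILATION_W;
int row0 = ic * in_h * in_w + iy0 * in_w;
int row1 = row0 + DILATION_H * in_w;
int row2 = row1 + DILATION_H * in_w;

if (iy0 >= 0 && iy0 < in_h) {
  if (ix0 >= 0 && ix0 < in_w) reg += filter[p_f + 0] * in[row0 + ix0];
  if (ix1 >= 0 && ix1 < in_w) reg += filter[p_f + 1] * in[row0 + ix1];
  if (ix2 >= 0 && ix2 < in_w) reg += filter[p_f + 2] * in[row0 + ix2];
}
if (iy1 >= 0 && iy1 < in_h) {
  if (ix0 >= 0 && ix0 < in_w) reg += filter[p_f + 3] * in[row1 + ix0];
  if (ix1 >= 0 && ix1 < in_w) reg += filter[p_f + 4] * in[row1 + ix1];
  if (ix2 >= 0 && ix2 < in_w) reg += filter[p_f + 5] * in[row1 + ix2];
}
if (iy2 >= 0 && iy2 < in_h) {
  if (ix0 >= 0 && ix0 < in_w) reg += filter[p_f + 6] * in[row2 + ix0];
  if (ix1 >= 0 && ix1 < in_w) reg += filter[p_f + 7] * in[row2 + ix1];
  if (ix2 >= 0 && ix2 < in_w) reg += filter[p_f + 8] * in[row2 + ix2];
}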
Setting     Queue (ms)   Submit (ms)   Run (ms)   Total (ms)
unroll s1   0.151        0.571         817.677    818.097
unroll s2   0.104        0.821         793.995    794.712

Setting: built-in functions

Use OpenCL built-in functions instead of plain arithmetic: mad24 for the 24-bit integer index math and fma for the floating-point accumulation (a fragment follows), e.g.

a + b * c  ->  mad24(b, c, a) or fma(b, c, a)
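Applied to the kernel above, this looks roughly as follows (fragment only; the indices involved fit comfortably in 24 bits for these shapes):

// Index arithmetic with mad24 (a * b + c on 24-bit integers) and the
// accumulation with fma (a * b + c as a single fused floating-point op).
int p_out = mad24(mad24(oc, out_h, oh), out_w, ow);   // oc*out_h*out_w + oh*out_w + ow
int f_idx = mad24(fh, filter_w, fw);                  // fh*filter_w + fw
reg = fma(filter[p_f + f_idx], in[p_in + fw * dilation_w], reg);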
Setting    Queue (ms)   Submit (ms)   Run (ms)   Total (ms)
inner s1   0.095        0.758         747.630    248.293

Setting: optimized operator

The operator is divided into two parts: an interior part that never reads outside the input and therefore needs no condition statements, and a border part that keeps the boundary checks. A rough sketch of the interior part follows.
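This is a hypothetical interior kernel, not necessarily the exact split used in the repo. With stride 1, pad 1 and a 3x3 filter, every output at 1 <= oh <= out_h-2 and 1 <= ow <= out_w-2 reads only valid input, so the per-tap bounds check disappears:

__kernel void conv2d_interior(__global const float* in,
                              __global float* out,
                              __global const float* filter,
                              __global const float* bias,
                              const int in_c, const int in_h, const int in_w,
                              const int out_c, const int out_h, const int out_w) {
  // Launched over an (out_w - 2) x (out_h - 2) x out_c range, offset past the border.
  int ow = get_global_id(0) + 1;
  int oh = get_global_id(1) + 1;
  int oc = get_global_id(2);
  if (ow >= out_w - 1 || oh >= out_h - 1 || oc >= out_c) return;

  int ih = oh - 1;                       // stride 1, pad 1
  int iw = ow - 1;
  float reg = 0.0f;
  for (int ic = 0; ic < in_c; ++ic) {
    int p_in = ic * in_h * in_w + ih * in_w + iw;
    int p_f  = (oc * in_c + ic) * 9;     // 3x3 filter
    for (int fh = 0; fh < 3; ++fh)
      for (int fw = 0; fw < 3; ++fw)
        reg += filter[p_f + fh * 3 + fw] * in[p_in + fh * in_w + fw];
  }
  out[oc * out_h * out_w + oh * out_w + ow] = reg + bias[oc];
}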


Setting   Queue (ms)   Submit (ms)   Run (ms)   Total (ms)
conv s1   0.072        0.884         579.417    580.229
  • border -> 197 ms
  • non-border -> 270 ms

About

License: Apache License 2.0

