ysh329 / OpenCL-101

Learn OpenCL step by step.

common Error Q&A


GPU Advantages

When the GPU's frame rate is throttled to match the highest frame rate the CPU achieves, the GPU consumes only half as much power as the CPU. This comes from An Independent Evaluation of
Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420 | Berkeley Design Technology, Inc., July 2015.
The original text reads:

Qualcomm has reported that the GPU mode of the demo consumes half as much power as the CPU mode when throttling the frame rate of the GPU mode to match the highest frame rate achieved in the CPU mode.

In short, the gist of this Adreno 420 based article is that an algorithm implementation must maximize parallelism and conform to the GPU's memory system and core architecture. The article discusses the following points:

  1. Minimize memory copies between the CPU and GPU. The Snapdragon 855 uses the Adreno 640 GPU; according to OpenCL-Z it reports:
    • Host to Device: 10.51 GByte/s
    • Device to Host: 4.54 GByte/s
    • Device to Device: 23.12 GByte/s
      In other words, avoid bouncing between the CPU and GPU when the layers of a model are computed back to back, especially when the next layer's feature map is large: handing that computation to the GPU may not pay off, because downloading the data back is slow (see the host-side sketch after this list);
  2. Carefully manage the limited fast local memory (Local Memory).
  3. Even with a high-level language such as OpenCL, you must understand the GPU's core architecture and write code that matches it. For example, code should minimize branching and use the most appropriate SIMD data types.
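A rough host-side sketch of point 1 (all names are hypothetical, using the OpenCL C++ wrapper): keep the intermediate feature maps in device buffers and let the second layer consume them directly, so the slow device-to-host path is crossed only once at the end.

    #include <CL/cl.hpp>
    #include <vector>

    // Hypothetical sketch: run layer1 then layer2 entirely on the device.
    // featA and featB stay in GPU memory; only the final result is read back.
    void RunTwoLayers(cl::CommandQueue &queue, cl::Kernel &layer1, cl::Kernel &layer2,
                      cl::Buffer &input, cl::Buffer &featA, cl::Buffer &featB,
                      std::vector<float> &hostOut, size_t elems) {
        layer1.setArg(0, input);
        layer1.setArg(1, featA);                      // layer1 writes featA on the device
        queue.enqueueNDRangeKernel(layer1, cl::NullRange, cl::NDRange(elems));

        layer2.setArg(0, featA);                      // layer2 reads featA without a host round trip
        layer2.setArg(1, featB);
        queue.enqueueNDRangeKernel(layer2, cl::NullRange, cl::NDRange(elems));

        // One device-to-host copy at the very end (the 4.54 GByte/s direction above).
        queue.enqueueReadBuffer(featB, CL_TRUE, 0, elems * sizeof(float), hostOut.data());
    }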

CL_INVALID_KERNEL_ARGS

CL_INVALID_KERNEL_ARGS if the kernel argument values have not been specified.

clEnqueueNDRangeKernel
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
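A minimal sketch of the usual cause (kernel and buffer names are hypothetical): every argument must be set with clSetKernelArg before the enqueue, otherwise clEnqueueNDRangeKernel returns CL_INVALID_KERNEL_ARGS (-52).

    /* Hypothetical kernel: __kernel void scale(__global const float *in, __global float *out) */
    cl_int err;
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &inBuf);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &outBuf);  /* forget this one and ... */

    size_t global = 1024;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    /* ... this call returns CL_INVALID_KERNEL_ARGS (-52) instead of launching. */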

printf_buffer_metadata corrupt!

printf("===== a:%d\n", a8x4[0].s0);

The variable a8x4 is a float vector type, but the %d format specifier was used; it should be corrected as below:

printf("===== a:%f\n", a8x4[0].s0);

Debug

A printf function is available in kernels and is very convenient; it can also print vectors. The khronos.org documentation for OpenCL 1.2 and 2.0 describes this printf function identically. To print a vector: printf("f4 = %2.2v4hlf\n", f);, where f is of type float4.

So far the GPUs found to support printing are mainly those in Qualcomm Snapdragon SoCs, with exceptions even within the Snapdragon family: on what appears to be a Snapdragon 410 GPU, adding printf and running under ADB shell hangs the program, while commenting the printf out avoids the hang. Perhaps the 410 does not support printf? That is not certain. Mali GPUs, however, cannot print at all.
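A minimal kernel sketch of the vector printf described above (kernel and argument names are hypothetical):

    __kernel void debug_print(__global const float4 *src) {
        if (get_global_id(0) == 0) {          // print from a single work-item only
            float4 f = src[0];
            printf("f4 = %2.2v4hlf\n", f);    // v4 with the hl length modifier prints all four lanes
        }
    }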


  • When printing from a kernel, wrap the printing code in a macro such as PRINT_KERNEL. This effectively gives two modes, benchmark and debug, that are easy to switch between;
  • When results are wrong and the workload is multi-threaded and large, write a for loop that sweeps sizes starting from 1 to find the smallest failing case. This is especially useful for kernels implemented with Image2D: a small failing case is easier to reason about, and it helps to write a piece of CPU code that simulates Image2D so that, at that small size, you can map the failing Image2D indices to the corresponding Buffer indices. For example, if the first failing thread has global_work_size id (0, 0, 1), use the CPU simulation to find the matching Buffer index;
  • The size range of the unit-test for loop can be based on the tile the kernel computes. For example, if a GEMM kernel computes a 4x8 block of the C matrix per thread, the maximum test size can be (4 + 3) x (8 + 7), just above one tile, as the upper bound (covering both within-tile and past-the-edge cases), incrementing by 1 each step. Such small sizes also make later reproduction and hand calculation easy;
  • In unit tests, initialize inputs with random integers so the results are easy to verify by hand;
  • In unit tests, to rule out errors at small sizes, print every thread from the kernel, i.e. the global_id and the parameters its components represent, such as row, col, and batch_idx;
  • In unit tests, add a LOOP_TEST macro to sweep from the smallest to the largest size, and in the #else branch set up a check for a single failing bad case;
  • While writing a kernel, keep the most naive version of the code for self-checking. During debugging you can also insert the known-correct code into the faulty kernel and switch between them with #if 0 / #else to pin down which part is wrong; the same trick works for debugging the CPU version of the kernel;
  • Incorrect loads caused by boundaries do not matter as long as boundary writes are handled correctly when storing. For example, in a matrix multiplication c = a * b with column-major storage, if a and b have fewer than 4 columns and loads are done 4 at a time with vload4, then for float4 b4x4[4] the components b4x4[0].s1, b4x4[0].s2, and b4x4[0].s3 of the first element b4x4[0] will be loaded incorrectly; matrices b and c have the same issue. In practice this does not matter, because the boundary is handled at store time, so those values are never written into the result (see the sketch after this list);
  • When an error occurs, printing without any condition, especially at real workload sizes, dumps the output of every thread that reaches the printf and makes things hard to read. Add an if condition to control which thread prints, which makes it much easier to isolate the problem (and to check by hand).
    • Scenario: with the PRINT_KERNEL macro enabled the result is correct, with it disabled part of the result is wrong. The guess was a memory overwrite; printing addresses confirmed that two different threads wrote to the same address (two different threads did indeed print the same address). The root cause was an insufficient upper bound in the for loop that walks the boundary when storing the result matrix;
        #ifdef PRINT_KERNEL
        if (row == 0 && col == 0 && bidx == 0) {
            for (int i = 0; i < 8; ++i) {
                printf("row = col = bidx = 0 initialize c8x4[%d] = %2v4hlf\n", i, c8x4[i]);
            }
        }
        #endif
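A sketch of the boundary handling mentioned in the list above (function, variable, and layout names are hypothetical): lanes that were over-read or computed past the matrix edge are simply never written back.

    // Hypothetical helper: store one 4-row segment of a column of the column-major
    // result matrix c, starting at (row, col). Lanes of c4 past the matrix edge may
    // hold garbage from an over-read vload4; the guarded loop never writes them out.
    inline void store_col4_guarded(__global float *c, float4 c4,
                                   int row, int col, int c_rows) {
        float tmp[4] = { c4.s0, c4.s1, c4.s2, c4.s3 };
        int remain = c_rows - row;                 // rows actually left from 'row' downwards
        for (int i = 0; i < 4 && i < remain; ++i) {
            c[col * c_rows + row + i] = tmp[i];    // scalar stores only inside the boundary
        }
    }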

For more methods, see How to debug — MACE documentation

Performance

  • Adreno local work size choice: in testing, the default NullRange local work size and a manually set {16, 16} showed essentially no performance difference when running MobileNetV1; this needs further investigation, worth searching around;
  • Try swapping the order of the three global work size dimensions and see how performance changes; also worth searching around;

Buffer vs. Image


Adreno GPU SDK - FAQs - Qualcomm Developer Network

https://developer.qualcomm.com/software/adreno-gpu-sdk/faq

What is included in the Adreno SDK for OpenCL?

This SDK includes usage examples for Qualcomm Technologies extensions to OpenCL including:

  • Built-in extension functions
  • Compressed images
  • Efficient convolution with and without new built-in extension functions.
  • Variety of kernels using vector read and write operations for given image formats

What is new in the Adreno SDK for OpenCL v1.2?

The OpenCL SDK version 1.2 contains many new examples, including:

  • Bayer/MIPI image examples
  • Matrix multiplication examples, and some other matrix operations
  • Fast Fourier-transform (FFT) examples

OpenCL Optimization from Qualcomm

OpenCL Ref. from Qualcomm

Hardware Tutorials

Adreno GPU SDK - Tutorial Videos - Qualcomm Developer Network
https://developer.qualcomm.com/software/adreno-gpu-sdk/tutorial-videos

  • Adreno Hardware Tutorial 1: Graphics Pipeline Overview
    Watch an overview of the mobile Graphics Pipeline which describes the data flow and processing of graphics data from software to hardware. The APIs available on mobile devices are also covered.

  • Adreno Hardware Tutorial 2: Introduction to the Adreno GPU
    The Adreno Graphics Processing Unit (GPU) is designed to bring console and PC quality 3D games to mobile devices. This video introduces the viewer to the Adreno GPU, some of its key features, and how it relates to the Qualcomm Snapdragon processor.

  • Adreno Hardware Tutorial 3: Tile Based Rendering
    This video investigates the commonly used technique of Tile Based Rendering on mobile devices by focusing specifically on Adreno GPUs using a toon shading sample.

  • Adreno Hardware Tutorial 4: Detecting Adreno GPU
    This video demonstrates a simple way to detect Adreno GPUs using C++ (Native) code.

Others

OpenCL Tips · yszheda/wiki Wiki
https://github.com/yszheda/wiki/wiki/OpenCL-Tips

Sub-optimal performance on Qualcomm Adreno GPUs · Issue #228 · CNugteren/CLBlast
CNugteren/CLBlast#228

Float16 GEMM on Adreno 330 · Issue #181 · CNugteren/CLBlast
CNugteren/CLBlast#181

No definitive conclusion about float16 yet.

local work size and work group size

Opencl global work size vs local work size

In both cases the global size is 1024. In case 1, the local size is 128 and this results in an execution partition that creates 8 work-groups, each of which will iterate through 128 work-items. In case 2, the local size is changed to 256 and this results in 4 work-groups, each with 256 work-items.
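A host-side sketch of case 1 (queue and kernel names are hypothetical): a global size of 1024 with a local size of 128 gives 1024 / 128 = 8 work-groups.

    cl::NDRange global(1024);
    cl::NDRange local(128);       // case 1: 8 work-groups of 128 work-items
    // cl::NDRange local(256);    // case 2: 4 work-groups of 256 work-items
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, global, local);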

Understanding Kernels, Work-groups and Work-items — TI OpenCL User's Guide
https://downloads.ti.com/mctools/esd/docs/opencl/execution/kernels-workgroups-workitems.html

// Kernel execution time (COMMAND_START to COMMAND_END), in milliseconds.
double OpenCLRuntime::getCostTime(const cl::Event *event){
    mCommandQueuePtr->finish();
    mStartNanos = event->getProfilingInfo<CL_PROFILING_COMMAND_START>();
    mStopNanos = event->getProfilingInfo<CL_PROFILING_COMMAND_END>();
    return (mStopNanos - mStartNanos) / 1000000.0;
}

// Time spent waiting in the queue (COMMAND_QUEUED to COMMAND_START), in milliseconds.
double OpenCLRuntime::getQueuedTime(const cl::Event *event){
    mCommandQueuePtr->finish();
    return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() - event->getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>()) / 1000000.0;
}

// Time from submission to the device until execution starts (COMMAND_SUBMIT to COMMAND_START), in milliseconds.
double OpenCLRuntime::getSubmitTime(const cl::Event *event){
    mCommandQueuePtr->finish();
    return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() - event->getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>()) / 1000000.0;
}
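A hypothetical usage sketch of the helpers above (runtime, context, device, kernel, and size names are assumptions); the profiling counters are only populated when the command queue is created with CL_QUEUE_PROFILING_ENABLE.

    // Hypothetical usage: time one kernel launch with the helpers above.
    cl::CommandQueue queue(context, device, CL_QUEUE_PROFILING_ENABLE);
    cl::Event event;
    queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                               cl::NDRange(globalSize), cl::NDRange(localSize),
                               nullptr, &event);
    event.wait();
    double kernelMs = runtime.getCostTime(&event);   // COMMAND_START -> COMMAND_END, in ms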