common Error Q&A
ysh329 opened this issue · comments
GPU 优势
GPU达到CPU最高帧率时的功率消耗只有CPU的一半。这段话来自An Independent Evaluation of
Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420 | Berkeley Design Technology, Inc. July 2015,原文如下:
Qualcomm has reported that the GPU mode of the demo consumes half as much power as the CPU mode when throttling the frame rate of the GPU mode to match the highest frame rate achieved in the CPU mode.
其实这篇基于Adreno430的文章要点如下:算法实现必须最大限度地提高并行性,并符合GPU的内存系统和核心架构,文章讨论了这几点:
- 最小化GPU和CPU之间的的内存拷贝:snapdragon855使用adreno640的GPU,根据OpenCL-Z有如下数据:
- Host to Device: 10.51 GByte/s
- Device to Host: 4.54 GByte/s
- Device to Device: 23.12 GByte/s
换句话说,要避免模型串联计算的时候的CPU、GPU交叉调用,尤其是当下一层的feature map特别大的情况下,还要把计算交给GPU来做,因为下载数据慢很可能就不划算;
- 小心管理有限的快速本地内存(Local Memory)。
- 即使用高级语言(如OpenCL),也必须掌握GPU的核心体系结构特征,让编程符合架构特征来做优化。例如,代码必须减少分支的使用,并注意使用最合适的SIMD数据类型。
CL_INVALID_KERNEL_ARGS
CL_INVALID_KERNEL_ARGS if the kernel argument values have not been specified.
clEnqueueNDRangeKernel
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKernel.html
printf_buffer_metadata corrupt!
printf("===== a:%d\n", a8x4[0].s0);
Var a8x4
is of float vector
type, but format symbol %d
used, which should correct as below:
printf("===== a:%f\n", a8x4[0].s0);
Debug
有个printf
函数可以用,非常方便,此外也可以打印vector
矢量,khronos.org
的OpenCL 1.2和2.0的文档对这个printf Function说明是一样的。这里提一下打印矢量的方法:printf("f4 = %2.2v4hlf\n", f);
,其中f4
是float4
类型。
目前发现能打印的主要是高通骁龙SoC的GPU,但是骁龙系列也有例外,遇到似乎是骁龙410的GPU在加入printf后,在ADB Shell环境执行,会卡主,注释掉printf
就不会,可能这个410不支持printf
?这个不确定。但mali是没法打印的。
- 打印kernel时,在打印部分的代码的前后加宏定义如
PRINT_KERNEL
来控制。相当于两个模式:benchmark和debug,方便切换; - 出现计算结果错误时,因为多线程而且规模较大,可以写
for
从1开始循环各个值,拿到错误的最小规模来排查。尤其是对于Image2D的方式实现的kernel,找到小规模比较方便,另外是需要写一段CPU的代码来模拟Image2D,更好更快的定位这个出错小规模下Image2D和Buffer形式的索引转换,例如对应的错误起始的线程global_work_size对应的id分别为(0,0,1)为起始的计算错误的线程,用CPU模拟找到对应的Buffer形式的索引值; - 单元测试的
for
循环规模大小,可以基于kernel实现的规模来设定,比方gemm的kernel每个线程计算4x8
的C
矩阵,那么单元测试的最大规模可以跑(4 + 3) x (8 + 7)
刚好大于这个规模作为上界(检查少于和超出边界的情况),每次自增1,这样的小规模也方便后期排查错误时复现和手算; - 单元测试中,随机初始化的值建议用整型,易于手算;
- 单元测试中,排除小规模的计算错误,可以在kernel中把所有线程打出来,即
global_id
以及各自分量所代表的参数值,如row
、col
、batch_idx
; - 单元测试中,加入宏
LOOP_TEST
定义来遍历各种从最小规模到最大规模,同时针对#else
的情况,设置检查单个出错的bad case; - 写kernel过程中,留下最naive的代码,用来自检查。此外,也可以在排查错误的过程中,通过把正确代码插入到错误kernel中,加入
#if 0
、#else
来控制,确定哪部分出错,也可以用该方法来调试CPU的kernel; - 边界问题带来的加载数据不正确,在保存时对边界写入处理得当,就不会出现。比方矩阵乘法
c = a * b
,加载a
和b
时,因为列优先(列主序列)存储的方式,若a
和b
的列数小于4,每次加载以矢量vload4
加载4个,那么比方float4 b4x4[4]
的第一个元素b4x4[0]
中的b4x4[0].s1
、b4x4[0].s2
、b4x4[0].s3
就会加载出错,矩阵b
和c
也有这个问题,但实际并没关系,因为存的时候处理好了边界。并不会将这些值写入结果中; - 出现错误时,不加控制的打印,尤其是当实际业务中,计算规模很大,会导致打印处相关的代码的结果全部出来,难以排查,可以加
if-else
判断,控制打印哪一个线程便于排查问题(也方便手算)。- 场景:开启
PRINT_KERNEL
的宏定义计算结果正确,关闭就部分错误。猜想是出现内存复写的情况,打印出内存地址,发现确实存在两个不同线程对同一地址的复写(前不久确实发现有两个线程出现内存地址复写的情况,有两个不同的线程,打印出来了相同的地址。原因是在保存结果矩阵的时候,对边界遍历的for
的上界设置的条件不足导致的);
- 场景:开启
#ifdef PRINT_KERNEL
if (row == 0 && col == 0 && bidx == 0) {
for (int i = 0; i < 8; ++i) {
printf("row = col = bidx = 0 initialize c8x4[%d] = %2v4hlf\n", i, c8x4[i]);
}
}
#endif
更多方式可以看How to debug — MACE documentation
性能
- adreno的
local work size
设置。上调试发现local work size
默认的NullRange
和我设置的{16, 16}
,跑mobilentv1性能没啥差别,需要进一步查看,可以搜搜; - 交换
global work size
的三个值的排布顺序试试看性能变化,搜一下;
Buffer Vs. Image
Adreno GPU SDK - FAQs - Qualcomm Developer Network
https://developer.qualcomm.com/software/adreno-gpu-sdk/faq
What is included in the Adreno SDK for OpenCL?
This SDK includes usage examples for Qualcomm Technologies extensions to OpenCL including:
- Built-in extension functions
- Compressed images
- Efficient convolution with and without new built-in extension functions.
- Variety of kernels using vector read and write operations for given image formats
What is new in the Adreno SDK for OpenCL v1.2?
The OpenCL SDK version 1.2 contains many new examples, including:
- Bayer/MIPI image examples
- Matrix multiplication examples, and some other matrix operations
- Fast Fourier-transform (FFT) examples
OpenCL Optimization from Qualcomm
- Matrix Multiply on Adreno GPUs – Part 1: OpenCL Optimization - Qualcomm Developer Network
https://developer.qualcomm.com/blog/matrix-multiply-adreno-gpus-part-1-opencl-optimization - Matrix Multiply on Adreno GPUs – Part 2: Host Code and Kernel - Qualcomm Developer Network
https://developer.qualcomm.com/blog/matrix-multiply-adreno-gpus-part-2-host-code-and-kernel - OpenCL Optimization: Accelerating the Epsilon Filter on the Adreno GPU - Qualcomm Developer Network
https://developer.qualcomm.com/blog/opencl-optimization-accelerating-epsilon-filter-adreno-gpu - OpenCL Optimization: Accelerating the Sobel Filter on Adreno GPU - Qualcomm Developer Network
https://developer.qualcomm.com/blog/opencl-optimization-accelerating-sobel-filter-adreno-gpu - OpenCL Optimization: Stop Leaving Compute Cycles on the Table - Qualcomm Developer Network
https://developer.qualcomm.com/blog/opencl-optimization-stop-leaving-compute-cycles-table - Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420 - Qualcomm Developer Network
https://developer.qualcomm.com/download/adrenosdk/bdti_adreno_white_paper.pdf
OpenCL Ref. from Qualcomm
- [Software] Adreno GPU SDK - Tools - Qualcomm Developer Network
https://developer.qualcomm.com/software/adreno-gpu-sdk/tools - [Software] Snapdragon Profiler - Qualcomm Developer Network
https://developer.qualcomm.com/software/snapdragon-profiler - [Docs] Qualcomm Snapdragon Mobile Platform OpenCL General Programming and Optimization Guide
https://developer.qualcomm.com/download/adrenosdk/adreno-opencl-programming-guide.pdf?referrer=node/6114 - [Docs] Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420
Implementing Computer Vision Functions with OpenCL on the Qualcomm Adreno 420 - [Developer Forums] Adreno GPU SDK Forum
https://developer.qualcomm.com/forums/qdevnet-forums/mobile-gaming-graphics-optimization-adreno
Hardware Tutorials
Adreno GPU SDK - Tutorial Videos - Qualcomm Developer Network
https://developer.qualcomm.com/software/adreno-gpu-sdk/tutorial-videos
-
Adreno Hardware Tutorial 1: Graphics Pipeline Overview
Watch an overview of the mobile Graphics Pipeline which describes the data flow and processing of graphics data from software to hardware. The APIs available on mobile devices are also covered. -
Adreno Hardware Tutorial 2: Introduction to the Adreno GPU
The Adreno Graphics Processing Unit (GPU) is designed to bring console and PC quality 3D games to mobile devices. This video introduces the viewer to the Adreno GPU, some of its key features, and how it relates to the Qualcomm Snapdragon processor. -
Adreno Hardware Tutorial 3: Tile Based Rendering
This video investigates the commonly used technique of Tile Based Rendering on mobile devices by focusing specifically on Adreno GPUs using a toon shading sample. -
Adreno Hardware Tutorial 4: Detecting Adreno GPU
This video demonstrates a simple way to detect Adreno GPUs using C++ (Native) code.
Others
- Snapdragon Power Optimization SDK - Qualcomm Developer Network
https://developer.qualcomm.com/software/snapdragon-power-optimization-sdk - Qualcomm Adreno GPU 如何获得更好的OpenCL性能——内存优化篇-CSDN.NET
https://www.csdn.net/article/a/2016-06-09/15838600 - Adreno GPU 矩阵乘法——第1讲:OpenCL优化-CSDN.NET
https://www.csdn.net/article/a/2016-10-25/15841503
OpenCL Tips · yszheda/wiki Wiki
https://github.com/yszheda/wiki/wiki/OpenCL-Tips
Sub-optimal performance on Qualcomm Adreno GPUs · Issue #228 · CNugteren/CLBlast
CNugteren/CLBlast#228
Float16 GEMM on Adreno 330 · Issue #181 · CNugteren/CLBlast
CNugteren/CLBlast#181
do not have a certain result of float16
local work size和work group size
Opencl global work size vs local work size
In both cases the global size is 1024. In case 1, the local size is 128 and this results in an execution partition that creates 8 work-groups, each of which will iterate through 128 work-items. In case 2, the local size is changed to 256 and this results in 4 work-groups, each with 256 work-items.
Understanding Kernels, Work-groups and Work-items — TI OpenCL User's Guide
https://downloads.ti.com/mctools/esd/docs/opencl/execution/kernels-workgroups-workitems.html
double OpenCLRuntime::getCostTime(const cl::Event *event){
mCommandQueuePtr->finish();
mStartNanos = event->getProfilingInfo<CL_PROFILING_COMMAND_START>();
mStopNanos = event->getProfilingInfo<CL_PROFILING_COMMAND_END>();
return (mStopNanos - mStartNanos) / 1000000.0;
}
double OpenCLRuntime::getQueuedTime(const cl::Event *event){
mCommandQueuePtr->finish();
return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() - event->getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>()) / 1000000.0;
}
double OpenCLRuntime::getSubmitTime(const cl::Event *event){
mCommandQueuePtr->finish();
return (event->getProfilingInfo<CL_PROFILING_COMMAND_START>() - event->getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>()) / 1000000.0;
}