intel / idlf

Intel® Deep Learning Framework

error:regex_error

sunshineatnoon opened this issue · comments

When I try to run the demo using the command below:
./visual_cloud_demo --config=gpu_caffenet.config
I got this error:
error: regex_error
My gpu_caffenet.config is as follows:
--model=caffenet_float
--device=device_gpu
--batch=32
--input=/home/images/
What does 'regex_error' mean?
Thanks!

Are you using the latest code? I believe this issue was fixed.

Yes, I downloaded the code from Github today.

GCC versions below 4.9 have a bug/missing functionality in their C++11 regex implementation.
We have already fixed this problem and the fix will appear along with the published changes in the coming days.

Thanks. If I update my GCC to 4.9, will this problem disappear?

Yes, the problem should disappear.
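For background, libstdc++ as shipped with GCC before 4.9 declares the &lt;regex&gt; interfaces but does not really implement them, so even trivial use of std::regex typically throws std::regex_error at runtime. A minimal illustration (not taken from the idlf sources; the pattern is made up):

#include <iostream>
#include <regex>
#include <string>

int main() {
    try {
        // Hypothetical pattern, similar in spirit to parsing a config option.
        std::regex option("--model=(\\w+)");
        std::smatch match;
        std::string arg = "--model=caffenet_float";
        if (std::regex_match(arg, match, option))
            std::cout << "model: " << match[1] << std::endl;  // printed on GCC >= 4.9
    } catch (const std::regex_error&) {
        std::cerr << "error: regex_error" << std::endl;       // typical outcome on GCC 4.8
    }
    return 0;
}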

I updated GCC to 4.9.2 and then rebuilt the framework. When I run make in /idlf-master/intel_visual_cloud_node/UnixMk/DebugULT/, I get a failure like this and the build gets stuck:

compiling convolving kernel: generic_convolve
reusing existing generic_convolve kernel
GPU Efficiency for generic_convolve(19) ( assuming GPU clock fixed to 1200 MHz ): inf %, 0.000[ms] sum: 0.00[ms]
GPU Efficiency for generic_convolve(20) ( assuming GPU clock fixed to 1200 MHz ): inf %, 0.000[ms] sum: 0.00[ms]
/home/Downloads/idlf-master/intel_visual_cloud_node/unit_tests/ult_gpu/test_cases/gpu_device_workflow_interface_0_functions.cpp:2676: Failure
Value of: verify_output( execute_outputs[0], cpu_outputs )
Actual: false
Expected: true
[ FAILED ] gpu_device_workflow_interface_0.multi_pass_interface_test (442 ms)
[ RUN ] gpu_device_workflow_interface_0.full_api_usage_test
Device: Intel(R) HD Graphics
vendor: Intel(R) Corporation
type: 4
extensions: cl_intel_accelerator cl_intel_advanced_motion_estimation cl_intel_motion_estimation cl_intel_subgroups cl_intel_va_api_media_sharing cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_icd cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_spir
device version: OpenCL 1.2
driver version: 16.4.2.1.39163
preferred vector width float: 1
global memory size: 1709598311
clock: 1200
compute_units: 20
local memory size: 65536
constant memory size: 65536
max_work_item_sizes: 512 x 512 x 512
max_work_group_size: 512
max_buffer_size: 427399577
availability: 1
max image2D width: 16384
max image2D height: 16384
compiling convolving kernel: generic_convolve
reusing existing generic_convolve kernel
GPU Efficiency for generic_convolve(21) ( assuming GPU clock fixed to 1200 MHz ): inf %, 0.000[ms] sum: 0.00[ms]
GPU Efficiency for generic_convolve(22) ( assuming GPU clock fixed to 1200 MHz ): inf %, 0.000[ms] sum: 0.00[ms]
/home/Downloads/idlf-master/intel_visual_cloud_node/unit_tests/ult_gpu/test_cases/gpu_device_workflow_interface_0_functions.cpp:2676: Failure
Value of: verify_output( execute_outputs[0], cpu_outputs )
Actual: false
Expected: true
compiling convolving kernel: generic_convolve
reusing existing generic_convolve kernel

Do you know how to solve this error?

Hi,
Could you provide more information about your system, such as the distribution name and the exact Linux kernel version?
Are you using the root account or a regular user?
Also, if you could run the ult_gpu application through strace and then provide the entire strace log, it might help us figure out where the problem is located. (ult_gpu should be located in DebugULT/bin.)

Hi, Thanks for your reply.

  1. Here is the information about my centos:
    Linux version 3.10.0-229.7.2.el7.x86_64 (builder@kbuilder.dev.centos.org) (gcc version 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC) ) #1 SMP Tue Jun 23 22:06:11 UTC 2015
  2. Here is the information about gcc:
    Using built-in specs.
    COLLECT_GCC=g++
    COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-unknown-linux-gnu/4.9.2/lto-wrapper
    Target: x86_64-unknown-linux-gnu
    Configured with: ./configure --disable-multilib --enable-languages=c,c++
    Thread model: posix
    gcc version 4.9.2 (GCC)
  3. I've tried running make under /idlf-master/intel_visual_cloud_node/UnixMk/DebugULT/ both as a sudo user and as a regular user; it gave me the same error as above.
  4. I tried this command: "strace -o ~/Downloads/log.txt ./ult_gpu"
    The output log.txt file is too long, so I put it in a gist: https://gist.github.com/sunshineatnoon/c515817e4682bbf2d31d

Thank you for providing more details. I forwarded all the information to our OpenCL experts and I'll let you know when I get a response.
Meanwhile, I was informed that there is a setup_inspector.py script included in Intel® Media Server Studio, also in the Community Edition.
https://registrationcenter.intel.com/RegCenter/comform.aspx?productid=2411

This script verifies whether there are any limitations/gaps in your system that may prevent OpenCL applications from running correctly.
Here is an example of output from setup_inspector.py:

./setup_inspector.py
Starting setup inspector version 1.0.3
[ INFO ] Linux Kernel matches supported configuration - 3.10
[ INFO ] Render nodes are not enabled in the system
[ TIP ] Add kernel cmd line param - drm.rnodes=1 for va_media_sharing extension to be functional
[ INFO ] 64 bit archtecture supported
[ INFO ] Error: i915 module is not loaded
[ TIP ] Check if you have installed and booted correct kernel with installed i915.ko module from MSDK package
[ INFO ] Error: Linux Kernel i915 module does not contain vmap/userptr functionality
[ TIP ] Install matching i915.ko module from MSDK package
[ INFO ] libIntelOpenCL found in package - /opt/intel/opencl/libIntelOpenCL.so
[ INFO ] libIntelOpenCL found in package - /opt/intel/opencl/libIntelOpenCL.so.16
[ INFO ] libIntelOpenCL found in package - /opt/intel/opencl/libIntelOpenCL.so.16.4.0
[ INFO ] intel.icd contains icd entry: /opt/intel/opencl/libIntelOpenCL.so
[ INFO ] No nomodeset in kernel startup
[ INFO ] System has supported gcc version - 4.8.2
[ INFO ] System has supported glibc version - 2.17
[ INFO ] Required ocl package is installed in the system
[ INFO ] Linux OpenCL is NOT installed properly on this platform. Please rerun this script in verbose mode or follow TIPS to fix errors
[ FAILED ]

Also, please run the script with our test application:

./setup_inspector.py --debug=./ult_gpu
Starting setup inspector version 1.0.3
NN_Device_GPU ERROR[/mnt/android/intel_visual_cloud_node/devices/device_gpu/core/toolkit_opencl.cpp:113]: Error: No GPU OpenCL devices found!
-1

[ INFO ] Command executed. Parsing strace.log
[ TIP ] Execution log can be found in strace.log
[ TIP ] Errors listed in error.log

Thanks for your reply.
I have MediaServerStudioProfessional2015R6 installed, but I can't find setup_inspector.py; I will try to look for it again.
Meanwhile, I think that when the program crashes and gets stuck, it may keep allocating memory and eat up all the memory on the GPU. Now my computer reports a CL_OUT_OF_RESOURCES error in clEnqueueReadBuffer for any OpenCL program. I don't know for sure that this is caused by the make crash, but you may want to check for a memory leak as well.

Here's the path where the setup_inspector.py is located in the installation package:
MediaServerStudioEssentials2015R6 / SDK2015Production16.4.2.1.tar.gz / CentOS / intel-opencl-1.2-16.4.tar.gz

I'm not sure where it lands after installing the package in the system.

Did you follow exactly all the installation steps described in media_server_studio_getting_started_guide.pdf? It looks like your kernel doesn't have all the required patches. One thing that is probably incorrect is this option missing from your kernel config: CONFIG_MMU_NOTIFIER=y

If setup_inspector doesn't report any error then your setup should be OK.

Could you upload the entire log from running ./ult_gpu (everything, not only the fragment where the errors start showing up)?

Also, did you try to run ./visual_cloud_demo again? Those errors during the build suggest that something is wrong with running our test application (ult_gpu). We'll have a look at it, but meanwhile you could try running visual_cloud_demo anyway.

Sorry for the late reply; I was stuck on the "CL_OUT_OF_RESOURCES" error for any OpenCL program and had to reinstall a fresh CentOS system on my computer. I noticed there is a new release; does this one support GCC 4.8? If so, I will try it instead. Also, there is no intel_visual_cloud_node directory in the newly released version, but the instruction in devices/doc tells me to go to this directory.

Yes, support for GCC 4.8 was added with the new release.
Thank you for pointing out that the instruction is outdated - we'll take care of it. The scripts that create makefiles are currently located in the main directory.

In the old version, there were two scripts that created makefiles for the demo and the model, but the new version only has one script to create makefiles. I am confused about what this makefile is for and how to build the model and the demo.

This one script creates all the necessary makefiles for the demo and the devices. After running the script, just go to UnixMk/Debug (or UnixMk/Release or UnixMk/DebugULT), type "make", and all the binaries will be built...

After running make in UnixMk/Release, how do I run the demo?

I was told that this instruction is a good reference although it also needs some changes:
https://github.com/01org/idlf/blob/master/demo/doc/instruction.txt

Anyway, you have all the binaries in the "bin" directory, which is now located on the same level as your UnixMk. On my machine this command worked (run from the bin directory):

./demo/device/gcc/Release/demo_device --config=cpu_caffenet.config

Remember to put weights in your "bin/weights_caffenet" directory. You'll find the weights here: https://01.org/intel-deep-learning-framework/downloads

cpu_caffenet.config doesn't use the GPU; it runs on the CPU, but a CPU that supports AVX2 is required.

So I also don't have to run make install, right? Because when I try to run make install under UnixMk/Release, it gives me "make: *** No rule to make target `install'. Stop."

Exactly, no installation is required or (at this point) supported.

Finally I can successfully test caffenet on a single image, thanks for your help!
Here is the result I got:
[screenshot of the demo timing output]
What do these two times and ticks mean? I only had one image in the batch, and since testing only needs a feedforward pass, why doesn't the batch time equal the per-single-image time?

Also, when I try to run on cpu it gives me this error:
error: Enable multithreading to use std::thread: Operation not permitted

Those times are different because the batch parameter is set to 32 in your config file. Even though you have only one image in your input directory, the demo allocates a buffer for 32 images and the entire buffer is processed as if there were 32 images in it.
Set batch=1 and those two times will be equal.
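For reference, a batch=1 run would use a config like this (assuming the same model, device and input path as in the original gpu_caffenet.config):

--model=caffenet_float
--device=device_gpu
--batch=1
--input=/home/images/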

I don't know what causes that multithreading error, I'll ask someone else to help you.

Thanks, so "per single image time" indicates the time it takes only for the feedforward pass of a single image?
I also think it is strange that a single image in a batch of 32 takes 196ms, while a full batch of 32 images also takes about 196ms in total. It seems as if the feedforward pass of an image takes no time and all the time is spent on something else.

"per single image time" is average time per image. The entire batch took 196ms to process, so time for one image is 196/32.
If you set batch to 32, then the network will process images in batches of 32, no matter how many images there are in the input directory. Even if you have just 1 image, the network will be triggered to process a batch of 32 images. We don't adjust the batch size according to the number of images in the input directory. At least not in the current implementation...

I am still confused: does the network just measure the time for a batch and then divide it by the batch size to get the per-image time? If so, does this framework make any image copies when the number of images is less than the batch size? Because if it doesn't, then even if the network is triggered to process a batch of 32 images, it doesn't have 32 images to feed to the network.
I also tried different batch sizes, and it seems the per-image time decreases as the batch size increases. However, the per-image time stops decreasing when the batch size is equal to or bigger than 32. Does this mean that my computer supports at most 32 work items (kernels) running in parallel at the same time, or is this number set by your program?

I am still confused: does the network just measure the time for a batch and then divide it by the batch size to get the per-image time?

Yes.

If so, does this framework make any image copies when the number of images is less than the batch size?

No, the rest of the buffer of images is zeroed if you have fewer images than the batch size. So if you have just one image and the batch is 32, then the buffer is completed with 31 black images (all pixels have value 0).
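As a rough sketch of that padding behaviour (illustrative C++ only, not the actual idlf implementation; the function name and sizes are made up):

#include <algorithm>
#include <cstddef>
#include <vector>

// Build a buffer for batch_size images; slots without a real image stay
// zero-filled, i.e. they behave like black images.
std::vector<float> make_batch(const std::vector<std::vector<float>>& images,
                              std::size_t batch_size,
                              std::size_t image_elements)
{
    std::vector<float> buffer(batch_size * image_elements, 0.0f);  // zero-initialised
    for (std::size_t i = 0; i < images.size() && i < batch_size; ++i)
        std::copy(images[i].begin(), images[i].end(),
                  buffer.begin() + i * image_elements);
    return buffer;  // e.g. 1 real image + 31 all-zero images for batch=32
}

int main()
{
    const std::size_t elems = 227 * 227 * 3;  // caffenet-sized input assumed
    std::vector<std::vector<float>> images(1, std::vector<float>(elems, 0.5f));
    std::vector<float> batch = make_batch(images, 32, elems);
    return batch.size() == 32 * elems ? 0 : 1;
}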

I also tried different batch sizes, and it seems the per-image time decreases as the batch size increases.

That's why batching is used - to get better "per image" performance.

However, the per-image time stops decreasing when the batch size is equal to or bigger than 32. Does this mean that my computer supports at most 32 work items (kernels) running in parallel at the same time, or is this number set by your program?

Batching helps achieve better utilization of the hardware's compute capabilities when you are limited by the available memory bandwidth. But at some point increasing the batch size doesn't increase performance any further, because the hardware's utilization is already close to its potential maximum.

Thanks for your patience and explanation, that really makes things clear now!

I'm sorry to bother you, but did you solve this error: "error: Enable multithreading to use std::thread: Operation not permitted"? I googled it and found that I might solve it by adding -pthread to some compile commands. Any suggestions?

Sorry for the late response. It seems that the issue you are observing should be fixed by the "-pthread" compilation/linking option. However, I cannot reproduce your observation on any platform I have access to, so I would like to ask you to verify it.

Please put the following changes into VisualCloud/device/cpu/CMakeLists.txt:79

if(UNIX)
    set_target_properties(${TARGET_NAME} PROPERTIES COMPILE_FLAGS -pthread)
    target_link_libraries(${TARGET_NAME} -pthread)
endif(UNIX)

Those lines add -pthread for the compilation as well as the linking of device_cpu. Please regenerate the makefiles (i.e. create the makefiles again), then build and test whether this sorts the issue out. Let us know about the results you get.

@yetanotherdeveloper I have a feeling that I might have the honor of meeting all your team members :). Thanks for your patience!
Unfortunately, I tried another solution and added these two lines to the CMakeLists:
SET(CMAKE_CXX_FLAGS_DEBUG "-pthread")
SET(CMAKE_CXX_FLAGS_RELEASE "-pthread")
This was a terrible idea, because now my computer reports a CL_OUT_OF_RESOURCES error for any OpenCL program.
This is the second time I have run into this problem after idlf failed and crashed while running the make command in /idlf-master/UnixMk/DebugULT. The first time was when I upgraded GCC to 4.9 and tried to make under /idlf-master/UnixMk/DebugULT. I think this might be a potential bug. Last time I could only fix the problem by reinstalling my CentOS and the whole OpenCL environment.

Also, there is no directory called VisualCloud in the current version; I suppose you mean idlf-master/device/cpu/CMakeLists.txt?

I tried to change the CMakeLists.txt in idlf-master/device/cpu/; it now looks like this:

# Copyright (c) 2014, Intel Corporation
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are met:
# 
#     * Redistributions of source code must retain the above copyright notice,
#       this list of conditions and the following disclaimer.
#     * Redistributions in binary form must reproduce the above copyright
#       notice, this list of conditions and the following disclaimer in the
#       documentation and/or other materials provided with the distribution.
#     * Neither the name of Intel Corporation nor the names of its contributors
#       may be used to endorse or promote products derived from this software
#       without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
# CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


set (DEVICE_NAME cpu)
set (TARGET_NAME device_${DEVICE_NAME})

project (${DEVICE_NAME})


# Source files
file (GLOB CORE_SRC
      "core/*.cpp"
      "core/*.h")

file (GLOB CORE_FIXEDPOINT_SRC
      "core/fixedpoint/*.cpp"
      "core/fixedpoint/*.h")

file (GLOB API_INTERNAL_SRC
      "api_internal/*.cpp" 
      "api_internal/*.h"
      )

file (GLOB DEVICE_CPU_COMMON_SRC
      "../common/nn_device_internal.h"
      "../common/nn_device_interface_0_internal_common.cpp"
      "../common/nn_workload_data.cpp"
      "../common/nn_workload_data.h")


file (GLOB DEVICE_API
      "../api/*.h")

# Create named folders for the sources within the .vcproj
# Empty name lists them directly under the .vcproj

source_group("core" FILES ${CORE_SRC})
source_group("core\\fixedpoint" FILES ${CORE_FIXEDPOINT_SRC})
source_group("api_internal" FILES ${API_INTERNAL_SRC})
source_group("api" FILES ${DEVICE_API})
source_group("common" FILES ${DEVICE_CPU_COMMON_SRC})


# Create .dll/.so in Release/Debug and .lib in DebugULT
if( CMAKE_BUILD_TYPE STREQUAL "Release" )
    add_library(${TARGET_NAME} MODULE ${CORE_SRC} ${CORE_FIXEDPOINT_SRC} ${API_INTERNAL_SRC} ${DEVICE_API} ${DEVICE_CPU_COMMON_SRC})
elseif( CMAKE_BUILD_TYPE STREQUAL "Debug" )
    add_library(${TARGET_NAME} MODULE ${CORE_SRC} ${CORE_FIXEDPOINT_SRC} ${API_INTERNAL_SRC} ${DEVICE_API} ${DEVICE_CPU_COMMON_SRC})
elseif( CMAKE_BUILD_TYPE STREQUAL "DebugULT" )
    add_library(${TARGET_NAME} STATIC ${CORE_SRC} ${CORE_FIXEDPOINT_SRC} ${API_INTERNAL_SRC} ${DEVICE_API} ${DEVICE_CPU_COMMON_SRC})
else()
    message("Unknown configuration")
endif()

target_link_libraries(${TARGET_NAME})

if(UNIX)
    set_target_properties(${TARGET_NAME} PROPERTIES COMPILE_FLAGS -pthread )
    target_link_libraries(${TARGET_NAME} -pthread)
endif(UNIX)

set_target_properties( ${TARGET_NAME}
                       PROPERTIES
                       LIBRARY_OUTPUT_DIRECTORY "${CMAKE_LIBRARY_OUTPUT_DIRECTORY}/device/${DEVICE_NAME}/${COMPILER_STR}${BUILD_TYPE}"
                     )
set_target_properties( ${TARGET_NAME}
                       PROPERTIES 
                       FOLDER device
                     )

set_target_properties(${TARGET_NAME} 
                      PROPERTIES 
                      PROJECT_LABEL ${DEVICE_NAME}) 

set (POST_BUILD_WORKING_DIRECTORY ${CMAKE_LIBRARY_OUTPUT_DIRECTORY}/device/${DEVICE_NAME}/${COMPILER_STR}${BUILD_TYPE}/${CMAKE_BUILD_TYPE}/)

if ((NOT UNIX) AND (NOT (${CMAKE_BUILD_TYPE} STREQUAL "DebugULT")))
  if (${CMAKE_BUILD_TYPE} STREQUAL "Release")
  add_custom_command(TARGET ${TARGET_NAME}
                     POST_BUILD
                     WORKING_DIRECTORY ${POST_BUILD_WORKING_DIRECTORY}
                     COMMAND "${CMAKE_COMMAND}" -E copy ${TARGET_NAME}.dll ${RUNTIME_BIN_DIRECTORY}
                     )
  else()
  add_custom_command(TARGET ${TARGET_NAME}
                     POST_BUILD
                     WORKING_DIRECTORY ${POST_BUILD_WORKING_DIRECTORY}
                     COMMAND "${CMAKE_COMMAND}" -E copy ${TARGET_NAME}.dll ${RUNTIME_BIN_DIRECTORY}
                     COMMAND "${CMAKE_COMMAND}" -E copy ${TARGET_NAME}.pdb ${RUNTIME_BIN_DIRECTORY}
                     )
  endif()
endif()

Then I regenerated the makefiles, built and tested (all tests passed), and then ran the CPU mode with

./demo/device/gcc/Release/demo_device --config=cpu_caffenet.config

it still gives me the same error: "error: Enable multithreading to use std::thread: Operation not permitted". The GPU mode can be run successfully.

I found a solution: http://stackoverflow.com/questions/20568235/using-c11-multithreading-in-shared-library-loaded-by-programm-without-thread-s
By exporting LD_PRELOAD=/lib64/libpthread.so.0, I can now run the CPU mode.

Yet I found another problem: when the batch size is one, the CPU mode works well and gives me this result for a single image:
Recognitions:

14.8% [n03877472 ] pajama, pyjama, pj's, jammies
7.2% [n04317175 ] stethoscope
6.6% [n04099969 ] rocking chair, rocker
4.7% [n04597913 ] wooden spoon
3.4% [n07615774 ] ice lolly, lolly, lollipop, popsicle

However, when the batch size is equal to or bigger than two, it gives me this recognition result for the same image (actually for all images):
Recognitions:

0.0% [n01440764 ] tench, Tinca tinca
0.0% [n01440764 ] tench, Tinca tinca
0.0% [n01440764 ] tench, Tinca tinca
0.0% [n01440764 ] tench, Tinca tinca
0.0% [n01440764 ] tench, Tinca tinca

It seems only batch size = 1 or 48 works. Why?
The GPU mode works fine for batch sizes greater than 2.

Hi, thanks for your feedback, in particular about how to get the CPU device running.
As for your question about supported batches: as mentioned before, using batch mode speeds up processing, helps to hide latency, etc. As you noticed, performance results improve when increasing the batch, up to some specific value. This value is totally implementation dependent, e.g. the CPU device
currently supports batch sizes 1, 8 and 48, where 48 performs best. The GPU device supports 1, 8, 16 and 32, where batch 32 performs best. If you choose a different batch size, you may experience low performance or, in some cases, faulty behavior.

But why the specific numbers 1, 8 and 48? When testing on GPU, any batch size works fine, but the CPU can only take batch sizes 1, 8 and 48. Is this limited by something like the way the images are parallelized?
This is not a big deal, I am just curious :)

To be more precise, the CPU device uses AVX2 intrinsics (Intel Advanced Vector Extensions 2), which were introduced with the Haswell processors. This extension introduces a number of SSE-like instructions (multiplications, additions, dot products of vectors). New YMM registers are also introduced; there is a limited number of them, so the implementation takes advantage of this very limited asset. Please look into the source for more details.
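As an illustration only (not code from the idlf sources), this is the kind of 8-wide fused multiply-add on YMM registers that AVX2/FMA provides, which is one reason batch sizes that are multiples of the vector width map well onto the hardware. Compile with -mavx2 -mfma on a Haswell-or-later CPU:

#include <immintrin.h>
#include <cstdio>

int main()
{
    alignas(32) float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    alignas(32) float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    alignas(32) float c[8] = {0};

    __m256 va = _mm256_load_ps(a);       // load 8 floats into a YMM register
    __m256 vb = _mm256_load_ps(b);
    __m256 vc = _mm256_load_ps(c);
    vc = _mm256_fmadd_ps(va, vb, vc);    // c = a * b + c, 8 lanes in one instruction
    _mm256_store_ps(c, vc);

    for (float x : c)
        std::printf("%.1f ", x);         // 8.0 14.0 18.0 20.0 20.0 18.0 14.0 8.0
    std::printf("\n");
    return 0;
}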

@yetanotherdeveloper Thanks for your detailed explanation, I will check the source.

Hi, I tested caffenet and the time cost to classify an image is about 56ms on CPU and 67ms on GPU, which is crazy fast compared to other frameworks' test results (the fastest I ever saw was about 20ms for a single image classification). Do you get similar results? Or did I make some mistake and get the wrong times?

It's OK, we aimed at "crazy fast" ;-)
Your results are not surprising. The results depend on your CPU core count, CPU clock speed and memory bandwidth. In the case of the GPU, the results depend on the GPU model, the number of execution units and the GPU frequency.

I want to test the efficiency of a single layer, so I ran in debug mode and got these outputs:

GPU Efficiency for                arithmetic ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.532[ms]  sum: 1.53[ms]
GPU Efficiency for                   pooling ( assuming GPU clock fixed to 0 MHz ):  -nan %,  1.590[ms]  sum: 3.12[ms]
GPU Efficiency for    convolve_AlexNet_C1(1) ( assuming GPU clock fixed to 0 MHz ):  inf %,  15.258[ms]  sum: 18.38[ms]
GPU Efficiency for       convolve_simd16(2") ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.566[ms]  sum: 19.95[ms]
GPU Efficiency for       convolve_simd16(2') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.419[ms]  sum: 21.36[ms]
GPU Efficiency for        convolve_simd16(2) ( assuming GPU clock fixed to 0 MHz ):  inf %,  9.673[ms]  sum: 31.04[ms]
GPU Efficiency for             normalization ( assuming GPU clock fixed to 0 MHz ):  -nan %,  0.716[ms]  sum: 31.75[ms]

What do these numbers mean? Is this the time for a layer or for a single convolution operation? Also, why are there 8 numbered convolution kernels and 3 different convolutions per number? I remember AlexNet only has 5 convolution layers.

Those are the execution times of the OpenCL kernels. The last column sums up all the kernel times, so the total time is in the last such line.
Sometimes there is one kernel per layer, sometimes there are more.
We use those results to calculate the efficiency of the kernels. On your machine something went wrong with reading the GPU model and/or frequency. At some point in the Debug version we print all the information that we can get about the GPU and OpenCL - you may want to check what was read incorrectly.

Then why are there 8 numbered convolution kernels and 3 kernels per convolution? Is it possible for me to know how much time each layer needs, so that I can compare this network to other networks? Something like this:

conv1   forward: 3.30445 ms.
pool1   forward: 1.641 ms.
norm1   forward: 3.564 ms.

Yes, it's possible to know it.
You must remember that in this topology convolutions 2, 4 and 5 are split into two parts. So here's the mapping:
conv1 - kernel 1
conv2 - 1st part: kernels 2,2',2''. 2nd part: kernels 3, 3',3''
conv3 - kernels 4,4',4''

and so on, as far as I remember.
Sometimes there are 3 kernels per layer (or per one part of a layer) to improve efficiency. There's no point in explaining this unless you dive very deep into the OpenCL code...

So if I get something like this:

GPU Efficiency for    convolve_AlexNet_C1(1) ( assuming GPU clock fixed to 0 MHz ):  inf %,  15.258[ms]  sum: 18.38[ms]
GPU Efficiency for       convolve_simd16(2") ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.566[ms]  sum: 19.95[ms]
GPU Efficiency for       convolve_simd16(2') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.419[ms]  sum: 21.36[ms]
GPU Efficiency for        convolve_simd16(2) ( assuming GPU clock fixed to 0 MHz ):  inf %,  9.673[ms]  sum: 31.04[ms]

Then the first convolution layer costs 15.258ms, right?
And the second convolution layer costs 1.566 + 1.419 + 9.673?
There are five convolution layers in AlexNet; where can I find the rest of the mapping?

The second convolution has two parts; you'll also have to add kernels 3, 3' and 3'' to get the total time of convolution 2.
If you post the entire "GPU Efficiency" log here, I'll tell you what is what.

Thanks!
This is the log file:

reusing existing convolve_simd8 kernel
GPU Efficiency for                arithmetic ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.532[ms]  sum: 1.53[ms]
GPU Efficiency for                   pooling ( assuming GPU clock fixed to 0 MHz ):  -nan %,  1.590[ms]  sum: 3.12[ms]
GPU Efficiency for    convolve_AlexNet_C1(1) ( assuming GPU clock fixed to 0 MHz ):  inf %,  15.258[ms]  sum: 18.38[ms]
GPU Efficiency for       convolve_simd16(2") ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.566[ms]  sum: 19.95[ms]
GPU Efficiency for       convolve_simd16(2') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.419[ms]  sum: 21.36[ms]
GPU Efficiency for        convolve_simd16(2) ( assuming GPU clock fixed to 0 MHz ):  inf %,  9.673[ms]  sum: 31.04[ms]
GPU Efficiency for             normalization ( assuming GPU clock fixed to 0 MHz ):  -nan %,  0.716[ms]  sum: 31.75[ms]
reusing existing convolve_simd8 kernel
reusing existing convolve_simd8 kernel
entering prepare_conv_kernel
reusing existing convolve_simd8 kernel
GPU Efficiency for        convolve_simd16(3) ( assuming GPU clock fixed to 0 MHz ):  inf %,  9.486[ms]  sum: 41.24[ms]
GPU Efficiency for       convolve_simd16(3') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.405[ms]  sum: 42.65[ms]
GPU Efficiency for       convolve_simd16(3") ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.567[ms]  sum: 44.21[ms]
GPU Efficiency for                   pooling ( assuming GPU clock fixed to 0 MHz ):  -nan %,  0.965[ms]  sum: 45.18[ms]
GPU Efficiency for             normalization ( assuming GPU clock fixed to 0 MHz ):  -nan %,  0.431[ms]  sum: 45.61[ms]
GPU Efficiency for         convolve_simd8(4) ( assuming GPU clock fixed to 0 MHz ):  inf %,  5.377[ms]  sum: 50.98[ms]
GPU Efficiency for        convolve_simd8(4') ( assuming GPU clock fixed to 0 MHz ):  inf %,  4.439[ms]  sum: 55.42[ms]
GPU Efficiency for        convolve_simd8(4") ( assuming GPU clock fixed to 0 MHz ):  inf %,  8.882[ms]  sum: 64.31[ms]
reusing existing convolve_simd8 kernel
reusing existing pooling kernel
reusing existing fully_connected_8x8 kernel
reusing existing fully_connected_8x8 kernel
reusing existing fully_connected_8x8 kernel
GPU Efficiency for         convolve_simd8(5) ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.987[ms]  sum: 66.29[ms]
GPU Efficiency for        convolve_simd8(7') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.141[ms]  sum: 67.43[ms]
GPU Efficiency for        convolve_simd8(6') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.748[ms]  sum: 69.18[ms]
GPU Efficiency for        convolve_simd8(6") ( assuming GPU clock fixed to 0 MHz ):  inf %,  3.884[ms]  sum: 73.07[ms]
GPU Efficiency for        convolve_simd8(5') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.785[ms]  sum: 74.85[ms]
GPU Efficiency for         convolve_simd8(6) ( assuming GPU clock fixed to 0 MHz ):  inf %,  2.055[ms]  sum: 76.90[ms]
GPU Efficiency for         convolve_simd8(7) ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.350[ms]  sum: 78.25[ms]
GPU Efficiency for        convolve_simd8(5") ( assuming GPU clock fixed to 0 MHz ):  inf %,  3.375[ms]  sum: 81.63[ms]
reusing existing softmax kernel
GPU Efficiency for        convolve_simd8(7") ( assuming GPU clock fixed to 0 MHz ):  inf %,  2.215[ms]  sum: 83.84[ms]
GPU Efficiency for       fully_connected_8x8 ( assuming GPU clock fixed to 0 MHz ):  inf %,  0.909[ms]  sum: 84.75[ms]
GPU Efficiency for        convolve_simd8(8") ( assuming GPU clock fixed to 0 MHz ):  inf %,  2.271[ms]  sum: 87.02[ms]
GPU Efficiency for       fully_connected_8x8 ( assuming GPU clock fixed to 0 MHz ):  inf %,  8.311[ms]  sum: 95.34[ms]
GPU Efficiency for                   pooling ( assuming GPU clock fixed to 0 MHz ):  -nan %,  0.270[ms]  sum: 95.61[ms]
GPU Efficiency for       fully_connected_8x8 ( assuming GPU clock fixed to 0 MHz ):  inf %,  3.596[ms]  sum: 99.20[ms]
GPU Efficiency for         convolve_simd8(8) ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.351[ms]  sum: 100.55[ms]
GPU Efficiency for        convolve_simd8(8') ( assuming GPU clock fixed to 0 MHz ):  inf %,  1.147[ms]  sum: 101.70[ms]
GPU Efficiency for                   softmax ( assuming GPU clock fixed to 0 MHz ):  -nan %,  0.018[ms]  sum: 101.72[ms]

It seems that the CPU debug mode doesn't have any output about the time for each layer. Is it possible for me to measure the time for each layer in CPU mode?

@jreniecki Hi~ Sorry to bother you, but could you please help me figure out the time for each layer, and whether the time calculated above is for a batch or for a single image?

 Convolution 1
       convolve_AlexNet_C1(1)  15.258[ms]  

 Convolution 2
       convolve_simd16(2")     1.566[ms]  
       convolve_simd16(2')     1.419[ms]  
       convolve_simd16(2)      9.673[ms]  
       convolve_simd16(3)      9.486[ms]  
       convolve_simd16(3')     1.405[ms]  
       convolve_simd16(3")     1.567[ms]  

 Convolution 3
       convolve_simd8(4)       5.377[ms]  
       convolve_simd8(4')      4.439[ms]  
       convolve_simd8(4")      8.882[ms]  

Convolution 4
       convolve_simd8(5)       1.987[ms]  
       convolve_simd8(5')      1.785[ms]  
       convolve_simd8(5")      3.375[ms]  
       convolve_simd8(6')      1.748[ms]  
       convolve_simd8(6")      3.884[ms]  
       convolve_simd8(6)       2.055[ms]  

 Convolution 5
       convolve_simd8(7)       1.350[ms]  
       convolve_simd8(7')      1.141[ms]  
       convolve_simd8(7")      2.215[ms]  
       convolve_simd8(8)       1.351[ms]  
       convolve_simd8(8')      1.147[ms]  
       convolve_simd8(8")      2.271[ms]  

Those times are for the entire batch.
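As a worked example of using this mapping with the log above, the total for Convolution 2 in that run (for the whole batch) is the sum of its six kernel times:

       1.566 + 1.419 + 9.673 + 9.486 + 1.405 + 1.567 = 25.116 [ms]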