amd / OpenCL-caffe

This is an experimental OpenCL version of Caffe by AMD Research. We now recommend the official BVLC Caffe OpenCL branch at https://github.com/BVLC/caffe/tree/opencl

test_gradient_based_solver fails

mpekalski opened this issue

I have a problem with make runtest failing on SGDSolver and NesterovSolver. I looked at the BVLC/caffe repository (BVLC/caffe#3109), where somebody reported a problem coming from the same file, test_gradient_based_solver.cpp. According to the comments there, it was caused either by multiple GPUs being present in the system or by Intel MKL's floating-point operations (such as matrix multiplication) being non-deterministic by default.

Regarding my system: I am running Caffe cloned from GitHub on 22 December 2015, on Ubuntu 15.10 with a Radeon R9 290 (4GB), an i7-4770K CPU @ 3.50GHz, and AMDAPPSDK-3.0. Four tests failed.

If anybody knows how to make them pass, or what causes the problem, that would be great.

$ fglrxinfo
display: :0  screen: 0
OpenGL vendor string: Advanced Micro Devices, Inc.
OpenGL renderer string: AMD Radeon R9 200 Series
OpenGL version string: 4.5.13399 Compatibility Profile Context 15.201.1151
/OpenCL-caffe/build$ make runtest
[  1%] Built target proto
[ 55%] Built target caffe
[ 55%] Built target gtest
[100%] Built target test.testbin
Current device id: 0
Note: Google Test filter = -*GPU*
Note: Randomizing tests' orders with a seed of 3162 .
...................
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00015804916620254517, which exceeds error_margin, where
expected_param evaluates to 0.013755172491073608,
accum_param evaluates to 0.013913221657276154, and
error_margin evaluates to 0.00013755173131357878.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00015809759497642517, which exceeds error_margin, where
expected_param evaluates to 0.00093255564570426941,
accum_param evaluates to 0.0010906532406806946, and
error_margin evaluates to 9.3255566753214225e-06.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00015811249613761902, which exceeds error_margin, where
expected_param evaluates to -0.010214578360319138,
accum_param evaluates to -0.010056465864181519, and
error_margin evaluates to 0.00010056465544039384.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00015810877084732056, which exceeds error_margin, where
expected_param evaluates to -0.0109538733959198,
accum_param evaluates to -0.010795764625072479, and
error_margin evaluates to 0.0001079576468328014.
src/caffe/test/test_gradient_based_solver.cpp:315: Failure
The difference between expected_bias and accum_bias is 0.020383119583129883, which exceeds error_margin, where
expected_bias evaluates to -1.2308368682861328,
accum_bias evaluates to -1.2512199878692627, and
error_margin evaluates to 0.01230836845934391.
[  FAILED  ] SGDSolverTest/2.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<float> (72 ms)
.....................
src/caffe/test/test_gradient_based_solver.cpp:315: Failure
The difference between expected_bias and accum_bias is 16.290860759216979, which exceeds error_margin, where
expected_bias evaluates to 30.680854565170151,
accum_bias evaluates to 14.389993805953171, and
error_margin evaluates to 0.14389993805953172.
[  FAILED  ] NesterovSolverTest/3.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<double> (12 ms)
.....................
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.0009438430034522105, which exceeds error_margin, where
expected_param evaluates to -0.0075809667410185466,
accum_param evaluates to -0.0085248097444707571, and
error_margin evaluates to 7.5809667410185473e-05.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345218275, which exceeds error_margin, where
expected_param evaluates to 0.073903926620193378,
accum_param evaluates to 0.072960083616741195, and
error_margin evaluates to 0.00072960083616741196.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345218275, which exceeds error_margin, where
expected_param evaluates to 0.057955563277547251,
accum_param evaluates to 0.057011720274095068, and
error_margin evaluates to 0.00057011720274095074.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345218275, which exceeds error_margin, where
expected_param evaluates to -0.040239520053002617,
accum_param evaluates to -0.0411833630564548, and
error_margin evaluates to 0.00040239520053002618.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345230765, which exceeds error_margin, where
expected_param evaluates to -0.054015869066066119,
accum_param evaluates to -0.054959712069518427, and
error_margin evaluates to 0.00054015869066066115.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345244643, which exceeds error_margin, where
expected_param evaluates to -0.035164120485524253,
accum_param evaluates to -0.036107963488976699, and
error_margin evaluates to 0.00035164120485524255.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345243255, which exceeds error_margin, where
expected_param evaluates to 0.023867411055095877,
accum_param evaluates to 0.022923568051643445, and
error_margin evaluates to 0.00022923568051643446.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345229377, which exceeds error_margin, where
expected_param evaluates to -0.023262755953808445,
accum_param evaluates to -0.024206598957260739, and
error_margin evaluates to 0.00023262755953808447.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.0009438430034522105, which exceeds error_margin, where
expected_param evaluates to 0.022473440985742582,
accum_param evaluates to 0.021529597982290372, and
error_margin evaluates to 0.00021529597982290373.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345204397, which exceeds error_margin, where
expected_param evaluates to -0.094111263940639428,
accum_param evaluates to -0.095055106944091472, and
error_margin evaluates to 0.00094111263940639428.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345215499, which exceeds error_margin, where
expected_param evaluates to -0.052540014489286579,
accum_param evaluates to -0.053483857492738734, and
error_margin evaluates to 0.0005254001448928658.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.0009438430034522105, which exceeds error_margin, where
expected_param evaluates to -0.074735299450877965,
accum_param evaluates to -0.075679142454330176, and
error_margin evaluates to 0.00074735299450877965.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.0009438430034522105, which exceeds error_margin, where
expected_param evaluates to -0.032339898640959791,
accum_param evaluates to -0.033283741644412002, and
error_margin evaluates to 0.00032339898640959791.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345218275, which exceeds error_margin, where
expected_param evaluates to 0.089805274972429433,
accum_param evaluates to 0.08886143196897725, and
error_margin evaluates to 0.00088861431968977251.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345243255, which exceeds error_margin, where
expected_param evaluates to 0.079982388023672302,
accum_param evaluates to 0.079038545020219869, and
error_margin evaluates to 0.00079038545020219866.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345229377, which exceeds error_margin, where
expected_param evaluates to -0.046441101129402113,
accum_param evaluates to -0.047384944132854406, and
error_margin evaluates to 0.00046441101129402113.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345215499, which exceeds error_margin, where
expected_param evaluates to 0.073243776031034963,
accum_param evaluates to 0.072299933027582808, and
error_margin evaluates to 0.00072299933027582806.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345218275, which exceeds error_margin, where
expected_param evaluates to 0.036783959999803301,
accum_param evaluates to 0.035840116996351118, and
error_margin evaluates to 0.0003584011699635112.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345218275, which exceeds error_margin, where
expected_param evaluates to 0.074800868345276994,
accum_param evaluates to 0.073857025341824811, and
error_margin evaluates to 0.00073857025341824811.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345229377, which exceeds error_margin, where
expected_param evaluates to -0.087903557731515064,
accum_param evaluates to -0.088847400734967358, and
error_margin evaluates to 0.00087903557731515067.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.0009438430034522105, which exceeds error_margin, where
expected_param evaluates to -0.09313550033631654,
accum_param evaluates to -0.094079343339768751, and
error_margin evaluates to 0.00093135500336316546.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345232153, which exceeds error_margin, where
expected_param evaluates to 0.0023059492350265554,
accum_param evaluates to 0.0013621062315742338, and
error_margin evaluates to 1.3621062315742339e-05.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.0009438430034522105, which exceeds error_margin, where
expected_param evaluates to 0.018183528710792568,
accum_param evaluates to 0.017239685707340358, and
error_margin evaluates to 0.00017239685707340358.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345237704, which exceeds error_margin, where
expected_param evaluates to 0.040265922015119665,
accum_param evaluates to 0.039322079011667288, and
error_margin evaluates to 0.00039322079011667289.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.00094384300345218275, which exceeds error_margin, where
expected_param evaluates to 0.091959157638004163,
accum_param evaluates to 0.09101531463455198, and
error_margin evaluates to 0.00091015314634551982.
src/caffe/test/test_gradient_based_solver.cpp:315: Failure
The difference between expected_bias and accum_bias is 0.11647201251983841, which exceeds error_margin, where
expected_bias evaluates to -1.9010368863451557,
accum_bias evaluates to -1.7845648738253173, and
error_margin evaluates to 0.017845648738253173.
[  FAILED  ] SGDSolverTest/3.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<double> (16 ms)
.....................
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.057497024536132812, which exceeds error_margin, where
expected_param evaluates to -5.7565555572509766,
accum_param evaluates to -5.6990585327148438, and
error_margin evaluates to 0.056990586221218109.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.057497501373291016, which exceeds error_margin, where
expected_param evaluates to -5.6071248054504395,
accum_param evaluates to -5.5496273040771484, and
error_margin evaluates to 0.055496271699666977.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.057498931884765625, which exceeds error_margin, where
expected_param evaluates to -5.7071495056152344,
accum_param evaluates to -5.6496505737304688, and
error_margin evaluates to 0.056496504694223404.
src/caffe/test/test_gradient_based_solver.cpp:308: Failure
The difference between expected_param and accum_param is 0.057498931884765625, which exceeds error_margin, where
expected_param evaluates to -5.3606395721435547,
accum_param evaluates to -5.3031406402587891, and
error_margin evaluates to 0.053031407296657562.
src/caffe/test/test_gradient_based_solver.cpp:315: Failure
The difference between expected_bias and accum_bias is 4.0414938926696777, which exceeds error_margin, where
expected_bias evaluates to -9.2879724502563477,
accum_bias evaluates to -5.2464785575866699, and
error_margin evaluates to 0.052464786916971207.
[  FAILED  ] NesterovSolverTest/2.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<float> (19 ms)
....................
[  FAILED  ] 4 tests, listed below:
[  FAILED  ] SGDSolverTest/2.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] SGDSolverTest/3.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<double>
[  FAILED  ] NesterovSolverTest/2.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<float>
[  FAILED  ] NesterovSolverTest/3.TestLeastSquaresUpdateWithEverythingAccum, where TypeParam = caffe::GPUDevice<double>
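For anyone reading the log above, here is a minimal sketch of the comparison the failing assertions appear to make. The 1% relative margin (taken against the smaller magnitude) is inferred from the printed error_margin values, not from the test source, so treat it as an assumption. The pairs are copied from the SGDSolverTest/3 output; the function names are hypothetical.

```python
def error_margin(expected, actual, precision=0.01):
    # Assumption inferred from the log: margin = 1% of the smaller magnitude.
    return precision * min(abs(expected), abs(actual))


def within_margin(expected, actual, precision=0.01):
    # True when the absolute difference fits inside the relative margin.
    return abs(expected - actual) <= error_margin(expected, actual, precision)


# (expected_param, accum_param) pairs copied from the SGDSolverTest/3 output.
pairs = [
    (-0.0075809667410185466, -0.0085248097444707571),
    (0.073903926620193378, 0.072960083616741195),
    (0.057955563277547251, 0.057011720274095068),
]

# Signed differences, to see whether the error varies from parameter to parameter.
diffs = [e - a for e, a in pairs]

for (e, a), d in zip(pairs, diffs):
    print(f"diff={d:.12f} margin={error_margin(e, a):.12f} ok={within_margin(e, a)}")
```

For these pairs the signed difference is practically the same value regardless of the parameter's size. That may point to a systematic offset rather than rounding noise, but that reading is a guess, not a diagnosis.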

I just found out that clinfo shows that I have two devices (GPUs?) although physically I have one. Maybe that is the reason for the FAILED tests above.

But how can I force the tests to use only one device?

$ clinfo
Number of platforms:                 1
  Platform Profile:              FULL_PROFILE
  Platform Version:              OpenCL 2.0 AMD-APP (1800.8)
  Platform Name:                 AMD Accelerated Parallel Processing
  Platform Vendor:               Advanced Micro Devices, Inc.
  Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 

  Platform Name:                 AMD Accelerated Parallel Processing
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU

AMD platform includes CPU and GPU devices. It’s up to the application to choose the appropriate device, but, I imagine, it should be defaulting to the GPU device.
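To check which kinds of devices the platform actually exposes, one option is to count the "Device Type:" lines in the clinfo output. The sketch below assumes clinfo prints one such line per device, as in the excerpt above. The sample text is illustrative only (it is not the poster's full output), and the function name is hypothetical.

```python
import re
from collections import Counter


def count_device_types(clinfo_text):
    # Collect the value after every "Device Type:" label, one per device.
    types = re.findall(r"^\s*Device Type:\s*(\S+)", clinfo_text, flags=re.MULTILINE)
    return Counter(types)


# Illustrative sample only: a GPU plus a CPU device on a single platform.
sample = """\
Number of devices:               2
  Device Type:                   CL_DEVICE_TYPE_GPU
  Device Type:                   CL_DEVICE_TYPE_CPU
"""

print(dict(count_device_types(sample)))
```

On a real system the text could come from `subprocess.run(["clinfo"], capture_output=True, text=True).stdout` instead of the sample string.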

Is it possible to limit the number of visible devices by setting an environment variable?
With NVIDIA one can do it with export CUDA_VISIBLE_DEVICES=0.
For more details see BVLC/caffe#2926

Bystander observation: I would think that it would be better to choose specifically which GPU to use, rather than to choose how many GPUs to use.

I suspect that Marcin only has a single GPU, but there are two devices: CPU + GPU.

Marcin,

If you want to disable the CPU device, you can set the environment variable CPU_MAX_COMPUTE_UNITS to 0, but I don’t think it will fix your problem. Can you make sure you are running the latest drivers from AMD? Version 1800 seems to be about 6 months old.

Jeff
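A minimal sketch of running the tests with the variable mentioned above set only for the child process. CPU_MAX_COMPUTE_UNITS=0 comes from the comment above; whether it hides the CPU device on a given driver version is not verified here. The helper names and the build directory argument are hypothetical.

```python
import os
import subprocess


def env_without_cpu_device(base_env=None):
    # Copy the environment so the calling shell's settings are left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Variable name taken from the comment above; its effect is an assumption.
    env["CPU_MAX_COMPUTE_UNITS"] = "0"
    return env


def run_tests(build_dir):
    # Runs `make runtest` inside build_dir with the modified environment.
    result = subprocess.run(["make", "runtest"], cwd=build_dir,
                            env=env_without_cpu_device())
    return result.returncode


# Show the environment that would be passed, using a tiny example base.
print(env_without_cpu_device({"PATH": "/usr/bin"}))
```

Calling `run_tests("build")` from the OpenCL-caffe checkout would then be equivalent to `CPU_MAX_COMPUTE_UNITS=0 make runtest` in that directory.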

I also have this issue.

Ok guys. Hold on, we will look into this soon.
Junli

Could it be related to some failing tests in clBLAS?

Well, this time I made sure clBLAS had passed test-functional and test-short before I tried installing caffe, but I just tried running some of those tests again and they don't work anymore. I'm honestly not really sure if this means caffe is breaking clBLAS or if there is some user error on my part here.

Initialize OpenCL and clblas...
---- Advanced Micro Devices, Inc.
SetUp: about to create command queues
[==========] Running 715 tests from 5 test cases.
[----------] Global test environment set-up.
[----------] 203 tests from ERROR
[ RUN ] ERROR.InvalidCommandQueue
OpenCL error -36 on line 350 of /jenkins/workspace/workspace/Build_Linux_Master_clBLAS/Bitness/64/Configuration/Release/label/acml-build-lin2/src/library/blas/xgemm.cc
Segmentation fault (core dumped)
nathan@amdRig14://home/nathan/clBLAS/build/staging$

I have another system with a more minimal installation of clBLAS and caffe. I'm going to switch to that and see if clBLAS is still working there.

This clBLAS error you have there from test-functional has been fixed in the latest develop branch (PR #214) by Timmy Liu. It still breaks further on, but the issue is much the same: the method returns instead of throwing an error, or something like that.

Right, I think I was trying to test the wrong version of clBLAS on that system. Where I am sitting now, I only have the current develop version, and I get this output:

./test-functional

[----------] 136 tests from QUEUE (67714 ms total)

[----------] Global test environment tear-down
[==========] 715 tests from 5 test cases ran. (330057 ms total)
[ PASSED ] 714 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] THREAD.sgemm

I don't recall any of these failing the first time I ran this test. I'm trying to reinstall clBLAS and now I'm getting this issue:

Linking Fortran executable ../staging/test-correctness
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:658: undefined reference to `cdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:673: undefined reference to `zdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:688: undefined reference to `cdotcsub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:703: undefined reference to `zdotcsub_'
collect2: error: ld returned 1 exit status
make[2]: *** [staging/test-correctness] Error 1
make[1]: *** [tests/CMakeFiles/test-correctness.dir/all] Error 2

So, I think what I have now is similar to these issues:
clMathLibraries/clBLAS#184
clMathLibraries/clBLAS#142

I remember the clBLAS test-functional and test-short worked before I installed caffe, but I installed more blas libraries between when the clBLAS test worked, and when I did the runtest for caffe. Atlas for example.

This clBLAS issue seems to be caused by a conflict between different blas libraries, so I think getting those dependencies together for caffe after installing clBLAS might have introduced some sort of conflict.

Right now my plan is to just go into the clBLAS cmake files and try to make sure it's getting the same libblas.so it originally used when installed the first time.

The fact that introducing new BLAS libraries after installing clBLAS seems to have broken it retroactively spells trouble, though. I think what really needs to be done is to minimize the number of different BLAS libraries needed to install OpenCL-caffe and its dependencies.

Given that OpenCL caffe needs a BLAS external to clBLAS, I suppose I should have tried to use the same one I used to install clBLAS, and then maybe this could have been avoided.

Same thing here. R9 270X with Xeon 1241.
clBLAS tests complete successfully.
Ubuntu 14.04

Just installed Ubuntu 15.10.
The same thing happens as before (on 14.04).

There is also a user, @doonny, in the original caffe issue with the same problem running on a W9100.

I'm having a similar issue running on a W9100, I'm able to run the built in lenet training script but am unable to run anything like 'caffe train -solver etc'

Uhh? Maybe fix? No?

Is your setup still not working?

Haven't tested since last time. I don't think anything has changed.

Thanks for letting us know about this issue. I was on holiday break for the past two weeks. We will look into this soon.

Junli

Junli Gu--谷俊丽
Coordinated Science Lab
University of Illinois at Urbana-Champaign


I managed to get past the following errors:

Linking Fortran executable ../staging/test-correctness
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:658: undefined reference to `cdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotu':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:673: undefined reference to `zdotusub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `cdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:688: undefined reference to `cdotcsub_'
CMakeFiles/test-correctness.dir/correctness/blas-lapack.c.o: In function `zdotc':
/home/nathan/Downloads/clBLAS/src/tests/correctness/blas-lapack.c:703: undefined reference to `zdotcsub_'
collect2: error: ld returned 1 exit status

My solution might be way too hacky, but it works.

Context:

I installed blas using:
sudo apt-get install libopenblas-base libopenblas-dev

I observed that a new directory, /usr/lib/openblas-base, was created, and that it contained a file libblas.so. The same file was also present in /usr/lib; diff confirmed that both files are identical. CMakeCache.txt confirmed that this is the linked library: Netlib_BLAS_LIBRARY:FILEPATH=/usr/lib/libblas.so
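A minimal sketch for pulling that entry out of CMakeCache.txt, so it is easy to see which BLAS a clBLAS build picked up. It assumes the usual NAME:TYPE=VALUE cache format with // and # comment lines. The function name is hypothetical, and the sample text contains only the one entry quoted above.

```python
def read_cmake_cache(text):
    # Parse NAME:TYPE=VALUE lines into a dict of NAME -> VALUE.
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks and the two comment styles used in cache files.
        if not line or line.startswith(("#", "//")):
            continue
        key_and_type, sep, value = line.partition("=")
        if not sep:
            continue
        name = key_and_type.partition(":")[0]
        entries[name] = value
    return entries


# Sample containing only the entry quoted in the comment above.
sample_cache = """\
// comment lines like this one are skipped
Netlib_BLAS_LIBRARY:FILEPATH=/usr/lib/libblas.so
"""

print(read_cmake_cache(sample_cache)["Netlib_BLAS_LIBRARY"])
```

On a real build tree the text would come from reading CMakeCache.txt in the build directory.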

I opened clBLAS/src/tests/correctness/blas-lapack.c; the zdotu function is conditionally compiled depending on the OS. I ran elfread libblas.so | grep zdotusub_ and the symbol was not found. There was, however, a function cblas_zdotu_sub, which is the one the code calls when the OS is Apple's. So I replaced the respective lines with the calling convention used on the Apple platform, and it worked.
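The symbol check can also be scripted. The sketch below parses output in the style of `nm -D --defined-only` (GNU binutils), which is a swapped-in tool and not the command used above. It reports which of the two names from the comment are exported. The sample output is illustrative, with a made-up address, and the function name is hypothetical.

```python
def defined_symbols(nm_output):
    # Each line is expected to look like: <address> <type letter> <name>.
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            names.add(fields[-1])
    return names


# Illustrative output only; the address is made up.
sample_nm = """\
0000000000001000 T cblas_zdotu_sub
"""

wanted = ["zdotusub_", "cblas_zdotu_sub"]
found = defined_symbols(sample_nm)
for name in wanted:
    print(name, "present" if name in found else "missing")
```

With a real library, the input could be `subprocess.run(["nm", "-D", "--defined-only", "/usr/lib/libblas.so"], capture_output=True, text=True).stdout`.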

Please confirm whether this is reproducible; I would like to make my first ever PR :)

PS: I do not fully understand the code. I have no idea why there are different signatures for different platforms. That's why I described the solution as a hack.

Regards,
Sagar

Bump?