cudpp / cudpp

CUDA Data Parallel Primitives Library

Suffix Array test fails on testrig --all (Win7 x64, sm_21, CUDA 6.5)

harrism opened this issue

Windows 7 x64. Driver version 340.46. CUDA 6.5 RC.

When I run the test standalone, it passes:

NVS 4200M; global mem: 1073741824B; compute v2.1; clock: 1620000 kHz
Driver API: 6050; driver version: 6050; runtime version: 6050
Running a Suffix Array test of 4194304 uchar nodes
test PASSED

When I run cudpp_testrig.exe -all -iterations=1, it fails:

Running a Suffix Array test of 4194304 uchar nodes
test PASSED
Average execution time: 983.660339 ms
Running a Suffix Array test of 8388608 uchar nodes
cudpp_testrig.exe : ERROR, i = 0,     7647382 / 6306052 (reference / data)
At line:1 char:20
+ .\cudpp_testrig.exe <<<<  -all -iterations=1 > testrig_debug_all.txt 2>&1
    + CategoryInfo          : NotSpecified: (ERROR, i = 0,     ...ference / data)    :String) [], RemoteException
    + FullyQualifiedErrorId : NativeCommandError

ERROR, i = 1,     6480078 / 4194306 (reference / data)
...
...

Cuda error: freeSaStorage in file 'C:/src/cudpp_temp/cudpp/src/cudpp/app/sa_app.cu' in line 359 : invalid configuration argument.

test FAILED
Average execution time: 1666.174805 ms

Leyuan, could you make sure that the diff output when SA fails is of reasonable size? Print the first n errors, where n is small?

Mark, could you update to the CUDA 6.5 release? Now that it's out, there's no reason for us to support the RC, and there's a tiny chance they actually fixed something ...

I see the same error if I turn on -sm20 on a machine with compute capability 3.0, i.e., when the compiled SM does not match the compute capability of the machine. I wonder whether the cause could be that you turned on -sm20 while your hardware is actually sm_21?

Andy and I have tested with sm=20 and this error happens. Currently the problem is that suffix array, or sometimes scan (as Andrew found), won't pass when the target SM is exactly 2.0, but everything passes when you set SM > 2.0 (the next option in CUDPP is 3.0) in cmake -i.

Mark, we would appreciate your guidance here. The concern is that compiling CUDPP for SM 2.0 does not appear to work, whereas compiling for a higher SM version DOES work (on hardware capable of running either). Is that a CUDPP issue or an NVIDIA issue? I suspect it's not a CUDPP issue; I'm not sure how we could be compiling for SM 2.0 incorrectly if the same code works properly on a higher SM. In any case, if SM 2.0 code generation is broken but newer SM targets work properly, we may choose to address it with a release note.

In Hardware / CUDPP_SM form, the same error happens for SM 3.0 / SM 2.0 and for SM 5.0 / SM 2.0 when running bin/cudpp_testrig -all.

(I don't think we have any hardware that is exactly SM 2.0.)

SM 3.0 / SM 3.0 passes; SM 3.0 / SM 2.0 won't pass bin/cudpp_testrig -all.
SM 5.0 / SM 5.0 passes; SM 5.0 / SM 2.0 won't pass bin/cudpp_testrig -all.

SM 3.5 / SM 3.5 passes; SM 3.5 / SM 2.0 won't pass bin/cudpp_testrig -all, same error as Mark's.

Is PTX being generated, and if so what version? (I can take a look later, but thought someone might know.)

So your X/Y notation means (Actual Hardware) / (Compiled SASS version)?
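
One quick way to check is cuobjdump on the built library (the path here is illustrative):

# --list-ptx / --list-elf show which PTX and SASS images a binary embeds
cuobjdump --list-ptx lib/libcudpp.a
cuobjdump --list-elf lib/libcudpp.a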


Mark Harris wrote:

Is PTX being generated, and if so what version? (I can take a look later, but thought someone might know.)

By default, I think it's not.

So your X/Y notation means (Actual Hardware) / (Compiled SASS version)?

Correct.

Mark, I think the most useful next test is for you to compile to the SM that is native to your GPU (the highest SM version it will run) and see what happens.

OK, this is a can of worms.

1. Adding set(GENCODE_SM21 -gencode=arch=compute_20,code=sm_21 -gencode=arch=compute_20,code=compute_20) did indeed fix the failure I originally saw (see the CMake sketch after this list). However, I now have a very reproducible (but different) error for 8388608 nodes:

C:\src\cudpp_temp\cudpp-build\bin> .\cudpp_testrig.exe -sa -iterations=1 -n=8388608
Using device 0:
NVS 4200M; global mem: 1073741824B; compute v2.1; clock: 1620000 kHz
Driver API: 6050; driver version: 6050; runtime version: 6050
Running a Suffix Array test of 8388608 uchar nodes
CUDA ERROR 77 an illegal memory access was encountered

2. The original error was "invalid configuration", which implies requesting too many (or zero) threads per block or total blocks, too much shared memory, or similar (a small repro of the distinction appears at the end of this comment). If it were a problem of compiling for the wrong architecture, it would raise an Invalid Device Function error.

3. The code in question (sa_app.cu) has some issues. First, it calls cudaThreadSynchronize(), an API that was deprecated several CUDA versions ago in favor of cudaDeviceSynchronize() and cudaStreamSynchronize(). Second, it shouldn't be synchronizing at all outside of a debug build, since it uses the NULL stream.
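
For reference, here is the setting from point 1 in context -- a sketch only, assuming CUDPP consumes a GENCODE_SM21 variable the same way as its existing GENCODE_SM* options:

# Build sm_21 SASS for this GPU, and also embed compute_20 PTX so the
# driver can JIT for architectures lacking a matching SASS image.
set(GENCODE_SM21
    -gencode=arch=compute_20,code=sm_21
    -gencode=arch=compute_20,code=compute_20)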

I am not satisfied that this is a compilation issue -- and I certainly couldn't file an NV bug. If you must release without diagnosing this issue, I recommend putting a known issue in the release notes. (Just link to this issue). The long compile+run latency means I don't have time to diagnose it myself...
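
To illustrate the distinction in point 2, a minimal standalone sketch (the kernel here is hypothetical, not CUDPP code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel() {}

int main() {
    // 2048 threads per block exceeds the 1024-thread limit on sm_20 and
    // newer, so this launch fails with cudaErrorInvalidConfiguration
    // ("invalid configuration argument"). A kernel with no compatible SASS
    // or PTX for the device would instead report "invalid device function".
    dummyKernel<<<1, 2048>>>();
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}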

Mark Harris wrote:

1. Adding set(GENCODE_SM21 -gencode=arch=compute_20,code=sm_21 -gencode=arch=compute_20,code=compute_20) did indeed fix the failure I originally saw. However, I now have a very reproducible (but different) error for 8388608 nodes:
C:\src\cudpp_temp\cudpp-build\bin> .\cudpp_testrig.exe -sa -iterations=1 -n=8388608
Using device 0:
NVS 4200M; global mem: 1073741824B; compute v2.1; clock: 1620000 kHz
Driver API: 6050; driver version: 6050; runtime version: 6050
Running a Suffix Array test of 8388608 uchar nodes
CUDA ERROR 77 an illegal memory access was encountered

Yowch. OK. I ran the same (several times) on my laptop (SM 3.0) compiled with GENCODE_SM30; it runs correctly.

Then I compiled with your GENCODE_SM21 and got

Cuda error in file '/Users/jowens/Documents/working/cudpp/src/cudpp/app/sa_app.cu' in line 232 : unspecified launch failure.

Sigh.

2. The original error was "invalid configuration", which implies requesting too many (or zero) threads per block or total blocks, too much shared memory, or similar. If it were a problem of compiling for the wrong architecture, it would raise an Invalid Device Function error.

Just to check, an NVIDIA GPU of compute version X should correctly run code compiled for compute version Y if X >= Y?

3. The code in question (sa_app.cu) has some issues. First, it calls cudaThreadSynchronize(), an API that was deprecated several CUDA versions ago in favor of cudaDeviceSynchronize() and cudaStreamSynchronize(). Second, it shouldn't be synchronizing at all outside of a debug build, since it uses the NULL stream.

OK. I don't think we use streams anywhere in CUDPP at all, so this is just a search-and-replace of cudaThreadSynchronize with cudaDeviceSynchronize everywhere, I think (a one-liner sketch is below). @Laurawly?
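
Something like this, assuming GNU sed and that all the sources live under src/:

grep -rl cudaThreadSynchronize src/ | xargs sed -i 's/cudaThreadSynchronize/cudaDeviceSynchronize/g'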

I am not satisfied that this is a compilation issue -- and I certainly couldn't file an NV bug. If you must release without diagnosing this issue, I recommend putting a known issue in the release notes. (Just link to this issue). The long compile+run latency means I don't have time to diagnose it myself...

I'd like to try to figure out what's going on, but like you say, can of worms.

Just to check, an NVIDIA GPU of compute version X should correctly run code compiled for compute version Y if X >= Y?

Certainly not. But if PTX of compute_Y is included, then the driver should JIT it to sm_X and run it.
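
For example, flags along the lines of the GENCODE_SM21 setting above (the file name is illustrative):

# code=sm_21 embeds SASS for sm_21; code=compute_20 also embeds compute_20
# PTX, which the driver can JIT for any newer architecture.
nvcc -gencode=arch=compute_20,code=sm_21 \
     -gencode=arch=compute_20,code=compute_20 -c kernels.cu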

OK. I don't think we use streams anywhere in CUDPP at all, so this is just a search-and-replace of cudaThreadSynchronize with cudaDeviceSynchronize everywhere, I think. @Laurawly?

The only time to call cudaDeviceSynchronize() is when you are checking errors, and only in a debug build. We do this in the CUDPP_SAFE_CALL and CUDPP_CHECK_ERRORS (or whatever they are called) utility functions. They should be used instead; a sketch of the pattern follows.
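
Roughly this pattern; the macro names here are placeholders, since the real CUDPP names may differ:

#include <cstdio>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; report file and line on failure.
#define CUDA_SAFE_CALL(call)                                           \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess)                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
    } while (0)

#ifdef _DEBUG
// Synchronize only in debug builds, purely to surface asynchronous errors.
#define CUDA_CHECK_ERROR() CUDA_SAFE_CALL(cudaDeviceSynchronize())
#else
#define CUDA_CHECK_ERROR()
#endif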

Mark


Just to check, an NVIDIA GPU of compute version X should correctly run code compiled for compute version Y if X >= Y?

Certainly not. But if PTX of compute_Y is included, then the driver should JIT it to sm_X and run it.

Right, that's what I meant. How do you regard this particular issue that we're having, where sm_20 code JITed to something newer produces incorrect results?

I regard it as I wrote before.

1. The error I get is Invalid Configuration. If there were no valid binary for the present architecture, the error should be Invalid Device Function.
2. It only fails when I run the -all tests, not when I run the test standalone.

Therefore I think it is something more complex than just a JIT issue. Unfortunately it takes half an hour to compile every time I make a change, so further diagnosis is difficult.

The release notes already point to the GitHub issues list for known issues, so I think it's fine to release with this issue unfixed.


OK. Docs updated to indicate there’s an issue.