realfastvla / rfgpu

GPU-based gridding and imaging library for realfast

crash on new GPU

caseyjlaw opened this issue · comments

I ran a search with rfgpu on SDM data on one of our new GPUs and it crashed. The search runs fine on the GTX 1080.

The crash occurred on rfnode002, GPU device 1 (which I think is the Titan Xp). It seemed to run well for a while, but eventually crashed with this error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Array: 'cudaFree(d)' returned 'cudaErrorLaunchFailure'

Looking through my log, I see that at one point it made an image that had a ridiculously high SNR (1e15 or so). I was only triggering on SNR>15, so that was a very anomalous image. Perhaps related to the eventual crash?
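One thing worth noting about the traceback: a kernel launch failure in CUDA is "sticky". The error is raised inside an earlier kernel launch but only reported by the next runtime API call, which is why it surfaces at an unrelated cudaFree. A minimal defensive-checking sketch (the CUDA_CHECK macro and grid_kernel are hypothetical illustrations, not rfgpu's actual code) that would pin the failure to the kernel that caused it:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if any CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err__));                       \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in for a gridding/imaging kernel.
__global__ void grid_kernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.0f;
}

int main() {
    const int n = 1024;
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    grid_kernel<<<(n + 255) / 256, 256>>>(d, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

With checks like these, a corrupted kernel execution (possibly the one that produced the SNR ~1e15 image) would be reported at the launch site rather than at a later cudaFree.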

Actually, it may be that I'm not using cudaSetDevice correctly. If I try to select device number 2, cudaSetDevice returns error 30.
The original issue described above used device number 1, which may also be incorrect. Suggestions?
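For diagnosing the device-index question, one approach is to enumerate the visible devices before calling cudaSetDevice, so an out-of-range index or a card the driver has dropped is caught explicitly. A minimal sketch (the hard-coded index 2 just mirrors the report above; if I recall the legacy pre-10.1 error enum correctly, code 30 was cudaErrorUnknown):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // List every device the runtime can see, with its index.
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("device %d: %s\n", i, prop.name);
    }
    int want = 2;  // the index that returned error 30 in the report
    if (want >= n) {
        fprintf(stderr, "device %d not visible (only %d found)\n", want, n);
        return 1;
    }
    err = cudaSetDevice(want);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaSetDevice(%d): %s\n", want, cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

If cudaGetDeviceCount itself reports fewer cards than the machine holds (as nvidia-smi later did on rfnode002), that points at a driver or hardware problem rather than an indexing bug in the calling code.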

Well, now I'm confused because I see this error when trying to use any of the GPUs on rfnode002. This behavior started near the end of Thursday, but seems to have not affected the CBE-based tests we did throughout Thursday.
Do you think that the swapping of CBE code could affect how CUDA is deployed?

Ok, I confirmed that the rfgpu code runs on the original GPUs on rfnode001 and rfnode003. Perhaps this was a persistent issue on rfnode002 but I just didn't notice during yesterday's CBE tests.
I'll ask K Scott.

To briefly summarize some of the email discussion: By the time I looked at it, the GPUs on rfnode002 were in a bad state. nvidia utility programs only listed two of them, and there were many GPU-related errors in dmesg. A reboot seems to have cleared up all this, and simple test programs using rfgpu run OK on all cards. It's not yet clear whether our code somehow caused all these problems, or if there is a hardware problem somewhere in this system. Will just need to keep a lookout for similar issues while continuing testing I guess.

I think we decided that one of the cards in this system was defective so I'm closing this. Let me know if you see any other problems of course!