realfastvla / rfgpu

GPU-based gridding and imaging library for realfast

crash on new GPU

caseyjlaw opened this issue · comments

I ran a search with rfgpu on SDM data on one of our new GPUs and it crashed. The search runs fine on the GTX 1080.

The crash occurred on rfnode002, GPU device 1 (which I think is the Titan Xp). It seemed to run well for a while, but eventually crashed with this error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Array: 'cudaFree(d)' returned 'cudaErrorLaunchFailure'

Looking through my log, I see that at one point it made an image that had a ridiculously high SNR (1e15 or so). I was only triggering on SNR>15, so that was a very anomalous image. Perhaps related to the eventual crash?
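One thing worth noting about the traceback: a kernel launch failure in CUDA is "sticky". The error is raised inside an earlier kernel launch but only reported by the next runtime API call, which is why it surfaces at an unrelated cudaFree. A minimal defensive-checking sketch (the CUDA_CHECK macro and grid_kernel are hypothetical illustrations, not rfgpu's actual code) that would pin the failure to the kernel that caused it:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if any CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err__));                       \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in for a gridding/imaging kernel.
__global__ void grid_kernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.0f;
}

int main() {
    const int n = 1024;
    float *d = nullptr;
    CUDA_CHECK(cudaMalloc(&d, n * sizeof(float)));
    grid_kernel<<<(n + 255) / 256, 256>>>(d, n);
    CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran
    CUDA_CHECK(cudaFree(d));
    return 0;
}
```

With checks like these, a corrupted kernel execution (possibly the one that produced the SNR ~1e15 image) would be reported at the launch site rather than at a later cudaFree.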

Actually, it may be that I'm not using cudaSetDevice correctly. If I try to select device number 2, cudaSetDevice returns error 30.
The original issue described above used device number 1, which may also be incorrect. Suggestions?
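For diagnosing the device-index question, one approach is to enumerate the visible devices before calling cudaSetDevice, so an out-of-range index or a card the driver has dropped is caught explicitly. A minimal sketch (the hard-coded index 2 just mirrors the report above; if I recall the legacy pre-10.1 error enum correctly, code 30 was cudaErrorUnknown):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // List every device the runtime can see, with its index.
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
            printf("device %d: %s\n", i, prop.name);
    }
    int want = 2;  // the index that returned error 30 in the report
    if (want >= n) {
        fprintf(stderr, "device %d not visible (only %d found)\n", want, n);
        return 1;
    }
    err = cudaSetDevice(want);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaSetDevice(%d): %s\n", want, cudaGetErrorString(err));
    return err == cudaSuccess ? 0 : 1;
}
```

If cudaGetDeviceCount itself reports fewer cards than the machine holds (as nvidia-smi later did on rfnode002), that points at a driver or hardware problem rather than an indexing bug in the calling code.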

Well, now I'm confused because I see this error when trying to use any of the GPUs on rfnode002. This behavior started near the end of Thursday, but seems to have not affected the CBE-based tests we did throughout Thursday.
Do you think that the swapping of CBE code could affect how CUDA is deployed?

Ok, I confirmed that the rfgpu code runs on the original GPUs on rfnode001 and rfnode003. Perhaps this was a persistent issue on rfnode002 but I just didn't notice during yesterday's CBE tests.
I'll ask K Scott.

To briefly summarize some of the email discussion: By the time I looked at it, the GPUs on rfnode002 were in a bad state. nvidia utility programs only listed two of them, and there were many GPU-related errors in dmesg. A reboot seems to have cleared up all this, and simple test programs using rfgpu run OK on all cards. It's not yet clear whether our code somehow caused all these problems, or if there is a hardware problem somewhere in this system. Will just need to keep a lookout for similar issues while continuing testing I guess.

I think we decided that one of the cards in this system was defective so I'm closing this. Let me know if you see any other problems of course!