GPU Error

Question

Wrongful opened this issue 7 years ago

I'm doing as the instructions say, and I'm running:
th sample.lua -checkpoint cv/checkpoint_312400.t7 -length 500
But it comes out with this error:

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-3963/cutorch/lib/THC/THCGeneral.c line=66 error=30 : unknown error
/home/wrongful/torch/install/bin/luajit: /home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: cuda runtime error (30) : unknown error at /tmp/luarocks_cutorch-scm-1-3963/cutorch/lib/THC/THCGeneral.c:66
stack traceback:
[C]: in function 'error'
/home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
sample.lua:24: in main chunk
[C]: in function 'dofile'
...gful/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

I've never gotten this error before, and I've even tried re-installing cutorch, with no change. It's clearly linked to the GPU, since it works fine when it's run on CPU, but how can I fix this?

antihutka · Answer 1 · Sun May 07 2017 06:26:53 GMT+0800 (China Standard Time)

This could be a torch or driver/CUDA issue. Do other CUDA applications work?
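One way to check this is sketched below. It assumes `nvidia-smi` and Torch's `th` launcher are on the PATH, and the function name `cuda_smoke_test` is made up for illustration. It asks the driver to list its GPUs, then asks cutorch how many devices it sees, so a failure can be pinned on one layer or the other.

```shell
#!/bin/sh
# Hypothetical helper: checks the driver and cutorch separately,
# so a failure points at one layer or the other.
cuda_smoke_test() {
    if command -v nvidia-smi >/dev/null 2>&1; then
        # -L lists each GPU the driver can see, one per line.
        nvidia-smi -L || echo "driver: nvidia-smi failed (exit $?)"
    else
        echo "driver: nvidia-smi not found on PATH"
    fi

    if command -v th >/dev/null 2>&1; then
        # Loading cutorch initializes CUDA, the same step that fails at
        # sample.lua:24 in the trace; any CUDA error shows up in the output.
        th -e "require 'cutorch'; print('cutorch devices: ' .. cutorch.getDeviceCount())"
    else
        echo "torch: th not found on PATH"
    fi
}

cuda_smoke_test
```

If `nvidia-smi -L` lists the card but the cutorch line reproduces the error, the Torch/CUDA side is the more likely suspect. If `nvidia-smi` itself fails, look at the driver first.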

Wrongful · Answer 2 · Sun May 07 2017 06:51:39 GMT+0800 (China Standard Time)

I don't have many, but they appear to be working. Running $ lspci -v shows the GPU in the list of devices. I also downloaded a GPU stress test called "glxgears" and ran it, and it looks fine.

Wrongful · Answer 3 · Tue May 09 2017 05:20:24 GMT+0800 (China Standard Time)

Strange... I just tried it on GPU, and it works. I didn't do anything.
Well, I guess this issue is solved, somehow...

Wrongful · Answer 4 · Thu Aug 31 2017 06:57:29 GMT+0800 (China Standard Time)

I've returned to this issue four months after I posted it, and I have the same problem as before, with slightly different numbers: 70 where 66 was in the original error, and 2688 where 3963 was.

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2688/cutorch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
/home/wrongful/torch/install/bin/luajit: /home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: cuda runtime error (30) : unknown error at /tmp/luarocks_cutorch-scm-1-2688/cutorch/lib/THC/THCGeneral.c:70
stack traceback:
[C]: in function 'error'
/home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
sample.lua:24: in main chunk
[C]: in function 'dofile'
...gful/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

If anyone has encountered and solved this error before, could you please explain how to fix it?

Wrongful · Answer 5 · Fri Sep 01 2017 06:26:29 GMT+0800 (China Standard Time)

Oh my god. I just tried running the command, and just like last time, the issue vanished for no reason.

For future reference, if you have a problem like this, turn off your computer for a while. If you come back and it's still there, wait longer. It'll vanish eventually.

Now, if you'll excuse me, I'm going to go cry in the corner.

Chris Cummins · Answer 6 · Fri Sep 01 2017 07:19:30 GMT+0800 (China Standard Time)

I'm glad you resolved your problem :)

Intermittent CUDA failures may point to a driver problem, or perhaps a hardware one. If the problem resurfaces, I'd start by checking $ nvidia-smi, the $CUDA_VISIBLE_DEVICES environment variable, and other CUDA apps, as @antihutka suggested.
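A rough sketch of those first checks as a script, for recording the state of the machine when the error comes back. The function name `gpu_report` is hypothetical, and the kernel-module check is an added assumption (Linux with the proprietary NVIDIA driver), not something suggested in this thread.

```shell
#!/bin/sh
# Hypothetical helper: prints the facts worth recording when the error returns.
gpu_report() {
    # An empty or wrong CUDA_VISIBLE_DEVICES hides GPUs from CUDA programs.
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"

    # Assumption: Linux with the proprietary NVIDIA driver, whose kernel
    # modules appear in /proc/modules with names starting with "nvidia".
    if [ -r /proc/modules ]; then
        mods=$(grep -c '^nvidia' /proc/modules || true)
        echo "nvidia kernel modules loaded: $mods"
    fi

    # Full driver status: GPU model, driver version, memory, running processes.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi || echo "nvidia-smi exited with status $?"
    else
        echo "nvidia-smi not found on PATH"
    fi
}

gpu_report
```

Running it once while sampling works and again when the error appears gives two reports to compare, which is more to go on than waiting for the problem to vanish.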

Closing this as it doesn't seem relevant to this specific project.