jcjohnson / torch-rnn

Efficient, reusable RNNs and LSTMs for torch

GPU Error

Wrongful opened this issue:

I'm doing as the instructions say, and I'm running:
th sample.lua -checkpoint cv/checkpoint_312400.t7 -length 500
But it comes out with this error:

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-3963/cutorch/lib/THC/THCGeneral.c line=66 error=30 : unknown error
/home/wrongful/torch/install/bin/luajit: /home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: cuda runtime error (30) : unknown error at /tmp/luarocks_cutorch-scm-1-3963/cutorch/lib/THC/THCGeneral.c:66
stack traceback:
[C]: in function 'error'
/home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
sample.lua:24: in main chunk
[C]: in function 'dofile'
...gful/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

I've never gotten this error before, and I've even tried re-installing cutorch, with no change. It's clearly linked to the GPU, since it works fine when it's run on CPU, but how can I fix this?

This could be a torch or driver/CUDA issue. Do other CUDA applications work?

I don't have many, but they appear to be working... Running $ lspci -v shows the GPU in the list of devices. I also downloaded a GPU stress test called "glxgears" and ran it, and it looks fine.

Strange... I just tried it on GPU, and it works. I didn't do anything.
Well, I guess this issue is solved, somehow...

I've returned to this issue four months after I posted it, and I have the same problem as before, with slightly different numbers: the error now points to line 70 of THCGeneral.c instead of line 66, and the temporary build directory ends in 2688 instead of 3963.

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2688/cutorch/lib/THC/THCGeneral.c line=70 error=30 : unknown error
/home/wrongful/torch/install/bin/luajit: /home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: cuda runtime error (30) : unknown error at /tmp/luarocks_cutorch-scm-1-2688/cutorch/lib/THC/THCGeneral.c:70
stack traceback:
[C]: in function 'error'
/home/wrongful/torch/install/share/lua/5.1/trepl/init.lua:389: in function 'require'
sample.lua:24: in main chunk
[C]: in function 'dofile'
...gful/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

If anyone has encountered and solved this error before, could you please explain how to fix it?

Oh my god. I just tried running the command, and just like last time, the issue vanished for no reason.

For future reference, if you have a problem like this, turn off your computer for a while. If you come back and it's still there, wait longer. It'll vanish eventually.

Now, if you'll excuse me, I'm going to go cry in the corner.

I'm glad you resolved your problem :)

Intermittent CUDA failures may be caused by your drivers, or perhaps your hardware. If the problem resurfaces, I'd start by running $ nvidia-smi, checking $CUDA_VISIBLE_DEVICES, and trying other CUDA apps, as @antihutka suggested.
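
As a rough starting point, the first two checks could be wrapped in a small shell function like the one below. This is only a sketch: the function name check_cuda_env is made up for illustration, and it assumes nvidia-smi was installed along with the NVIDIA driver. It reports whether CUDA_VISIBLE_DEVICES is set and whether nvidia-smi can talk to the driver; it does not diagnose the cause of error 30.

```shell
#!/bin/sh
# check_cuda_env is a hypothetical helper name, not part of torch-rnn or CUDA.
check_cuda_env() {
    # CUDA_VISIBLE_DEVICES restricts which GPUs CUDA programs can see,
    # so report its value (or that it is unset) first.
    if [ -n "${CUDA_VISIBLE_DEVICES+x}" ]; then
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"
    else
        echo "CUDA_VISIBLE_DEVICES is unset"
    fi

    # nvidia-smi ships with the NVIDIA driver; if it is missing or fails,
    # the driver itself is a likely suspect.
    if command -v nvidia-smi >/dev/null 2>&1; then
        if nvidia-smi; then
            echo "nvidia-smi: ok"
        else
            echo "nvidia-smi: failed"
        fi
    else
        echo "nvidia-smi: not found on PATH"
    fi
}

check_cuda_env
```

If nvidia-smi fails or is missing, the driver is the place to look before reinstalling cutorch. If it succeeds and the error persists, trying another CUDA application is the next step.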

Closing this as it doesn't seem relevant to this specific project.