jcjohnson / torch-rnn

Efficient, reusable RNNs and LSTMs for torch

Cannot serialise number: must not be NaN or Infinity

thehowl opened this issue

When running torch-rnn, I occasionally get the following error while saving checkpoints:

/home/howl/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Infinity
stack traceback:
	[C]: in function 'encode'
	./util/utils.lua:50: in function 'write_json'
	train.lua:234: in main chunk
	[C]: in function 'dofile'
	...l/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x55832286e450
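
For context, that error string is the one lua-cjson raises when asked to encode a non-finite number, and the 'encode' frame in the trace is a C function, so write_json in util/utils.lua appears to be calling into it. A minimal sketch of the failure, assuming lua-cjson is the serialiser:

local cjson = require 'cjson'

-- Finite numbers encode fine.
print(cjson.encode({loss = 0.5}))                        -- {"loss":0.5}

-- NaN and Inf are rejected by default.
local ok, err = pcall(cjson.encode, {loss = math.huge})
print(ok, err)  -- false   Cannot serialise number: must not be NaN or Infinity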

I'm running torch-cl with the following:

th train.lua -input_h5 ../data.h5 -input_json ../data.json -gpu_backend opencl -init_from cv/checkpoint_74000.t7 -reset_iterations 0

I added the last two options because the problem had already shown up in earlier runs.

The graphics card is an NVIDIA GeForce 620 OEM. I'm using OpenCL because getting CUDA to run on this machine seems close to impossible, or at least very hard (it's a Dell workstation with a setup similar to an NVIDIA Optimus laptop; I can find out the exact model if needed).

Running on Debian GNU/Linux sid (unstable).

As it turns out, the issue seems to be caused by an Inf value being appended to the training loss history (for some reason the loss calculation hits a division by zero). When the loss history is later encoded to JSON, the encoder encounters the Inf and throws the error above. If anyone else runs into this, here is the quick patch I applied in my local repo:

diff --git a/train.lua b/train.lua
index 52210ec..e11869b 100644
--- a/train.lua
+++ b/train.lua
@@ -185,7 +185,11 @@ for i = start_i + 1, num_iterations do
   -- Take a gradient step and maybe print
   -- Note that adam returns a singleton array of losses
   local _, loss = optim.adam(f, params, optim_config)
-  table.insert(train_loss_history, loss[1])
+  if loss[1] == math.huge or loss[1] == -math.huge or loss[1] ~= loss[1] then
+    print(string.format("Can't represent %f in JSON, so not adding to the training loss history", loss[1]))
+  else
+    table.insert(train_loss_history, loss[1])
+  end
   if opt.print_every > 0 and i % opt.print_every == 0 then
     local float_epoch = i / num_train + 1
     local msg = 'Epoch %.2f / %d, i = %d / %d, loss = %f'

Handles +Inf, -Inf and NaN.
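
A slightly more reusable variant of the same check, factored into a helper. This is only a sketch, not part of the upstream repo, and the is_finite name is my own:

-- Hypothetical helper, not part of torch-rnn: true only for finite numbers.
-- NaN is the only value that compares unequal to itself, and the two
-- math.huge comparisons catch +Inf and -Inf.
local function is_finite(x)
  return x == x and x ~= math.huge and x ~= -math.huge
end

-- The loop body from the patch above could then read:
--   if is_finite(loss[1]) then
--     table.insert(train_loss_history, loss[1])
--   else
--     print(string.format('Skipping non-finite loss %f, not adding to history', loss[1]))
--   end

Another option, if the installed lua-cjson version supports it, is cjson.encode_invalid_numbers(true), which makes encode emit NaN/Infinity tokens instead of raising; but that output is not strict JSON, so filtering non-finite values out before they reach the loss history is probably the safer fix.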