jcjohnson / torch-rnn

Efficient, reusable RNNs and LSTMs for torch

Cannot serialise number: must not be NaN or Infinity

thehowl opened this issue

When running torch-rnn, I occasionally get the following error while saving checkpoints:

/home/howl/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Infinity
stack traceback:
	[C]: in function 'encode'
	./util/utils.lua:50: in function 'write_json'
	train.lua:234: in main chunk
	[C]: in function 'dofile'
	...l/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
	[C]: at 0x55832286e450
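
For context, that error string is the one lua-cjson raises when asked to encode a non-finite number, and the 'encode' frame in the trace is a C function, so write_json in util/utils.lua appears to be calling into it. A minimal sketch of the failure, assuming lua-cjson is the serialiser:

local cjson = require 'cjson'

-- Finite numbers encode fine.
print(cjson.encode({loss = 0.5}))                        -- {"loss":0.5}

-- NaN and Inf are rejected by default.
local ok, err = pcall(cjson.encode, {loss = math.huge})
print(ok, err)  -- false   Cannot serialise number: must not be NaN or Infinity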

I'm running torch-cl with the following:

th train.lua -input_h5 ../data.h5 -input_json ../data.json -gpu_backend opencl -init_from cv/checkpoint_74000.t7 -reset_iterations 0

I added the last two options because the problem had already shown up in earlier runs.

The graphics card is an NVIDIA GeForce 620 OEM. I'm using OpenCL because getting CUDA to run on this machine seems close to impossible, or at least very hard (it's a Dell workstation with a setup similar to an NVIDIA Optimus laptop; I can find out the exact model if needed).

Running on Debian GNU/Linux sid (unstable).

As it turns out, the issue seems to be caused by an Inf value being appended to the training loss history (for some reason the loss calculation hits a division by zero). When the loss history is later encoded to JSON, the encoder encounters the Inf and throws the error above. If anyone else runs into this, here is the quick patch I applied in my local repo:

diff --git a/train.lua b/train.lua
index 52210ec..e11869b 100644
--- a/train.lua
+++ b/train.lua
@@ -185,7 +185,11 @@ for i = start_i + 1, num_iterations do
   -- Take a gradient step and maybe print
   -- Note that adam returns a singleton array of losses
   local _, loss = optim.adam(f, params, optim_config)
-  table.insert(train_loss_history, loss[1])
+  if loss[1] == math.huge or loss[1] == -math.huge or loss[1] ~= loss[1] then
+    print(string.format("Can't represent %f in JSON, so not adding to the training loss history", loss[1]))
+  else
+    table.insert(train_loss_history, loss[1])
+  end
   if opt.print_every > 0 and i % opt.print_every == 0 then
     local float_epoch = i / num_train + 1
     local msg = 'Epoch %.2f / %d, i = %d / %d, loss = %f'

Handles +Inf, -Inf and NaN.
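
A slightly more reusable variant of the same check, factored into a helper. This is only a sketch, not part of the upstream repo, and the is_finite name is my own:

-- Hypothetical helper, not part of torch-rnn: true only for finite numbers.
-- NaN is the only value that compares unequal to itself, and the two
-- math.huge comparisons catch +Inf and -Inf.
local function is_finite(x)
  return x == x and x ~= math.huge and x ~= -math.huge
end

-- The loop body from the patch above could then read:
--   if is_finite(loss[1]) then
--     table.insert(train_loss_history, loss[1])
--   else
--     print(string.format('Skipping non-finite loss %f, not adding to history', loss[1]))
--   end

Another option, if the installed lua-cjson version supports it, is cjson.encode_invalid_numbers(true), which makes encode emit NaN/Infinity tokens instead of raising; but that output is not strict JSON, so filtering non-finite values out before they reach the loss history is probably the safer fix.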