Cannot serialise number: must not be NaN or Infinity
thehowl opened this issue
Running torch-rnn, I occasionally get this error when saving checkpoints:
/home/howl/torch-cl/install/bin/luajit: ./util/utils.lua:50: Cannot serialise number: must not be NaN or Infinity
stack traceback:
[C]: in function 'encode'
./util/utils.lua:50: in function 'write_json'
train.lua:234: in main chunk
[C]: in function 'dofile'
...l/torch-cl/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x55832286e450
I'm running torch-cl with the following:
th train.lua -input_h5 ../data.h5 -input_json ../data.json -gpu_backend opencl -init_from cv/checkpoint_74000.t7 -reset_iterations 0
The last two options were added because I had the problem already in previous runs.
Graphics card is an NVIDIA GeForce 620 OEM. I'm using OpenCL because getting CUDA running on my machine seems close to impossible, or at least very hard (it's something like an NVIDIA Optimus laptop, except it's a Dell workstation; I can find out the model if needed).
Running on Debian GNU/Linux sid (unstable).
As it turns out, this issue seems to be caused by an Inf being added to the loss history (so for some reason the loss calculation hits a division by zero). When the loss history is JSON-encoded, the encoder encounters the Inf and throws the error. If anyone else runs into this, here is the quick patch I applied in my local repo:
diff --git a/train.lua b/train.lua
index 52210ec..e11869b 100644
--- a/train.lua
+++ b/train.lua
@@ -185,7 +185,11 @@ for i = start_i + 1, num_iterations do
-- Take a gradient step and maybe print
-- Note that adam returns a singleton array of losses
local _, loss = optim.adam(f, params, optim_config)
- table.insert(train_loss_history, loss[1])
+ if loss[1] == math.huge or loss[1] == -math.huge or loss[1] ~= loss[1] then
+ print(string.format("Can't represent %f in JSON, so not adding to the training loss history", loss[1]))
+ else
+ table.insert(train_loss_history, loss[1])
+ end
if opt.print_every > 0 and i % opt.print_every == 0 then
local float_epoch = i / num_train + 1
local msg = 'Epoch %.2f / %d, i = %d / %d, loss = %f'
Handles +Inf, -Inf and NaN.
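The check in the patch can also be factored into a small standalone helper. This is just a sketch; the function name `is_json_safe` is my own and is not part of torch-rnn. It relies on two properties of IEEE 754 floats in Lua: NaN is the only value not equal to itself, and `math.huge` compares equal to +Inf.

```lua
-- Sketch: guard a number before handing it to a JSON encoder.
-- is_json_safe is a hypothetical helper name, not part of torch-rnn.
local function is_json_safe(x)
  -- x ~= x is true only for NaN; math.huge is +Inf.
  return x == x and x ~= math.huge and x ~= -math.huge
end

assert(is_json_safe(1.5))
assert(not is_json_safe(0/0))   -- NaN
assert(not is_json_safe(1/0))   -- +Inf
assert(not is_json_safe(-1/0))  -- -Inf
```

With this helper, the patched line in train.lua would reduce to `if is_json_safe(loss[1]) then table.insert(train_loss_history, loss[1]) end`, though skipping values silently also hides the underlying divergence, so printing a warning as the patch does is probably the better choice.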