scoring error
sameerkhurana10 opened this issue
Hi,
I am getting the following error while calculating perplexity on the test set:
Mapped name None to device cuda: GeForce GTX TITAN Black (0000:03:00.0)
2018-04-09 11:44:37,415 exception_handler: An unexpected KeyError exception occurred: 'Unable to get link info (bad symbol table node signature)'
Traceback will be written to debug log (enable with --log-level debug).
srun: error: sls-titan-0: task 0: Exited with exit code 2
(theano-lm) sameerk@sls-415-1:/data/sls/qcri/asr/sameer_v1/asr/kaldi-forked/kaldi/egs/mit_qcri/s5_language_modeling/theanolm/recipes/arabic$ srun -p gpu --gres=gpu:1 theanolm score exp/blstm256_voc80k_blstm/nnlm.h5 data/rnnlm_data_all/test.dat --output perplexity --log-level debug
2018-04-09 12:38:40,288 get_default_device: Context None device="GeForce GTX TITAN Black" ID="0000:03:00.0"
2018-04-09 12:38:40,291 from_file: Reading vocabulary from network state.
/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using cuDNN version 6021 on context None
Mapped name None to device cuda: GeForce GTX TITAN Black (0000:03:00.0)
2018-04-09 12:38:40,292 exception_handler: An unexpected KeyError exception occurred: 'Unable to get link info (bad symbol table node signature)'
Traceback will be written to debug log (enable with --log-level debug).
2018-04-09 12:38:40,293 exception_handler: Traceback:
2018-04-09 12:38:40,339 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/bin/theanolm", line 147, in <module>
main()
2018-04-09 12:38:40,339 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/bin/theanolm", line 88, in main
args.command_function(args)
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/theanolm/commands/score.py", line 114, in score
default_device=default_device)
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/theanolm/network/network.py", line 280, in from_file
vocabulary = Vocabulary.from_state(state)
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/theanolm/vocabulary/vocabulary.py", line 289, in from_state
if 'words' not in h5_vocabulary:
2018-04-09 12:38:40,340 exception_handler: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,340 exception_handler: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,340 exception_handler: File "/data/sls/u/sameerk/anaconda3/envs/theano-lm/lib/python3.5/site-packages/h5py/_hl/group.py", line 319, in __contains__
return self._e(name) in self.id
2018-04-09 12:38:40,340 exception_handler: File "h5py/h5g.pyx", line 441, in h5py.h5g.GroupID.__contains__
2018-04-09 12:38:40,340 exception_handler: File "h5py/h5g.pyx", line 442, in h5py.h5g.GroupID.__contains__
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/h5g.pyx", line 511, in h5py.h5g._path_valid
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
2018-04-09 12:38:40,341 exception_handler: File "h5py/h5l.pyx", line 212, in h5py.h5l.LinkProxy.exists
srun: error: sls-titan-0: task 0: Exited with exit code 2
score command:
srun -p gpu --gres=gpu:1 theanolm score exp/blstm256_voc80k_blstm/nnlm.h5 data/rnnlm_data_all/test.dat --output perplexity --log-level debug
train command:
theanolm train exp/blstm256_voc80k_blstm/nnlm.h5 --training-set data/rnnlm_data_all/transcript.dat --vocabulary data/rnnlm_data_all/input_80000.vocab --vocabulary-format words --sequence-length 25 --batch-size 32 --optimization-method adagrad --stopping-criterion no-improvement --cost cross-entropy --learning-rate 1 --gradient-decay-rate 0.9 --numerical-stability-term 1e-6 --num-noise-samples 1 --noise-distribution unigram --noise-dampening 0.5 --validation-frequency 1 --patience 0 --min-epochs 1 --max-epochs 15 --random-seed 1 --log-level debug --log-interval 200 --gradient-normalization 5 --architecture ../architectures/word-blstm256.arch --validation-file data/rnnlm_data_all/dev.dat
Just checking the size of the model: 161k. Looks suspiciously small.
What does this error mean?
It seems that the model is corrupted. Looks like the HDF5 library throws a KeyError when trying to read the vocabulary from the model. So the problem is in training, not scoring. Is there something suspicious in the train log?
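One way to confirm the corruption independently of TheanoLM is to open the model file with h5py and walk every object in it. The following is only a minimal sketch, not part of TheanoLM: the function name `check_hdf5` is made up for illustration, and the set of exceptions caught is an assumption based on the KeyError in the traceback above and the OSError that h5py raises for a file it cannot open.

```python
import sys

import h5py


def check_hdf5(path):
    """Return (ok, message) after trying to visit every object in the file.

    Hypothetical helper for illustration; not a TheanoLM function.
    """
    names = []
    try:
        with h5py.File(path, "r") as h5_file:
            # visit() follows the links in every group recursively, so a
            # damaged group should fail here instead of later during scoring.
            h5_file.visit(names.append)
    except (OSError, KeyError, RuntimeError, ValueError) as error:
        # Assumption: these are the exception types a corrupted file raises.
        return False, "{}: {}".format(type(error).__name__, error)
    return True, "{} objects readable".format(len(names))


if __name__ == "__main__":
    ok, message = check_hdf5(sys.argv[1])
    print(message)
    sys.exit(0 if ok else 1)
```

Saved under a name of your choice (say `check_hdf5.py`, which is hypothetical), it could be run as `python check_hdf5.py exp/blstm256_voc80k_blstm/nnlm.h5`. If it reports an error for this model but not for the others, that would point to the file having been damaged when it was written, not to the score command.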
Probably right. Other models are fine. I got a bus error for this model, so I think this has nothing to do with TheanoLM.