taolei87 / rcnn

Recurrent & convolutional neural network modules

When running main.py in "qa", I get p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

gailysun opened this issue · comments

Hi taolei,
I am trying to use the code in "pt" and "qa" to pretrain and fine-tune your rcnn model. First, I used the pt code to pretrain the rcnn model. I set hidden-dim = 200; the other arguments are set as you suggest. The pretrained model is saved in model.pkl.gz.pkl.gz and its gunzipped file size is 1.3M. Is this file size reasonable for the pretrained model?
Then I used the code in "qa" to fine-tune the rcnn model with the pretrained model, and got the following output:
0 empty titles ignored.
100406 pre-trained embeddings loaded.
vocab size=100408, corpus size=167765
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
VisibleDeprecationWarning)
23.4045739174 to create batches
315 batches, 35312679 tokens in total, 360602 triples in total
h_title dtype: float64
h_avg_title dtype: float64
h_final dtype: float64
num of parameters: 160400
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']
The problem is that p_norm comes out as ['nan', 'nan', 'nan', 'nan', 'nan']. The only difference is that I set hidden-dim = 200; the other arguments are set as you suggest. Could you tell me why p_norm is "nan"? I have looked at your code, but I still cannot figure out the problem. Looking forward to your help. Thank you very much.

Hi @gailysun

Do you have the log of the pretraining run? I think it would be helpful to see the exact running options and the training information.

Hi taolei,
The following is my pre-training log:

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 4007)
Namespace(activation='tanh', batch_size=256, corpus='/data1/gailsun/qa/data/text_tokenized.txt', cut_off=1, depth=1, dev='', dropout=0.1, embeddings='/data1/gailsun/qa/data/vector/vectors_pruned.200.txt', heldout='/data1/gailsun/qa/data/train_random.txt', hidden_dim=200, l2_reg=1e-05, layer='rcnn', learning='adam', learning_rate=0.001, max_epoch=50, max_seq_len=100, mode=1, model='model.pkl.gz', normalize=1, order=2, outgate=0, reweight=1, test='', train='/data1/gailsun/qa/data/train_random.txt', use_anno=1, use_body=1, use_title=1)

0 empty titles ignored.
100406 pre-trained embeddings loaded.
vocab size=100410, corpus size=167765
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
VisibleDeprecationWarning)
heldout examples=139570
2.94957613945 to create batches
num of parameters: 20503210
p_norm: ['5.773', '5.777', '8.155', '0.402', '0.393', '5.777', '5.771', '8.166', '0.415', '0.428', '0.000', '9.131']
[progress 0/111 ... 110/111] model saved.

Hi @taolei87,
Another difference is that I set THEANO_FLAGS='device=gpu,floatX=float64'. Will this affect the result?
The following is the current fine-tuning log. p_norm is always "nan", and the evaluation metrics do not change.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 4007)
Namespace(activation='tanh', average=0, batch_size=40, corpus='/data1/gailsun/qa/data/text_tokenized.txt', cut_off=1, depth=1, dev='/data1/gailsun/qa/data/dev.txt', dropout=0.1, embeddings='/data1/gailsun/qa/data/vector/vectors_pruned.200.txt', hidden_dim=200, l2_reg=1e-05, layer='rcnn', learning='adam', learning_rate=0.001, load_pretrain='/data1/gailsun/qa/code/pt/model.pkl.gz.pkl.gz', max_epoch=50, max_seq_len=100, mode=1, normalize=1, order=2, outgate=0, reweight=1, save_model='model_d200_qa', test='/data1/gailsun/qa/data/test.txt', train='/data1/gailsun/qa/data/train_random.txt')

0 empty titles ignored.
100406 pre-trained embeddings loaded.
vocab size=100408, corpus size=167765
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
VisibleDeprecationWarning)
23.4045739174 to create batches
315 batches, 35312679 tokens in total, 360602 triples in total
h_title dtype: float64
h_avg_title dtype: float64
h_final dtype: float64
num of parameters: 160400
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

Epoch 0 cost=nan loss=nan MRR=63.39,63.39 |g|=nan [58.735m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+

Epoch 1 cost=nan loss=nan MRR=63.39,63.39 |g|=nan [58.200m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+

The pnorm is the L2 norm of the parameters. In the fine-tuning log, the pnorm is NaN right after loading the model:
num of parameters: 160400 p_norm: ['nan', 'nan', 'nan', 'nan', 'nan']

This means the pre-training did not run correctly or hit some error. During pre-training, I also print out diagnostic information such as the pnorms (here), which seems to be missing from the log you showed me.
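Conceptually, these are just the L2 norms of each parameter matrix. A minimal NumPy sketch of how such a list could be computed (illustrative only, not the exact helper in the repo):

```python
import numpy as np

def param_norms(params):
    # Return the L2 norm of each parameter, formatted like the log output.
    # `params` is assumed to be a list of NumPy arrays or Theano shared
    # variables exposing .get_value(); the names here are illustrative.
    norms = []
    for p in params:
        value = p.get_value(borrow=True) if hasattr(p, "get_value") else p
        norms.append("%.3f" % np.linalg.norm(value))
    return norms

# Any NaN entry in a parameter makes its norm print as 'nan':
w = np.ones((3, 3))
w[0, 0] = np.nan
print(param_norms([np.ones((2, 2)), w]))   # ['2.000', 'nan']
```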

Could you attach or send me the full log of the pre-training run? I also see that the dev set is empty (the --dev option). The model saving logic is inside the dev evaluation part (here).
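In other words, the structure is roughly the following (a simplified sketch with hypothetical names, not the actual code in the repo): if the dev set is empty, the evaluation branch never runs, and the model-saving call nested inside it is skipped.

```python
def run_epoch(model, train_batches, dev_batches, save_path):
    # Simplified sketch: dev evaluation guards model saving.
    for batch in train_batches:
        model.train_step(batch)

    if dev_batches:                       # only entered when --dev is given
        dev_mrr = model.evaluate(dev_batches)
        if dev_mrr > model.best_dev_mrr:  # save only on dev improvement
            model.best_dev_mrr = dev_mrr
            model.save(save_path)         # saving lives inside this branch
```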

I'd better use "float32" by default. Most GPUs only support float32, and it seems Theano doesn't support float64 in GPU mode. Here's what I found on this webpage:

You will also need to set floatX to be float32, along with your path to CUDA. Theano does not yet support float64 (it will soon), so float32 must, for now, be assigned to floatX.
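For example, one minimal way to force float32 before Theano is imported (an illustrative sketch; equivalently this can go on the command line as THEANO_FLAGS=... or in ~/.theanorc):

```python
import os

# Make sure this runs before `import theano` anywhere in the process.
# Add device=gpu (as in your runs) if you want the GPU backend.
os.environ["THEANO_FLAGS"] = "floatX=float32"

import theano
print(theano.config.floatX)   # should print 'float32'
```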

Hi @taolei87,
I really appreciate your prompt reply. Thank you very much. The following is the current pre-training log, where the arguments are set as you suggested. During pre-training, p_norm still becomes "nan". Hope you can help. Thank you very much.

Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 4007)
Namespace(activation='tanh', batch_size=256, corpus='/data1/gailsun/qa/data/text_tokenized.txt', cut_off=1, depth=1, dev='/data1/gailsun/qa/data/dev.txt', dropout=0.1, embeddings='/data1/gailsun/qa/data/vector/vectors_pruned.200.txt', heldout='/data1/gailsun/qa/data/heldout.txt', hidden_dim=400, l2_reg=1e-05, layer='rcnn', learning='adam', learning_rate=0.001, max_epoch=50, max_seq_len=100, mode=1, model='model_pt_d400', normalize=1, order=2, outgate=0, reweight=1, test='/data1/gailsun/qa/data/test.txt', train='/data1/gailsun/qa/data/train_random.txt', use_anno=1, use_body=1, use_title=1)

0 empty titles ignored.
WARNING: n_d (400) != init word vector size (200). Use 200 instead.
100406 pre-trained embeddings loaded.
vocab size=100410, corpus size=167765
/usr/lib64/python2.7/site-packages/numpy/core/fromnumeric.py:2652: VisibleDeprecationWarning: rank is deprecated; use the ndim attribute or function instead. To find the rank of a matrix see numpy.linalg.matrix_rank.
VisibleDeprecationWarning)
heldout examples=1989
3.02918314934 to create batches
num of parameters: 41066010
p_norm: ['8.165', '8.170', '14.155', '0.553', '0.562', '8.160', '8.164', '14.104', '0.602', '0.598', '0.000', '9.128']
[progress 0/732 ... 730/732] model saved.

Epoch 0 cost=nan loss=nan nan MRR=63.39,63.39 PPL=nan |g|=nan [39.961m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+

Epoch 1 cost=nan loss=nan nan MRR=63.39,63.39 PPL=nan |g|=nan [43.745m]
p_norm: ['nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan', 'nan']

+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| Epoch | dev MAP | dev MRR | dev P@1 | dev P@5 | tst MAP | tst MRR | tst P@1 | tst P@5 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+
| 0 | 44.87 | 63.39 | 51.85 | 31.01 | 42.81 | 62.98 | 53.76 | 26.99 |
+-------+---------+---------+---------+---------+---------+---------+---------+---------+

Hi @gailysun

The training options look fine to me. I used to see the NaN issue at some point, but it disappeared after switching the Theano version.

The version on my machine is: 0.7.0.dev-8d3a67b73fda49350d9944c9a24fc9660131861c; but I think 0.8.0 should also work.

What's your Theano version? It's a bit late in Boston time now. I can try your version on my machine later.

Hi @taolei87,
My Theano version is 0.8.2. Thank you very much.

@gailysun The error seems to come from a later commit I did on parameter initialization. See here.

Could you try changing "0.00" to "0.001"? The NaN issue disappeared on my machine after this fix.
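Roughly, the change amounts to using a small nonzero scale instead of zero when initializing the weights. A hypothetical illustration (the actual commit differs in details):

```python
import numpy as np

rng = np.random.RandomState(1234)

# Hypothetical illustration of the fix: the scale applied to the initial
# weight matrix W_val changes from 0.00 to 0.001, so the weights start as
# small random values instead of exact zeros.
def init_weights(n_in, n_out, scale=0.001):   # previously scale = 0.00
    return (rng.standard_normal((n_in, n_out)) * scale).astype("float32")

W_val = init_weights(200, 200)
print(W_val.std())   # about 0.001, rather than exactly 0
```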

Hi @taolei87,
Yes, after changing W_val to 0.001, the code runs successfully. Thank you very much.