nfmcclure / tensorflow_cookbook

Code for the TensorFlow Machine Learning Cookbook

Home Page: https://www.packtpub.com/big-data-and-business-intelligence/tensorflow-machine-learning-cookbook-second-edition

After the first training phase, the document embeddings don't appear to be trained

joonable opened this issue · comments

Hello. I'm trying to use the doc2vec algorithm in 07_Natural_Language_Processing/07_Sentiment_Analysis_With_Doc2Vec/07_sentiment_with_doc2vec.py.

My understanding is that the first training phase trains the word and document embeddings, and the second one trains a text classifier for sentiment analysis. Since I needed distributed representations of words and documents rather than a classifier, I only ran the first phase.

After the training, I evaluated the vectors in the word and document embeddings using tf.train.Saver, and found that the document embeddings didn't change while the word embeddings did. The document embeddings just stayed at their initial values.
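For reference, the check looked roughly like this (a sketch; `embeddings` and `doc_embeddings` are the variable names from the recipe, and the checkpoint path is just an example):

```python
import tensorflow as tf

# Save the two embedding variables after the first training loop.
saver = tf.train.Saver({'embeddings': embeddings,
                        'doc_embeddings': doc_embeddings})
saver.save(sess, 'temp/doc2vec_embeddings.ckpt')

# Restore later and compare against the initial values.
saver.restore(sess, 'temp/doc2vec_embeddings.ckpt')
word_vecs, doc_vecs = sess.run([embeddings, doc_embeddings])
```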

Did I misunderstand the code or the doc2vec algorithm, or is there some kind of bug in the code? Thank you in advance for your answer.

commented

Hi @joonable ,
Thanks for asking.

I just checked this out briefly. I think I'll need more information, as I cannot replicate the problem. For example, if I run the code through the variable initialization and create a feed dictionary, then run the following commands:

```
In[39]: sess.run(doc_embed, feed_dict=feed_dict)
Out[39]: 
array([[[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       [[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       [[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       ...,

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]],

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]],

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]]], dtype=float32)
In[40]: sess.run(train_step, feed_dict=feed_dict)
In[41]: sess.run(doc_embed, feed_dict=feed_dict)
Out[41]: 
array([[[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       [[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       [[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       ...,

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]],

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]],

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]]], dtype=float32)
```

This shows me that the variable doc_embed is changing due to the training. Are you seeing something different? If so, make sure you have the most up-to-date code, and also let me know your Python and TensorFlow versions.

I'll continue to troubleshoot with you if you see something different. I think the next step would be to fix a random seed for TensorFlow and NumPy and see what we can do, assuming we have the same versions of everything. For reference, I'm running Python 3.6 and TensorFlow v1.10.1.
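For the record, fixing the seeds would look something like this (a minimal sketch; the seed value 42 is arbitrary, and both calls should come before any variables are created):

```python
import numpy as np
import tensorflow as tf

# Fix both seeds before building the graph so two runs are comparable.
np.random.seed(42)
tf.set_random_seed(42)
```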

Thanks.

I checked it the way you suggested, and there is indeed a difference after training. Many apologies.
I've added the code below.

```
In[32]: doc_origin = doc_embeddings.eval(sess)
In[33]: for i in range(5000): sess.run(train_step, feed_dict=feed_dict)
In[34]: doc_eval = doc_embeddings.eval(sess)
In[35]: doc_origin - doc_eval
Out[35]:
array([[ 7.57873058e-05,  5.39273024e-05, -3.54051590e-05, ...,
         3.53846699e-05,  4.13656235e-05, -6.90221786e-05],
       [-1.03056431e-04,  3.75509262e-06,  6.04987144e-05, ...,
        -3.19242477e-04, -1.00910664e-04, -1.72302127e-04],
       [-1.60932541e-06,  3.51667404e-06, -8.94069672e-07, ...,
        -2.14576721e-06, -4.35113907e-06,  1.75833702e-06],
       ...,
       [-1.67489052e-04,  1.15454197e-04,  2.23517418e-05, ...,
        -2.02655792e-06, -8.34465027e-06,  5.33461571e-05],
       [-4.18424606e-05, -1.31400302e-05, -2.86102295e-05, ...,
         1.12056732e-05, -6.37024641e-06,  4.05311584e-06],
       [-1.25169754e-05, -2.87890434e-05,  1.23977661e-05, ...,
        -1.21593475e-05, -6.26444817e-05,  5.59091568e-05]], dtype=float32)
```

It's not really about troubleshooting, but I do have a problem to solve. I'm using doc2vec to cluster unlabelled documents. However, as you can see, the difference is so small that the embeddings essentially stay at their random.uniform initialisation. I've trained them for enough iterations that the per-step loss no longer decreases.

Even after more than 200K iterations, the doc embeddings haven't changed much.

```
In[36]: for i in range(200000): sess.run(train_step, feed_dict=feed_dict)
In[37]: doc_eval_200K = doc_embeddings.eval(sess)
In[38]: doc_origin
Out[38]:
array([[-0.40346146, -0.22738123,  0.6981292 , ...,  0.02518272,
         0.6519067 ,  0.5756016 ],
       [-0.71823335,  0.9682684 , -0.47529078, ..., -0.44264603,
        -0.84275126,  0.1408112 ],
       [-0.91523314,  0.63673115,  0.33543396, ..., -0.635123  ,
         0.8932848 , -0.0469408 ],
       ...,
       [-0.95611143,  0.63165283,  0.20844555, ..., -0.95574784,
         0.803643  ,  0.8626468 ],
       [-0.87971663, -0.00883818,  0.8690052 , ..., -0.9107895 ,
         0.11327219,  0.52236867],
       [ 0.9117298 ,  0.5722585 ,  0.87356305, ..., -0.65226054,
        -0.31751704, -0.7709594 ]], dtype=float32)
In[39]: doc_eval_200K
Out[39]:
array([[-0.40350893, -0.22704063,  0.6981595 , ...,  0.02526901,
         0.6520689 ,  0.57503295],
       [-0.71688116,  0.9653056 , -0.47172707, ..., -0.44319224,
        -0.83652633,  0.13944209],
       [-0.91519636,  0.6366936 ,  0.335434  , ..., -0.6351552 ,
         0.89335924, -0.04687748],
       ...,
       [-0.9555648 ,  0.6306714 ,  0.20914698, ..., -0.955652  ,
         0.8043847 ,  0.86161727],
       [-0.8796184 , -0.00869403,  0.8691123 , ..., -0.91070646,
         0.11326376,  0.52240765],
       [ 0.91199374,  0.57255834,  0.8732707 , ..., -0.65196055,
        -0.3172496 , -0.7709833 ]], dtype=float32)
```
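To put a number on it (a quick NumPy sketch using the arrays evaluated above):

```python
import numpy as np

# Relative change of the doc embeddings after 200K steps.
# A value near zero means they are still essentially at their
# random.uniform initialisation.
drift = np.linalg.norm(doc_eval_200K - doc_origin) / np.linalg.norm(doc_origin)
print(drift)
```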

When I use gensim, I can see a clear difference, but I need to use TensorFlow for my research so that I can modify the algorithm. Any advice would be very helpful. Thank you.

commented

Hi @joonable ,
They do change very slowly, I agree. You can try a few things:

  • Try increasing the learning rate. You'll have to be careful here, because increasing the learning rate can make the algorithm fail to converge as a whole. If you increase the learning rate, you may also want to incorporate some sort of decaying learning rate schedule.
  • If the above isn't working well, you can also increase the learning rate on just one layer (the document embedding variable in your case). For an example of how to do this, see https://stackoverflow.com/questions/34945554/how-to-set-layer-wise-learning-rate-in-tensorflow; there's a sketch of both ideas after this list.
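For example, the two suggestions combined might look like the following sketch, along the lines of that StackOverflow answer. Here `loss` and `doc_embeddings` stand in for the recipe's loss op and doc-embedding variable, and all the learning-rate numbers are placeholders to tune:

```python
import tensorflow as tf

# Global step drives the decay schedules.
global_step = tf.Variable(0, trainable=False)

# Decaying learning rates: a larger one for the doc embeddings,
# a smaller one for everything else (values are placeholders).
doc_lr = tf.train.exponential_decay(0.5, global_step,
                                    decay_steps=10000, decay_rate=0.95)
base_lr = tf.train.exponential_decay(0.05, global_step,
                                     decay_steps=10000, decay_rate=0.95)

# Split the trainable variables into the doc embedding and the rest.
doc_vars = [doc_embeddings]
other_vars = [v for v in tf.trainable_variables() if v is not doc_embeddings]

# One optimizer per group, applied together as a single train step.
doc_opt = tf.train.GradientDescentOptimizer(doc_lr)
other_opt = tf.train.GradientDescentOptimizer(base_lr)

grads = tf.gradients(loss, doc_vars + other_vars)
doc_grads, other_grads = grads[:len(doc_vars)], grads[len(doc_vars):]

# Pass global_step to only one apply_gradients so it increments
# once per training step.
train_step = tf.group(
    doc_opt.apply_gradients(zip(doc_grads, doc_vars),
                            global_step=global_step),
    other_opt.apply_gradients(zip(other_grads, other_vars)))
```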

I hope that helps!