nfmcclure / tensorflow_cookbook

Code for the TensorFlow Machine Learning Cookbook

Home Page: https://www.packtpub.com/big-data-and-business-intelligence/tensorflow-machine-learning-cookbook-second-edition

After the first training phase, the document embeddings don't appear to be trained

joonable opened this issue · comments

Hello. I'm trying to use the doc2vec algorithm in 07_Natural_Language_Processing/07_Sentiment_Analysis_With_Doc2Vec/07_sentiment_with_doc2vec.py.

My understanding is that the first training phase trains the word and document embeddings, and the second one trains a text classifier for sentiment analysis. Since I needed distributed representations of words and documents rather than a classifier, I only ran the first phase.

After the training, I evaluated the vectors in the word and document embeddings using tf.train.Saver, and found that the document embeddings didn't change while the word embeddings did. The document embeddings just stayed at their initial values.
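For reference, the check looked roughly like this (a sketch; `embeddings` and `doc_embeddings` are the variable names from the recipe, and the checkpoint path is just an example):

```python
import tensorflow as tf

# Save the two embedding variables after the first training loop.
saver = tf.train.Saver({'embeddings': embeddings,
                        'doc_embeddings': doc_embeddings})
saver.save(sess, 'temp/doc2vec_embeddings.ckpt')

# Restore later and compare against the initial values.
saver.restore(sess, 'temp/doc2vec_embeddings.ckpt')
word_vecs, doc_vecs = sess.run([embeddings, doc_embeddings])
```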

Did I misunderstand the code or the doc2vec algorithm, or is there some kind of bug in the code? Thank you in advance for your answer.

commented

Hi @joonable ,
Thanks for asking.

I just checked this out briefly. I think I'll need more information, as I cannot replicate the problem. For example, if I run the code through the variable initialization and create a feed dictionary, then run the following commands:

```
In[39]: sess.run(doc_embed, feed_dict=feed_dict)
Out[39]: 
array([[[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       [[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       [[ 0.36113167, -0.42523894,  0.08636531, ...,  0.9411001 ,
         -0.8095024 , -0.38859203]],

       ...,

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]],

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]],

       [[ 0.7726636 , -0.4221473 , -0.28463227, ..., -0.00291947,
          0.49912193, -0.26189896]]], dtype=float32)
In[40]: sess.run(train_step, feed_dict=feed_dict)
In[41]: sess.run(doc_embed, feed_dict=feed_dict)
Out[41]: 
array([[[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       [[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       [[ 0.3611314 , -0.42523894,  0.08636572, ...,  0.94110006,
         -0.8095023 , -0.38859165]],

       ...,

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]],

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]],

       [[ 0.7726636 , -0.42214715, -0.2846323 , ..., -0.00291951,
          0.49912196, -0.26189905]]], dtype=float32)
```

This shows me that the variable doc_embed is changing due to the training. Are you seeing something different? If so, make sure you have the most up-to-date code, and also let me know your Python and TensorFlow versions.

I'll continue to troubleshoot with you if you see something different. I think the next step would be to fix a random seed for TensorFlow and NumPy and see what we can do, assuming we have the same versions of everything. For reference, I'm running Python 3.6 and TensorFlow v1.10.1.
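For the record, fixing the seeds would look something like this (a minimal sketch; the seed value 42 is arbitrary, and both calls should come before any variables are created):

```python
import numpy as np
import tensorflow as tf

# Fix both seeds before building the graph so two runs are comparable.
np.random.seed(42)
tf.set_random_seed(42)
```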

Thanks.

I checked it the way you suggested, and there is indeed a difference after training. Many apologies.
I've added the code below.

```
In[32]: doc_origin = doc_embeddings.eval(sess)
In[33]: for i in range(5000): sess.run(train_step, feed_dict=feed_dict)
In[34]: doc_eval = doc_embeddings.eval(sess)
In[35]: doc_origin - doc_eval
Out[35]:
array([[ 7.57873058e-05,  5.39273024e-05, -3.54051590e-05, ...,
         3.53846699e-05,  4.13656235e-05, -6.90221786e-05],
       [-1.03056431e-04,  3.75509262e-06,  6.04987144e-05, ...,
        -3.19242477e-04, -1.00910664e-04, -1.72302127e-04],
       [-1.60932541e-06,  3.51667404e-06, -8.94069672e-07, ...,
        -2.14576721e-06, -4.35113907e-06,  1.75833702e-06],
       ...,
       [-1.67489052e-04,  1.15454197e-04,  2.23517418e-05, ...,
        -2.02655792e-06, -8.34465027e-06,  5.33461571e-05],
       [-4.18424606e-05, -1.31400302e-05, -2.86102295e-05, ...,
         1.12056732e-05, -6.37024641e-06,  4.05311584e-06],
       [-1.25169754e-05, -2.87890434e-05,  1.23977661e-05, ...,
        -1.21593475e-05, -6.26444817e-05,  5.59091568e-05]], dtype=float32)
```

It's not really about troubleshooting, but I do have a problem to solve. I'm using doc2vec to cluster unlabelled documents. However, as you can see, the difference is so small that the embeddings essentially stay at their random.uniform initialisation. I've trained them for enough iterations that the per-step loss no longer decreases.

Even after more than 200K iterations, the doc embeddings haven't changed much.

```
In[36]: for i in range(200000): sess.run(train_step, feed_dict=feed_dict)
In[37]: doc_eval_200K = doc_embeddings.eval(sess)
In[38]: doc_origin
Out[38]:
array([[-0.40346146, -0.22738123,  0.6981292 , ...,  0.02518272,
         0.6519067 ,  0.5756016 ],
       [-0.71823335,  0.9682684 , -0.47529078, ..., -0.44264603,
        -0.84275126,  0.1408112 ],
       [-0.91523314,  0.63673115,  0.33543396, ..., -0.635123  ,
         0.8932848 , -0.0469408 ],
       ...,
       [-0.95611143,  0.63165283,  0.20844555, ..., -0.95574784,
         0.803643  ,  0.8626468 ],
       [-0.87971663, -0.00883818,  0.8690052 , ..., -0.9107895 ,
         0.11327219,  0.52236867],
       [ 0.9117298 ,  0.5722585 ,  0.87356305, ..., -0.65226054,
        -0.31751704, -0.7709594 ]], dtype=float32)
In[39]: doc_eval_200K
Out[39]:
array([[-0.40350893, -0.22704063,  0.6981595 , ...,  0.02526901,
         0.6520689 ,  0.57503295],
       [-0.71688116,  0.9653056 , -0.47172707, ..., -0.44319224,
        -0.83652633,  0.13944209],
       [-0.91519636,  0.6366936 ,  0.335434  , ..., -0.6351552 ,
         0.89335924, -0.04687748],
       ...,
       [-0.9555648 ,  0.6306714 ,  0.20914698, ..., -0.955652  ,
         0.8043847 ,  0.86161727],
       [-0.8796184 , -0.00869403,  0.8691123 , ..., -0.91070646,
         0.11326376,  0.52240765],
       [ 0.91199374,  0.57255834,  0.8732707 , ..., -0.65196055,
        -0.3172496 , -0.7709833 ]], dtype=float32)
```
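To put a number on it (a quick NumPy sketch using the arrays evaluated above):

```python
import numpy as np

# Relative change of the doc embeddings after 200K steps.
# A value near zero means they are still essentially at their
# random.uniform initialisation.
drift = np.linalg.norm(doc_eval_200K - doc_origin) / np.linalg.norm(doc_origin)
print(drift)
```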

When I use gensim, I can see a clear difference, but I need to use TensorFlow for my research so that I can modify the algorithm. Any advice would be very helpful. Thank you.

commented

Hi @joonable ,
They do change very slowly, I agree. You can try a few things:

  • Try increasing the learning rate. You'll have to be careful here, because increasing the learning rate can make the algorithm fail to converge as a whole. If you increase the learning rate, you may also want to incorporate some sort of decaying learning rate schedule.
  • If the above isn't working well, you can also increase the learning rate on just one layer (the document embedding variable in your case). For an example of how to do this, see https://stackoverflow.com/questions/34945554/how-to-set-layer-wise-learning-rate-in-tensorflow; there's a sketch of both ideas after this list.
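For example, the two suggestions combined might look like the following sketch, along the lines of that StackOverflow answer. Here `loss` and `doc_embeddings` stand in for the recipe's loss op and doc-embedding variable, and all the learning-rate numbers are placeholders to tune:

```python
import tensorflow as tf

# Global step drives the decay schedules.
global_step = tf.Variable(0, trainable=False)

# Decaying learning rates: a larger one for the doc embeddings,
# a smaller one for everything else (values are placeholders).
doc_lr = tf.train.exponential_decay(0.5, global_step,
                                    decay_steps=10000, decay_rate=0.95)
base_lr = tf.train.exponential_decay(0.05, global_step,
                                     decay_steps=10000, decay_rate=0.95)

# Split the trainable variables into the doc embedding and the rest.
doc_vars = [doc_embeddings]
other_vars = [v for v in tf.trainable_variables() if v is not doc_embeddings]

# One optimizer per group, applied together as a single train step.
doc_opt = tf.train.GradientDescentOptimizer(doc_lr)
other_opt = tf.train.GradientDescentOptimizer(base_lr)

grads = tf.gradients(loss, doc_vars + other_vars)
doc_grads, other_grads = grads[:len(doc_vars)], grads[len(doc_vars):]

# Pass global_step to only one apply_gradients so it increments
# once per training step.
train_step = tf.group(
    doc_opt.apply_gradients(zip(doc_grads, doc_vars),
                            global_step=global_step),
    other_opt.apply_gradients(zip(other_grads, other_vars)))
```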

I hope that helps!