askerlee / topicvec

Problem with input embeddings generated by another algorithm.

geekinglcq opened this issue

commented

Hi, I noticed that in Section 6.1 of your paper, because optimizing the likelihood function over both Z and V is inefficient, you chose to split the process into two stages: first obtain word embeddings, then take them as input in the second stage.

I wonder if it's OK to input embeddings generated by another algorithm (e.g. word2vec) instead of PSDVec.

I've tried it and got some weird results. My corpus includes 10,000 docs containing 3,223,788 validated words. The input embeddings were generated using w2v.
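For reference, this is roughly how the w2v embeddings were produced and exported, as a minimal gensim sketch; whether topicvec can read this plain-text format directly is my assumption, and the corpus here is just a placeholder.

```python
# Minimal sketch (not the topicvec pipeline itself): train word2vec with gensim
# and dump the vectors to a plain-text file ("word v1 v2 ... vD" per line,
# preceded by a "vocab_size dim" header). Whether topicvec accepts this format
# as-is is an assumption; the loader may need small adjustments.
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized documents.
sentences = [["this", "is", "one", "doc"], ["another", "tokenized", "doc"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("w2v_embeddings.txt", binary=False)
```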

In iteration 1 the loglike is 1.3e11, in iteration 2 it is 0.7e11, and as the process continues, the loglike keeps decreasing. Hence the best result always occurs after the first iteration instead of the last one. The output is actually quite reasonable judging by the "Most relevant words", but the strange behaviour of the likelihood really bothers me.

Thank you for trying our code.

Yes, you can use embeddings generated by other algorithms such as w2v. In theory there should be no significant performance difference; however, in my experiments, using w2v embeddings usually yields worse results. I guess the reason might be that w2v embeddings have a less clear probabilistic interpretation.

I have never seen monotonically decreasing loglikes in my experiments. Sometimes the loglike decreases for a few iterations and then increases again, which is usually caused by too large GD steps. But monotonically decreasing loglikes... I have no idea, sorry. If you don't mind, you could upload your data somewhere and I'll try to figure out the reason.
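As a generic safeguard against too-large GD steps (this is not topicvec's actual training loop; `init_params` and `run_em_iteration` are hypothetical names), you can keep the parameters from the best-loglike iteration and shrink the step whenever the loglike drops:

```python
# Generic safeguard sketch, not topicvec's code. `init_params` and
# `run_em_iteration` are hypothetical stand-ins for one training iteration.
import copy

params = init_params()          # hypothetical initializer
step_size = 0.1
best_loglike = float("-inf")
best_params = None

for it in range(1, 101):
    params, loglike = run_em_iteration(params, step_size)  # hypothetical call
    print(f"iter {it}: loglike = {loglike:.4e}")
    if loglike > best_loglike:
        # Remember the best iteration so a later drop doesn't lose it.
        best_loglike, best_params = loglike, copy.deepcopy(params)
    else:
        # A drop in loglike often means the GD step is too large; halve it.
        step_size *= 0.5
```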

Alternatively, you may try the embeddings I trained on Wikipedia. You can find their Dropbox URLs in the GitHub file "vectors&corpus_download_addr.txt".

commented

Hi, since I wanted to apply it to Chinese, I didn't use your embeddings.

I found that the loglike is normal when the corpus contains only one doc.
Here I upload the log, the topicvec output, the corpus, and the word embeddings / unigram table I used. I tried 5 runs, using 10, 100, 1000, 10000, and 20000 docs respectively, named cn_top_a/b/c/d/e.

Thank you for your help.

I am writing in this "discussion" because I think my question is "on topic" :)

If I want to use alternative word embeddings (e.g. word2vec), should I also generate the "top1gram" file?
If so, can I generate the "top1gram" with the "gramcount.pl" script and the word embeddings with an external tool? Is there any relation between the "top1gram" file and the word embedding structure?
(For example, some matching between indices...) A consistency check along these lines is sketched below.
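In case it helps, this is the kind of vocabulary consistency check I had in mind. The file names and formats are my assumptions: both files are treated as plain text with the word as the first whitespace-separated field on each line, which may not match what gramcount.pl actually emits.

```python
# Sanity-check sketch (file names and formats are assumptions): verify that
# every word in the external embedding file also appears in the top1gram /
# unigram table, so the two vocabularies stay consistent.
def load_vocab(path, skip_header=False):
    vocab = set()
    with open(path, encoding="utf-8") as f:
        if skip_header:
            next(f)  # e.g. the "vocab_size dim" header of a word2vec text dump
        for line in f:
            fields = line.split()
            if fields:
                vocab.add(fields[0])
    return vocab

emb_vocab = load_vocab("w2v_embeddings.txt", skip_header=True)  # assumed name
unigram_vocab = load_vocab("top1gram.txt")                      # assumed name

missing = emb_vocab - unigram_vocab
print(f"{len(missing)} embedding words are missing from the unigram table")
```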

Ok, great! Thank you for your help!
And congratulations on your work :)