askerlee / topicvec

Problem with input embeddings generated by another algorithm.

geekinglcq opened this issue

commented

Hi, I noticed that in Section 6.1 of your paper, because optimizing the likelihood function over both Z and V is inefficient, you chose to split the process into two stages: first obtain word embeddings, then take them as input in the second stage.

I wonder if it's OK to input embeddings generated by another algorithm (e.g. word2vec) instead of PSDVec.

I've tried it and got some weird results. My corpus includes 10,000 docs containing 3,223,788 validated words. The input embeddings were generated using w2v.
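For reference, this is roughly how the w2v embeddings were produced and exported, as a minimal gensim sketch; whether topicvec can read this plain-text format directly is my assumption, and the corpus here is just a placeholder.

```python
# Minimal sketch (not the topicvec pipeline itself): train word2vec with gensim
# and dump the vectors to a plain-text file ("word v1 v2 ... vD" per line,
# preceded by a "vocab_size dim" header). Whether topicvec accepts this format
# as-is is an assumption; the loader may need small adjustments.
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized documents.
sentences = [["this", "is", "one", "doc"], ["another", "tokenized", "doc"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
model.wv.save_word2vec_format("w2v_embeddings.txt", binary=False)
```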

In iteration 1 the loglike is 1.3e11, in iteration 2 it is 0.7e11, and as the process continues, the loglike keeps decreasing. Hence the best result always occurs after the first iteration instead of the last one. The output is actually quite reasonable judging by the "Most relevant words", but the strange behaviour of the likelihood really bothers me.

Thank you for trying our code.

Yes, you can use embeddings generated by other algorithms such as w2v. In theory there should be no significant performance difference; however, in my experiments, using w2v embeddings usually yields worse results. I guess the reason might be that w2v embeddings have a less clear probabilistic interpretation.

I have never seen monotonically decreasing loglikes in my experiments. Sometimes the loglike decreases for a few iterations and then increases again, which is usually caused by too large GD steps. But monotonically decreasing loglikes... I have no idea, sorry. If you don't mind, you could upload your data somewhere and I'll try to figure out the reason.
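As a generic safeguard against too-large GD steps (this is not topicvec's actual training loop; `init_params` and `run_em_iteration` are hypothetical names), you can keep the parameters from the best-loglike iteration and shrink the step whenever the loglike drops:

```python
# Generic safeguard sketch, not topicvec's code. `init_params` and
# `run_em_iteration` are hypothetical stand-ins for one training iteration.
import copy

params = init_params()          # hypothetical initializer
step_size = 0.1
best_loglike = float("-inf")
best_params = None

for it in range(1, 101):
    params, loglike = run_em_iteration(params, step_size)  # hypothetical call
    print(f"iter {it}: loglike = {loglike:.4e}")
    if loglike > best_loglike:
        # Remember the best iteration so a later drop doesn't lose it.
        best_loglike, best_params = loglike, copy.deepcopy(params)
    else:
        # A drop in loglike often means the GD step is too large; halve it.
        step_size *= 0.5
```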

Alternatively, you may try the embeddings I trained on Wikipedia. You can find their Dropbox URLs in the GitHub file "vectors&corpus_download_addr.txt".

commented

Hi, since I wanted to apply it to Chinese, I didn't use your embeddings.

I found that the loglike is normal when the corpus contains only one doc.
Here I upload the log, the topicvec output, the corpus, and the word embeddings / unigram table I used. I tried 5 runs, using 10, 100, 1000, 10000, and 20000 docs respectively, named cn_top_a/b/c/d/e.

Thank you for your help.

I am writing in this "discussion" because I think my question is "on topic" :)

If I want to use alternative word embeddings (e.g. word2vec), should I also generate the "top1gram" file?
If so, can I generate the "top1gram" with the "gramcount.pl" script and the word embeddings with an external tool? Is there any relation between the "top1gram" file and the word embedding structure?
(For example, some matching between indices...) A consistency check along these lines is sketched below.
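In case it helps, this is the kind of vocabulary consistency check I had in mind. The file names and formats are my assumptions: both files are treated as plain text with the word as the first whitespace-separated field on each line, which may not match what gramcount.pl actually emits.

```python
# Sanity-check sketch (file names and formats are assumptions): verify that
# every word in the external embedding file also appears in the top1gram /
# unigram table, so the two vocabularies stay consistent.
def load_vocab(path, skip_header=False):
    vocab = set()
    with open(path, encoding="utf-8") as f:
        if skip_header:
            next(f)  # e.g. the "vocab_size dim" header of a word2vec text dump
        for line in f:
            fields = line.split()
            if fields:
                vocab.add(fields[0])
    return vocab

emb_vocab = load_vocab("w2v_embeddings.txt", skip_header=True)  # assumed name
unigram_vocab = load_vocab("top1gram.txt")                      # assumed name

missing = emb_vocab - unigram_vocab
print(f"{len(missing)} embedding words are missing from the unigram table")
```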

Ok, great! Thank you for your help!
And congratulations on your work :)