askerlee / topicvec

If I want to try it on another language, how do I train 25000-180000-500-BLK-8.0.vec.npy? and..

zzks opened this issue · comments

commented

Hi all,

If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy and get top1grams-wiki.txt?
For example, for Chinese, I have a pre-trained w2v model of the Chinese Wikipedia. Can I get these files from that pre-trained model?
Thanks!

top1grams-wiki.txt is generated by the Perl script https://github.com/askerlee/topicvec/blob/master/psdvec/gramcount.pl; you can generate it using the Chinese Wikipedia text as input. gramcount.pl also generates top2grams-wiki.txt (two separate runs are needed, one for top1grams* and one for top2grams*). Then use https://github.com/askerlee/topicvec/blob/master/psdvec/factorize.py, with both top1grams* and top2grams* as input, to generate 25000-180000-500-BLK-8.0.vec.

You can find an example in https://github.com/askerlee/topicvec/blob/master/psdvec/PSDVec.pdf.
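For reference, here is a minimal sketch of how the three runs could be wired together from Python, assuming it is run inside the psdvec/ directory. The actual command-line options of gramcount.pl and factorize.py are deliberately left as placeholders rather than guessed; please look them up in the scripts themselves and in PSDVec.pdf.

```python
# Sketch of the overall pipeline only; the real command-line options of
# gramcount.pl and factorize.py are documented in the scripts and in
# PSDVec.pdf, so they are left as placeholders here instead of guessed.
import subprocess

# Fill these in per the scripts' usage messages, pointing them at your
# plain-text Chinese Wikipedia dump.
UNIGRAM_ARGS = []    # gramcount.pl options for the unigram run -> top1grams-wiki.txt
BIGRAM_ARGS = []     # gramcount.pl options for the bigram run  -> top2grams-wiki.txt
FACTORIZE_ARGS = []  # factorize.py options (core vocab size, embedding dimension, ...)

# Two separate gramcount.pl runs, as noted above.
subprocess.run(["perl", "gramcount.pl", *UNIGRAM_ARGS], check=True)
subprocess.run(["perl", "gramcount.pl", *BIGRAM_ARGS], check=True)

# factorize.py then consumes both top1grams* and top2grams* and writes an
# embedding file such as 25000-180000-500-BLK-8.0.vec (plus a .npy version).
subprocess.run(["python", "factorize.py", *FACTORIZE_ARGS], check=True)
```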

commented

Roger that!
Thank you for the quick response & detailed reply!

I noticed that the number of words in "top1grams" is different from the number of words in the word embedding. E.g., for the Wiki dataset, "top1grams" has 286441 words while the word embedding has 180000.
Does it matter?

Hi Askerlee,
Thank you as usual! :)
I ran into a problem with "Mstep_sample_topwords" and thought it was caused by the gap between these two counts. However, it was actually because the number of words in the word embedding was smaller than "Mstep_sample_topwords". I have fixed it.

Thanks!
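In case someone else hits the same "Mstep_sample_topwords" issue, here is a minimal sketch of the workaround described above: clamp the sample size to the number of words actually present in the embedding. The variable names and file layout below are assumptions for illustration, not the actual topicvec configuration keys.

```python
# Illustrative workaround only; the names and the .npy layout are assumptions.
import numpy as np

# Assumed layout: one row per word, one column per embedding dimension.
embedding = np.load("25000-180000-500-BLK-8.0.vec.npy")
vocab_size = embedding.shape[0]

Mstep_sample_topwords = 150000  # whatever value you originally intended to use
# Never sample more top words than the embedding vocabulary contains.
Mstep_sample_topwords = min(Mstep_sample_topwords, vocab_size)
```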