askerlee / topicvec

If I want to try it on another language, how do I train 25000-180000-500-BLK-8.0.vec.npy? and..

zzks opened this issue · comments

commented

Hi all,

If I want to try it on another language, how can I train 25000-180000-500-BLK-8.0.vec.npy and get top1grams-wiki.txt?
For example, for Chinese, I have a pre-trained w2v model of the Chinese Wikipedia. Can I get these files from that pre-trained model?
Thanks!

top1grams-wiki.txt is generated by the Perl script https://github.com/askerlee/topicvec/blob/master/psdvec/gramcount.pl; you can generate it using the Chinese Wikipedia text as input. gramcount.pl also generates top2grams-wiki.txt (two separate runs are needed, one for top1grams* and one for top2grams*). Then use https://github.com/askerlee/topicvec/blob/master/psdvec/factorize.py, with both top1grams* and top2grams* as input, to generate 25000-180000-500-BLK-8.0.vec.

You can find an example in https://github.com/askerlee/topicvec/blob/master/psdvec/PSDVec.pdf.
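For reference, here is a minimal sketch of how the three runs could be wired together from Python, assuming it is run inside the psdvec/ directory. The actual command-line options of gramcount.pl and factorize.py are deliberately left as placeholders rather than guessed; please look them up in the scripts themselves and in PSDVec.pdf.

```python
# Sketch of the overall pipeline only; the real command-line options of
# gramcount.pl and factorize.py are documented in the scripts and in
# PSDVec.pdf, so they are left as placeholders here instead of guessed.
import subprocess

# Fill these in per the scripts' usage messages, pointing them at your
# plain-text Chinese Wikipedia dump.
UNIGRAM_ARGS = []    # gramcount.pl options for the unigram run -> top1grams-wiki.txt
BIGRAM_ARGS = []     # gramcount.pl options for the bigram run  -> top2grams-wiki.txt
FACTORIZE_ARGS = []  # factorize.py options (core vocab size, embedding dimension, ...)

# Two separate gramcount.pl runs, as noted above.
subprocess.run(["perl", "gramcount.pl", *UNIGRAM_ARGS], check=True)
subprocess.run(["perl", "gramcount.pl", *BIGRAM_ARGS], check=True)

# factorize.py then consumes both top1grams* and top2grams* and writes an
# embedding file such as 25000-180000-500-BLK-8.0.vec (plus a .npy version).
subprocess.run(["python", "factorize.py", *FACTORIZE_ARGS], check=True)
```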

commented

Roger that!
Thank you for the quick response & detailed reply!

I noticed that the number of words in "top1grams" is different from the number of words in the word embedding. E.g., for the Wiki dataset, "top1grams" has 286441 words while the word embedding has 180000.
Does it matter?

Hi Askerlee,
Thank you as usual! :)
I ran into a problem with "Mstep_sample_topwords" and thought it was caused by the gap between these two counts. However, it was actually because the number of words in the word embedding was smaller than "Mstep_sample_topwords". I have fixed it.

Thanks!
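In case someone else hits the same "Mstep_sample_topwords" issue, here is a minimal sketch of the workaround described above: clamp the sample size to the number of words actually present in the embedding. The variable names and file layout below are assumptions for illustration, not the actual topicvec configuration keys.

```python
# Illustrative workaround only; the names and the .npy layout are assumptions.
import numpy as np

# Assumed layout: one row per word, one column per embedding dimension.
embedding = np.load("25000-180000-500-BLK-8.0.vec.npy")
vocab_size = embedding.shape[0]

Mstep_sample_topwords = 150000  # whatever value you originally intended to use
# Never sample more top words than the embedding vocabulary contains.
Mstep_sample_topwords = min(Mstep_sample_topwords, vocab_size)
```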