glample / tagger

Named Entity Recognition Tool


Script for training embeddings

sa-j opened this issue · comments

commented

Hi there,

Thanks for uploading the NER Tagger! I'm trying to build on the performance of your model for German. You already provided the pre-trained embeddings in issue #44 , however, I want to extend your corpus with some more text. Is it possible for you to upload the script with which the embeddings were produced?

Thank you very much!

@glample
@pvcastro
@julien-c

Sorry, I'm only working with the Portuguese language, so I can't help you with scripts for German!

commented

Ok!

I'm actually looking for the original script with which the embeddings were trained on the Leipzig corpora collection and the German monolingual training data from the 2010 Machine Translation workshop (according to the paper).

Hi,

We trained our embeddings using the wang2vec model; you can find it here:
https://github.com/wlin12/wang2vec
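
If it helps, getting a working binary should just be a matter of cloning the repository and running make (a sketch only, assuming the standard word2vec-style makefile; the resulting ./word2vec binary is what the command further down refers to):

# clone and build wang2vec; make produces the ./word2vec binary
git clone https://github.com/wlin12/wang2vec.git
cd wang2vec
make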

commented

Thank you! And do you have your preprocessing script with which you produced the texts for wang2vec? I want to exactly reproduce the GER64 embeddings (and therefore the results) for the NER tagger.

Sorry, I don't remember the preprocessing details :/
But I think we only used the Moses tokenizer: https://github.com/moses-smt/mosesdecoder/
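
If that's the case, the preprocessing would roughly amount to running the standard Moses tokenizer script over the raw German text before passing it to wang2vec, something like this (file names here are just placeholders):

# tokenize raw German text with the Moses tokenizer (-l de selects German rules)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < corpus.raw.de > corpus.tok.de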

commented

Ok, thank you! And what about the parameter settings for wang2vec, including the window size (which should probably be different for German than for, say, English)?

./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0

Do you have them?

The parameters can be the same for all languages. You should use -type 3. Also, -size 50 is the dimension of your embeddings, so you probably want more than that: GER64 uses 64, but higher might be better.
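
Putting those suggestions together with the default invocation above, the call would look something like this (a sketch only; the input/output names are placeholders and all other flags are left at the defaults shown above):

# structured skip-gram (-type 3) with 64-dimensional vectors, other flags as in the default command
./word2vec -train corpus.tok.de -output ger64_embeddings -type 3 -size 64 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0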

commented

Which versions of the Leipzig corpora collection have you used? Excluding "web", there are 4 text sources ("wiki", "news", "newscrawl", "mixed"), each consisting of 30k to 1M sentences. Have you by chance used only the 1M variants of the most recent entries and merged all 4 documents?

Sorry I don't remember. I would just use everything.
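
For what it's worth, assuming the usual Leipzig download layout (tab-separated *-sentences.txt files with a sentence ID in the first column), using everything amounts to stripping that ID column and concatenating the files before tokenization; the file names below are only examples:

# hypothetical file names; Leipzig *-sentences.txt files are tab-separated with the sentence ID in column 1
cut -f2 deu_*_1M-sentences.txt > leipzig.raw.de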