glample / tagger

Named Entity Recognition Tool


Script for training embeddings

sa-j opened this issue · comments

commented

Hi there,

Thanks for uploading the NER Tagger! I'm trying to build on the performance of your model for German. You already provided the pre-trained embeddings in issue #44 , however, I want to extend your corpus with some more text. Is it possible for you to upload the script with which the embeddings were produced?

Thank you very much!

@glample
@pvcastro
@julien-c

Sorry, I'm only working with the Portuguese language, so I can't help you with scripts for German!

commented

Ok!

I'm actually looking for the original script with which the embeddings were trained on the Leipzig corpora collection and the German monolingual training data from the 2010 Machine Translation workshop (according to the paper).

Hi,

We trained our embeddings using the wang2vec model; you can find it here:
https://github.com/wlin12/wang2vec
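
If it helps, getting a working binary should just be a matter of cloning the repository and running make (a sketch only, assuming the standard word2vec-style makefile; the resulting ./word2vec binary is what the command further down refers to):

# clone and build wang2vec; make produces the ./word2vec binary
git clone https://github.com/wlin12/wang2vec.git
cd wang2vec
make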

commented

Thank you! And do you have your preprocessing script with which you produced the texts for wang2vec? I want to exactly reproduce the GER64 embeddings (and therefore the results) for the NER tagger.

Sorry, I don't remember the preprocessing details :/
But I think we only used the Moses tokenizer: https://github.com/moses-smt/mosesdecoder/
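
If that's the case, the preprocessing would roughly amount to running the standard Moses tokenizer script over the raw German text before passing it to wang2vec, something like this (file names here are just placeholders):

# tokenize raw German text with the Moses tokenizer (-l de selects German rules)
perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < corpus.raw.de > corpus.tok.de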

commented

Ok, thank you! And what about the parameter settings for wang2vec, including the window size (which should probably be different for German than for, say, English)?

./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0

Do you have them?

The parameters can be the same for all languages. You should use -type 3. Also, -size 50 is the dimension of your embeddings, so you probably want more than that: GER64 uses 64, but higher might be better.
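
Putting those suggestions together with the default invocation above, the call would look something like this (a sketch only; the input/output names are placeholders and all other flags are left at the defaults shown above):

# structured skip-gram (-type 3) with 64-dimensional vectors, other flags as in the default command
./word2vec -train corpus.tok.de -output ger64_embeddings -type 3 -size 64 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0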

commented

Which versions of the Leipzig corpora collection have you used? Excluding "web", there are 4 text sources ("wiki", "news", "newscrawl", "mixed"), each consisting of 30k to 1M sentences. Have you by chance used only the 1M variants of the most recent entries and merged all 4 documents?

Sorry I don't remember. I would just use everything.
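
For what it's worth, assuming the usual Leipzig download layout (tab-separated *-sentences.txt files with a sentence ID in the first column), using everything amounts to stripping that ID column and concatenating the files before tokenization; the file names below are only examples:

# hypothetical file names; Leipzig *-sentences.txt files are tab-separated with the sentence ID in column 1
cut -f2 deu_*_1M-sentences.txt > leipzig.raw.de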