glample / tagger

Named Entity Recognition Tool

Where are the pretrained word embeddings?

WaveLi123 opened this issue · comments

Where are the pretrained word embeddings?

I am also curious about it. I sent the author an email (he is at Facebook now) but got no response.

Hi,

Sorry about this, I probably forgot to reply. Here are the embeddings: https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM

Best,
Guillaume

Thank you! Is this just English?

Yes. Here are the others:

Dutch: https://drive.google.com/open?id=0B23ji47zTOQNckpFdDVTX1JRYzQ
German: https://drive.google.com/open?id=0B23ji47zTOQNdGdqTkk5QWRTZkU
Spanish: https://drive.google.com/open?id=0B23ji47zTOQNNzd1SDJibm1BWk0

The German and Spanish embeddings are pretty good if I remember correctly, but the Dutch ones are bad; I would not use them. I think the Dutch model could easily be 5 F1 points better if the embeddings were trained on a bigger corpus (I forget which corpus we used, but it was really small).

@glample you have deleted my comment, should I create a new issue? I need your help regarding the error.

@glample I have created a new issue, please have a look at #62.

Thank you! @glample

Hi @cosmozhang, I wanted to confirm that in order to train the model using these word embeddings, the command to run the script is:
python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb Skip100 --all_emb 100
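
For reference, judging from the options defined in train.py, --all_emb looks like a 0/1 switch that controls whether all pretrained embeddings are loaded (not a dimension), and the embedding dimension is set with --word_dim, which defaults to 100 and must match the pretrained file. A log further down in this thread shows all_emb=False even though a number other than 1 was passed, which suggests the option only counts as enabled when it equals exactly 1. With the 100-dimensional Skip100 embeddings, the command would presumably be:

python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb Skip100 --word_dim 100 --all_emb 1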

@cosmozhang Thanks for the response! One more thing: if you are using Windows to run the code, please suggest a solution for issue #62.

I am on Linux and Mac OS. I am not using Windows for research. :) @Rabia-Noureen

Oh okay, thanks anyway. :)

@Rabia-Noureen From a quick look, it is not a problem related to the OS. I might also encounter it later. I have not tried to use the embeddings yet.

@cosmozhang I even tried to train the model without word embeddings, but I am still facing that issue. Please let me know if you also face it later on; I am new to Python, so I don't have any idea how to resolve it. I am using the dataset provided with the code: https://github.com/glample/tagger/tree/master/dataset.

@Rabia-Noureen Yes, I also encountered it when using the embeddings. I am planning to have a look later. I am using PyTorch now, so I just want to reproduce the results in PyTorch.

PyTorch is much friendlier than Theano, though I was heavily on Theano before as well. @Rabia-Noureen

Does this code also work well on PyTorch without modification? I guess PyTorch is not available for Windows.

@cosmozhang Please let me know whenever you are able to resolve the error; I have been stuck for the past 2 months. I will wait for your response. Thanks for the help...

@Rabia-Noureen yes, I deleted your previous post. Sorry, but it was not related to the topic "Where are the pretrained word embeddings?". I would appreciate it if you didn't post your issues at the end of issues created by other users on a different topic. I saw your problem, and I really don't know... This is weird; the loss is barely decreasing. I'll be very busy until Friday, then I will have a look at your problem and see if I can help you debug it.

@glample I am sorry for posting it here, but I have been trying to get help by creating issues for the past 2 months, so I decided to contact you here when I saw your comment yesterday. No problem, I will wait for your response regarding debugging the problem. I would be thankful if you are able to help.
Thanks

@glample I am waiting for your response on my issue, please help me debug the problem, I am stuck....
Thanks

Can you send me an email with your exact settings, problem, and what you have tried to fix it so far?

@Rabia-Noureen Just do this (the tagger runs evaluation/conlleval as an external script during evaluation, so it needs execute permission):
chmod +x evaluation/conlleval

@glample sure, I will send you the email, please tell me your email address...

@cosmozhang chmod does not work on Windows. I tried to find an alternative and found Attrib, so I am going to try Attrib +x evaluation/conlleval. Has it solved the issue for you?

@Rabia-Noureen Yes! Why not try to use Linux? It is super convenient to do so.

@glample is this your exact email: firstname.lastname@gmail.com? I have tried to send the email, but it failed to deliver. Please check it again....

@cosmozhang yes, I was thinking about it. Someone suggested I use Docker for Windows. Do you have any idea whether I can install a Linux environment on Docker and keep my GPU setup from Windows, or will I have to go through all the Python and GPU installation and configuration again on Linux? I don't want to go through all the installations all over again; it's very time consuming, and I have to report my progress to my supervisor next week....

@Rabia-Noureen replace "firstname" and "lastname" with my real first name and last name.

@glample thanks, I have just sent the email.

@cosmozhang I have installed Ubuntu 16.04 on VirtualBox. The tagger is now working fine, but I want to use the GoogleNews-vectors-negative300.bin.gz and glove.840B.300d.zip word embeddings to train my model. I am unable to load and use them in Python after extracting them with normal extraction software. I get this error with GoogleNews-vectors-negative300.bin.

Skip100 is working fine because it is not in compressed form.
Can you please help me with how to extract and use these embeddings? The links are below:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
https://nlp.stanford.edu/projects/glove/
I also tried to use them without extracting, but that failed too.

(my_env) acer@acer:~/tagger$ python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb GoogleNews-vectors-negative300.bin --all_emb 300
Model location: ./models/tag_scheme=iob,lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=True,pre_emb=GoogleNews-vectors-negative300.bin,all_emb=False,cap_dim=0,crf=True,dropout=0.5,lr_method=adam
Found 23624 unique words (203621 in total)
Loading pretrained embeddings from GoogleNews-vectors-negative300.bin...
Traceback (most recent call last):
  File "train.py", line 162, in <module>
    ) if not parameters['all_emb'] else None
  File "/home/acer/tagger/loader.py", line 169, in augment_with_pretrained
    for line in codecs.open(ext_emb_path, 'r', 'utf-8')
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 699, in next
    return self.reader.next()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte

@glample any suggestions please?

What is the content of the GoogleNews-vectors-negative300.bin file? Can you copy the first lines of this file here? If this is a binary file, then you can't load it with the tagger. The tagger will only load a text file with one word embedding per line.
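
For anyone hitting the UnicodeDecodeError above: the GoogleNews file is a binary word2vec file, so it has to be converted to that plain-text layout first. A minimal sketch using gensim (assuming gensim is installed; the file names are the ones from this thread):

from gensim.models import KeyedVectors

# Load the binary word2vec file, then re-save it as plain text,
# one word followed by its vector components per line.
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
vectors.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)

Note that save_word2vec_format writes a header line (vocabulary size and dimension) at the top of the text file; if the tagger's loader complains about that line, just delete it. Since these vectors are 300-dimensional, training would also need --word_dim 300.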

It's a .iso file, not a text file, so I can't open it. How can I load that file then? It has been used in a research study with the tagger, but I don't know how....

Why not use the embeddings we used in the paper instead?
https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM

Actually, I have read in a paper that glove.840B.300d gives the best results for NLP with the tagger, so I wanted to use it to improve the accuracy. I have extracted this file into a text file, but it also has some issue; please have a look at it. Otherwise, if it can't be solved, I will have to use Skip100 as you did.

[screenshot attached]

Is it working with the Skip100 embeddings?
Pretty sure the GloVe ones won't be better. What paper are you referring to?

@glample Yes, I have tried Skip100; it is working fine.
I am referring to these papers:
"Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks" (page 13), which used GloVe word embeddings,
and
"Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN", which used Google News word embeddings with the tagger.

This paper does not compare the GloVe embeddings with Skip100, and I doubt they will work better. Anyway, can you copy-paste here the first few lines of the GloVe embeddings text file?
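
For reference, the expected text format is simply the token followed by its vector components, space-separated, one entry per line; the GloVe release files already follow that layout and have no header line. Something like this (the values here are made up for illustration):

the 0.04656 0.21318 -0.00746 ...
, -0.25539 -0.25723 0.13169 ...

Since glove.840B.300d contains 300-dimensional vectors, it would also require --word_dim 300.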

@glample Are all the pretrained vectors (English, Dutch, German, Spanish) skip-n-gram, or are some skip-gram and some skip-n-gram?

Everything is skip-n-gram.

  1. Can you give me a link to the list of corpora for English, Dutch, German, and Spanish that you used to train the skip-n-gram pretrained vectors?

  2. Why did you use a different size of pretrained vector for the different languages? I have seen that you used 100 for English and 64 for the rest of the languages.

The embeddings were trained by someone in my lab while I was at CMU (no idea why the dimensions are not the same), and I don't have access to the corpora anymore. The corpora are listed in the paper, but I don't know if there is a link to them.

@glample Thank you for your reply

@glample Hi, I've requested access to the Skip100 file through your link above, but I don't have permission; I need you to approve it.