glample / tagger

Named Entity Recognition Tool

Where are the pretrained word embeddings?

WaveLi123 opened this issue · comments

Where are the pretrained word embeddings?

I am also curious about it. I sent the author an email (he is at Facebook now) but got no response.

Hi,

Sorry about this, I probably forgot to reply. Here are the embeddings: https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM

Best,
Guillaume

Thank you! Is this just English?

Yes. Here are the others:

Dutch: https://drive.google.com/open?id=0B23ji47zTOQNckpFdDVTX1JRYzQ
German: https://drive.google.com/open?id=0B23ji47zTOQNdGdqTkk5QWRTZkU
Spanish: https://drive.google.com/open?id=0B23ji47zTOQNNzd1SDJibm1BWk0

The German and Spanish embeddings are pretty good if I remember correctly, but the Dutch ones are bad; I would not use them. I think the Dutch model could easily be 5 F1 points better if the embeddings were trained on a bigger corpus (I forget which corpus we used, but it was really small).

@glample you have deleted my comment, should I create a new issue? I need your help regarding the error.

@glample I have created a new issue, please have a look at #62.

Thank you! @glample

Hi @cosmozhang, I wanted to confirm that in order to train the model using these word embeddings, the command to run the script is:
python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb Skip100 --all_emb 100
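
For reference, judging from the options defined in train.py, --all_emb looks like a 0/1 switch that controls whether all pretrained embeddings are loaded (not a dimension), and the embedding dimension is set with --word_dim, which defaults to 100 and must match the pretrained file. A log further down in this thread shows all_emb=False even though a number other than 1 was passed, which suggests the option only counts as enabled when it equals exactly 1. With the 100-dimensional Skip100 embeddings, the command would presumably be:

python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb Skip100 --word_dim 100 --all_emb 1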

@cosmozhang Thanks for the response! One more thing: if you are using Windows to run the code, please suggest a solution for issue #62.

I am on Linux and Mac OS. I am not using Windows for research. :) @Rabia-Noureen

Oh okay, thanks anyway. :)

@Rabia-Noureen From a quick look, it is not a problem related to the OS. I might also encounter it later. I have not tried to use the embeddings yet.

@cosmozhang I even tried to train the model without word embeddings, but I am still facing that issue. Please let me know if you also face it later on; I am new to Python, so I don't have any idea how to resolve it. I am using the dataset provided with the code: https://github.com/glample/tagger/tree/master/dataset.

@Rabia-Noureen Yes, I also encountered it when using the embeddings. I am planning to have a look later. I am using PyTorch now, so I just want to reproduce the results in PyTorch.

PyTorch is much friendlier than Theano, though I was heavily on Theano before as well. @Rabia-Noureen

Does this code also work well on PyTorch without modification? I guess PyTorch is not available for Windows.

@cosmozhang Please let me know whenever you are able to resolve the error; I have been stuck for the past 2 months. I will wait for your response. Thanks for the help...

@Rabia-Noureen yes, I deleted your previous post. Sorry, but it was not related to the topic "Where are the pretrained word embeddings?". I would appreciate it if you didn't post your issues at the end of issues created by other users on a different topic. I saw your problem, and I really don't know... This is weird; the loss is barely decreasing. I'll be very busy until Friday, then I will have a look at your problem and see if I can help you debug it.

@glample I am sorry for posting it here, but I have been trying to get help by creating issues for the past 2 months, so I decided to contact you here when I saw your comment yesterday. No problem, I will wait for your response regarding debugging the problem. I would be thankful if you are able to help.
Thanks

@glample I am waiting for your response on my issue, please help me debug the problem, I am stuck....
Thanks

Can you send me an email with your exact settings, problem, and what you have tried to fix it so far?

@Rabia-Noureen Just do this (the tagger runs evaluation/conlleval as an external script during evaluation, so it needs execute permission):
chmod +x evaluation/conlleval

@glample sure, I will send you the email, please tell me your email address...

@cosmozhang chmod does not work on Windows. I tried to find an alternative and found Attrib, so I am going to try Attrib +x evaluation/conlleval. Has it solved the issue for you?

@Rabia-Noureen Yes! Why not try to use Linux? It is super convenient to do so.

@glample is this your exact email: firstname.lastname@gmail.com? I have tried to send the email, but it failed to deliver. Please check it again....

@cosmozhang yes, I was thinking about it. Someone suggested I use Docker for Windows. Do you have any idea whether I can install a Linux environment on Docker and keep my GPU setup from Windows, or will I have to go through all the Python and GPU installation and configuration again on Linux? I don't want to go through all the installations all over again; it's very time consuming, and I have to report my progress to my supervisor next week....

@Rabia-Noureen replace "firstname" and "lastname" with my real first name and last name.

@glample thanks, I have just sent the email.

@cosmozhang I have installed Ubuntu 16.04 on VirtualBox. The tagger is now working fine, but I want to use the GoogleNews-vectors-negative300.bin.gz and glove.840B.300d.zip word embeddings to train my model. I am unable to load and use them in Python after extracting them with normal extraction software. I get this error with GoogleNews-vectors-negative300.bin.

Skip100 is working fine because it is not in compressed form.
Can you please help me with how to extract and use these embeddings? The links are below:
https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit
https://nlp.stanford.edu/projects/glove/
I also tried to use them without extracting, but that failed too.

(my_env) acer@acer:~/tagger$ python train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb --lr_method adam --tag_scheme iob --pre_emb GoogleNews-vectors-negative300.bin --all_emb 300
Model location: ./models/tag_scheme=iob,lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=True,pre_emb=GoogleNews-vectors-negative300.bin,all_emb=False,cap_dim=0,crf=True,dropout=0.5,lr_method=adam
Found 23624 unique words (203621 in total)
Loading pretrained embeddings from GoogleNews-vectors-negative300.bin...
Traceback (most recent call last):
  File "train.py", line 162, in <module>
    ) if not parameters['all_emb'] else None
  File "/home/acer/tagger/loader.py", line 169, in augment_with_pretrained
    for line in codecs.open(ext_emb_path, 'r', 'utf-8')
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 699, in next
    return self.reader.next()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 630, in next
    line = self.readline()
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 545, in readline
    data = self.read(readsize, firstline=True)
  File "/home/acer/anaconda2/envs/my_env/lib/python2.7/codecs.py", line 492, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x94 in position 0: invalid start byte

@glample any suggestions please?

What is the content of the GoogleNews-vectors-negative300.bin file? Can you copy the first lines of this file here? If this is a binary file, then you can't load it with the tagger. The tagger will only load a text file with one word embedding per line.
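
For anyone hitting the UnicodeDecodeError above: the GoogleNews file is a binary word2vec file, so it has to be converted to that plain-text layout first. A minimal sketch using gensim (assuming gensim is installed; the file names are the ones from this thread):

from gensim.models import KeyedVectors

# Load the binary word2vec file, then re-save it as plain text,
# one word followed by its vector components per line.
vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
vectors.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)

Note that save_word2vec_format writes a header line (vocabulary size and dimension) at the top of the text file; if the tagger's loader complains about that line, just delete it. Since these vectors are 300-dimensional, training would also need --word_dim 300.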

It's a .iso file, not a text file, so I can't open it. How can I load that file then? It has been used in a research study with the tagger, but I don't know how....

Why not use the embeddings we used in the paper instead?
https://drive.google.com/open?id=0B23ji47zTOQNUXYzbUVhdDN2ZmM

Actually, I have read in a paper that glove.840B.300d gives the best results for NLP with the tagger, so I wanted to use it to improve the accuracy. I have extracted this file into a text file, but it also has some issue; please have a look at it. Otherwise, if it can't be solved, I will have to use Skip100 as you did.

[screenshot attached]

Is it working with the Skip100 embeddings?
Pretty sure the GloVe ones won't be better. What paper are you referring to?

@glample Yes, I have tried Skip100; it is working fine.
I am referring to these papers:
"Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks" (page 13), which used GloVe word embeddings,
and
"Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN", which used Google News word embeddings with the tagger.

This paper does not compare the GloVe embeddings with Skip100, and I doubt they will work better. Anyway, can you copy-paste here the first few lines of the GloVe embeddings text file?
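
For reference, the expected text format is simply the token followed by its vector components, space-separated, one entry per line; the GloVe release files already follow that layout and have no header line. Something like this (the values here are made up for illustration):

the 0.04656 0.21318 -0.00746 ...
, -0.25539 -0.25723 0.13169 ...

Since glove.840B.300d contains 300-dimensional vectors, it would also require --word_dim 300.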

@glample Are all the pretrained vectors (English, Dutch, German, Spanish) skip-n-gram, or are some skip-gram and some skip-n-gram?

Everything is skip-n-gram.

  1. Can you give me a link to the list of corpora for English, Dutch, German, and Spanish that you used to train the skip-n-gram pretrained vectors?

  2. Why did you use a different size of pretrained vector for the different languages? I have seen that you used 100 for English and 64 for the rest of the languages.

The embeddings were trained by someone in my lab while I was at CMU (no idea why the dimensions are not the same), and I don't have access to the corpora anymore. The corpora are listed in the paper, but I don't know if there is a link to them.

@glample Thank you for your reply

@glample Hi, I've requested access to the Skip100 file through your link above, but I don't have permission; I need you to approve it.