google-research / bert

TensorFlow code and pre-trained models for BERT

Home Page: https://arxiv.org/abs/1810.04805

How to get the word embedding after pre-training?

mfxss opened this issue · comments

commented

Hi,
I am excited about this great model, and I want to get the word embeddings. Where should I find the file in the output, or should I change the code to do this?
Thanks,
Yuguang

If you want to get the contextual embeddings (like ELMo), see the extract_features.py section in the README.

If you want the actual word embeddings, the word->id mapping is just the index of the row in vocab.txt, and the embedding matrix is in bert_model.ckpt with the variable name bert/embeddings/word_embeddings.
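
For example, here is a minimal sketch of that second option (an illustrative example, not from the original answer; the model directory is a placeholder for wherever you unpacked a released model):

import tensorflow as tf

MODEL_DIR = 'uncased_L-12_H-768_A-12'          # placeholder path to a downloaded model
CKPT = MODEL_DIR + '/bert_model.ckpt'          # checkpoint prefix
VOCAB = MODEL_DIR + '/vocab.txt'

# word -> id: the id is just the row index in vocab.txt
with open(VOCAB, encoding='utf-8') as f:
    vocab = {token.rstrip('\n'): idx for idx, token in enumerate(f)}

# load the static (non-contextual) embedding matrix straight from the checkpoint
embedding_matrix = tf.train.load_variable(CKPT, 'bert/embeddings/word_embeddings')

print(embedding_matrix.shape)                  # (vocab_size, hidden_size)
print(embedding_matrix[vocab['hello']][:5])    # first dims of the row for 'hello'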

commented

I downloaded your released chinese_L-12_H-768_A-12 model. In vocab.txt, I found some tokens such as
[unused1], [CLS], [SEP], [MASK], <S>, <T>.
What do these tokens mean?

The [CLS], [SEP] and [MASK] tokens are used as described in the paper and README. The [unused] tokens were not used in our model and are randomly initialized.

commented

What training data did you use for chinese_L-12_H-768_A-12, and what is its size?

It's Chinese Wikipedia with both Traditional and Simplified characters.

Hello @mfxss,
Not sure if you still have trouble getting word embeddings from BERT. I implemented a BERT embedding library that lets you get word embeddings programmatically.

https://github.com/imgarylai/bert-embedding

Because I'm working closely with the mxnet & gluonnlp team, my implementation uses mxnet and gluonnlp. However, I am trying to implement it in other frameworks as well.
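
A minimal usage sketch, based on that project's README at the time (the class name, constructor defaults, and return format are assumptions and may have changed since):

# pip install bert-embedding   (assumed package name from the repo above)
from bert_embedding import BertEmbedding

sentences = ['BERT produces contextual word embeddings.']
bert_embedding = BertEmbedding()      # downloads a pre-trained BERT via gluonnlp
results = bert_embedding(sentences)   # one (tokens, vectors) pair per input sentence

tokens, vectors = results[0]
print(tokens)            # WordPiece tokens of the first sentence
print(vectors[0].shape)  # e.g. (768,), the embedding of the first token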

Hope my work can help you.

Hey guys, if you don't want to install an extra module, here is an example:

import tensorflow as tf

# Path to a locally downloaded BERT SavedModel (placeholder)
BERT_PATH = 'HOME_DIR/bert_en_uncased_L-12_H-768_A-12'
imported = tf.saved_model.load(BERT_PATH)

# Pick out the word-embedding table among the trainable variables
embeddings = None
for i in imported.trainable_variables:
    if i.name == 'bert_model/word_embeddings/embeddings:0':
        embeddings = i

And embeddings is the word-embedding tensor that you want!
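
Note that the variable name above is specific to that SavedModel export; in the original TF1 checkpoints the same matrix is stored under bert/embeddings/word_embeddings, as mentioned earlier in this thread.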

Hi @jacobdevlin-google Thanks for the pointers. I see that the output of extract_features.py gives subword representations. I'm sure I'm missing something, but my question is: how can we get a word (not subword) representation instead? Thanks in advance for your help!

Excuse me, did you find a solution for word (not subword) embeddings, please?