google-research / bert

TensorFlow code and pre-trained models for BERT

Home Page: https://arxiv.org/abs/1810.04805

How to get the word embedding after pre-training?

mfxss opened this issue · comments

commented

Hi,
I am excited about this great model, and I want to get the word embeddings. Where should I find the file in the output, or should I change the code to do this?
Thanks,
Yuguang

If you want to get the contextual embeddings (like ELMo), see the extract_features.py section in the README.

If you want the actual word embeddings, the word->id mapping is just the index of the row in vocab.txt, and the embedding matrix is in bert_model.ckpt with the variable name bert/embeddings/word_embeddings.
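
For example, here is a minimal sketch of that second option (an illustrative example, not from the original answer; the model directory is a placeholder for wherever you unpacked a released model):

import tensorflow as tf

MODEL_DIR = 'uncased_L-12_H-768_A-12'          # placeholder path to a downloaded model
CKPT = MODEL_DIR + '/bert_model.ckpt'          # checkpoint prefix
VOCAB = MODEL_DIR + '/vocab.txt'

# word -> id: the id is just the row index in vocab.txt
with open(VOCAB, encoding='utf-8') as f:
    vocab = {token.rstrip('\n'): idx for idx, token in enumerate(f)}

# load the static (non-contextual) embedding matrix straight from the checkpoint
embedding_matrix = tf.train.load_variable(CKPT, 'bert/embeddings/word_embeddings')

print(embedding_matrix.shape)                  # (vocab_size, hidden_size)
print(embedding_matrix[vocab['hello']][:5])    # first dims of the row for 'hello'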

commented

I downloaded your released chinese_L-12_H-768_A-12 model. In vocab.txt, I found some tokens such as
[unused1], [CLS], [SEP], [MASK], <S>, <T>.
What do these tokens mean?

The [CLS], [SEP] and [MASK] tokens are used as described in the paper and README. The [unused] tokens were not used in our model and are randomly initialized.

commented

What training data did you use for chinese_L-12_H-768_A-12, and what is its size?

It's Chinese Wikipedia with both Traditional and Simplified characters.

Hello @mfxss,
Not sure if you still have trouble getting word embeddings from BERT. I implemented a BERT embedding library that lets you get word embeddings programmatically.

https://github.com/imgarylai/bert-embedding

Because I'm working closely with the mxnet & gluonnlp team, my implementation uses mxnet and gluonnlp. However, I am trying to implement it in other frameworks as well.
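
A minimal usage sketch, based on that project's README at the time (the class name, constructor defaults, and return format are assumptions and may have changed since):

# pip install bert-embedding   (assumed package name from the repo above)
from bert_embedding import BertEmbedding

sentences = ['BERT produces contextual word embeddings.']
bert_embedding = BertEmbedding()      # downloads a pre-trained BERT via gluonnlp
results = bert_embedding(sentences)   # one (tokens, vectors) pair per input sentence

tokens, vectors = results[0]
print(tokens)            # WordPiece tokens of the first sentence
print(vectors[0].shape)  # e.g. (768,), the embedding of the first token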

Hope my work can help you.

Hey guys, if you don't want to install an extra module, here is an example:

import tensorflow as tf

# Path to a locally downloaded BERT SavedModel (placeholder)
BERT_PATH = 'HOME_DIR/bert_en_uncased_L-12_H-768_A-12'
imported = tf.saved_model.load(BERT_PATH)

# Pick out the word-embedding table among the trainable variables
embeddings = None
for i in imported.trainable_variables:
    if i.name == 'bert_model/word_embeddings/embeddings:0':
        embeddings = i

And embeddings is the word-embedding tensor that you want!
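
Note that the variable name above is specific to that SavedModel export; in the original TF1 checkpoints the same matrix is stored under bert/embeddings/word_embeddings, as mentioned earlier in this thread.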

Hi @jacobdevlin-google Thanks for the pointers. I see that the output of extract_features.py gives subword representations. I'm sure I'm missing something, but my question is: how can we get a word (not subword) representation instead? Thanks in advance for your help!

Excuse me, did you find a solution for word (not subword) embeddings, please?