-> https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/
Because BERT is pre-trained, it expects its input data in a specific format:
- A special token, [SEP], to mark the end of a sentence, or the separation between two sentences
- A special token, [CLS], at the beginning of our text. This token is used for classification tasks, but BERT expects it no matter what your application is.
- Tokens that conform with the fixed vocabulary used in BERT
- Token IDs for each token, from BERT's tokenizer
- Mask IDs to indicate which elements in the sequence are tokens and which are padding elements
- Segment IDs used to distinguish different sentences
- Positional Embeddings used to show token position within the sequence
Luckily, the transformers interface takes care of all of these requirements via the tokenizer.encode_plus function.
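A minimal sketch of what this looks like, assuming a recent Hugging Face transformers release and the bert-base-uncased checkpoint (the sentence pair and max_length value are just illustrative choices):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# encode_plus adds [CLS]/[SEP], maps tokens to vocabulary IDs, and
# builds the attention mask and segment IDs in a single call.
encoded = tokenizer.encode_plus(
    "The man went to the store.",    # sentence A
    "He bought a gallon of milk.",   # sentence B (optional)
    add_special_tokens=True,         # insert [CLS] and [SEP]
    max_length=32,                   # pad/truncate to a fixed length
    padding='max_length',
    truncation=True,
    return_attention_mask=True,      # mask IDs: 1 = real token, 0 = padding
    return_token_type_ids=True,      # segment IDs: 0 = sentence A, 1 = sentence B
    return_tensors='pt',             # return PyTorch tensors
)

print(encoded['input_ids'])
print(encoded['attention_mask'])
print(encoded['token_type_ids'])
```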
2 Sentence Input:
- [CLS] The man went to the store. [SEP] He bought a gallon of milk. [SEP]

1 Sentence Input:
- [CLS] The man went to the store. [SEP]
The [CLS] and [SEP] tokens are always required.
text = "Here is the sentence I want embeddings for."
produces
['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed', '##ding', '##s', 'for', '.', '[SEP]']
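A short sketch of how that output is produced (again assuming bert-base-uncased; adding the special tokens by hand mirrors what encode_plus does automatically):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "Here is the sentence I want embeddings for."
marked_text = "[CLS] " + text + " [SEP]"

# Split the marked-up text into BERT's WordPiece tokens.
tokenized_text = tokenizer.tokenize(marked_text)
print(tokenized_text)
```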
"embedding" is represented as
['em', '##bed', '##ding', '##s']
After breaking the text into tokens, we have to convert the sentence from a list of strings to a list of vocabulary indices.
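Continuing the sketch above, convert_tokens_to_ids performs exactly this lookup against the tokenizer's vocabulary:

```python
# Map each token string to its index in BERT's vocabulary.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# Show each token next to its vocabulary index.
for token, idx in zip(tokenized_text, indexed_tokens):
    print(f"{token:>12} -> {idx:>8}")
```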
From here on, we'll use the below example sentence, which contains two instances of the word "bank" with different meanings.
Example
- After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.
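Putting the pieces together for this example sentence, here is a sketch of the full input preparation (tokens, vocabulary indices, and segment IDs; using segment ID 0 throughout is an assumption that matches the single-sentence convention encode_plus uses):

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = ("After stealing money from the bank vault, the bank robber "
        "was seen fishing on the Mississippi river bank.")
marked_text = "[CLS] " + text + " [SEP]"

tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

# A single-sentence input forms one segment, so every token
# shares the same segment ID.
segments_ids = [0] * len(tokenized_text)

# Wrap everything in tensors, ready to feed into a BERT model.
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensor = torch.tensor([segments_ids])
```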