ywu94 / NLP-Notes

NLP learning notes covering classic papers, algorithm implementations, and modeling tricks.

NLP Notes

Attention

Additive/concat Attention

Multiplicative Attention

Multi-head Self Attention / Transformer
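
Below is a minimal NumPy sketch (illustrative, not taken from this repo's note files) contrasting the additive/concat and multiplicative score functions listed above; multi-head self-attention in the Transformer runs the scaled multiplicative form in several parallel heads. All dimensions and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 8                               # key/query dimension, additive hidden dimension

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_score(q, k, W_q, W_k, v):
    # Bahdanau-style (additive/concat): score = v^T tanh(W_q q + W_k k)
    return v @ np.tanh(W_q @ q + W_k @ k)

def multiplicative_score(q, k):
    # Luong-style (dot-product), scaled as in the Transformer: score = q^T k / sqrt(d)
    return (q @ k) / np.sqrt(len(q))

q = rng.normal(size=d)                    # one query
keys = rng.normal(size=(3, d))            # three keys
values = rng.normal(size=(3, d))          # three values

W_q, W_k, v = rng.normal(size=(h, d)), rng.normal(size=(h, d)), rng.normal(size=h)
add_weights = softmax(np.array([additive_score(q, k, W_q, W_k, v) for k in keys]))
mul_weights = softmax(np.array([multiplicative_score(q, k) for k in keys]))

context = mul_weights @ values            # attention output: weighted sum of values
```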

Subword Tokenization

  • Summary: HuggingFace Tokenizer Summary
  • Implementation: HuggingFace Tokenizer, Google SentencePiece
  • Unigram Language Model (ULM)
    • assumes all subword occurrences are independent, so the probability of a subword sequence is the product of the individual subword occurrence probabilities
    • optimizes the whole-sentence likelihood; the most likely segmentation is found with the Viterbi algorithm (see the sketch after this list)
    • both WP and ULM leverage a language model to build the subword vocabulary
  • Byte Pair Encoding (BPE)
    • starts from the character level and repeatedly merges the most frequent adjacent pair into a new subword, until the desired vocabulary size is reached or the highest pair frequency drops to 1 (see the merge-loop sketch after this list)
    • used in GPT-2 and RoBERTa; see Git Issue for an implementation
    • tokenizers.CharBPETokenizer: OpenAIGPTTokenizerFast
    • tokenizers.ByteLevelBPETokenizer: GPT2TokenizerFast, RobertaTokenizerFast, LongformerTokenizerFast
  • WordPiece (WP)
    • similar to BPE, but "choose the new word unit out of all possible ones that increase the likelihood on the training data the most when added to the model"
      • define log P(sentence) = Σ log P(token_i); merging adjacent tokens x and y into z changes the likelihood by log P(token_z) − (log P(token_x) + log P(token_y)) (see the worked example after this list)
    • tokenizers.BertWordPieceTokenizer: BertTokenizerFast, DistilBertTokenizerFast, ElectraTokenizerFast, RetriBertTokenizerFast, MobileBertTokenizerFast
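
A minimal sketch of the ULM segmentation step referenced above: given a fixed subword vocabulary with unigram log-probabilities (made-up numbers, not from a trained model), Viterbi-style dynamic programming recovers the maximum-likelihood segmentation of a word.

```python
import math

# toy unigram log-probabilities for a tiny subword vocabulary (illustrative numbers)
log_probs = {"h": math.log(0.05), "u": math.log(0.05), "g": math.log(0.05),
             "s": math.log(0.07), "hu": math.log(0.04), "gs": math.log(0.04),
             "hug": math.log(0.10)}

def viterbi_segment(word, log_probs):
    """Return the segmentation of `word` that maximizes the sum of subword log-probs."""
    n = len(word)
    best = [0.0] + [float("-inf")] * n   # best[i] = best log-prob of word[:i]
    back = [0] * (n + 1)                 # back[i] = start index of the last subword
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in log_probs and best[j] + log_probs[piece] > best[i]:
                best[i] = best[j] + log_probs[piece]
                back[i] = j
    pieces, i = [], n                    # walk the back-pointers to recover the pieces
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces)), best[n]

print(viterbi_segment("hugs", log_probs))   # → (['hug', 's'], ≈ -4.96) under these toy numbers
```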
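
A minimal sketch of the BPE merge loop described in the bullet above (character-level start, merge the highest-frequency adjacent pair, stop at the target vocabulary size or when the best pair occurs only once); the toy word counts, `</w>` end-of-word marker, and merge budget are illustrative.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word as space-joined symbols: frequency} map."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation (a new subword)."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: words split into characters with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10                            # in practice: merge until the target vocab size
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best_pair, freq = pairs.most_common(1)[0]
    if freq == 1:                          # stop when the best pair occurs only once
        break
    vocab = merge_pair(best_pair, vocab)
```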
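
A worked toy example of the WordPiece likelihood gain defined above. The corpus size and token counts are made up, and real WordPiece training re-estimates counts after every merge, which this snippet does not attempt.

```python
import math

N = 1000                                        # total token count in the toy corpus
count = {"un": 20, "able": 30, "unable": 15}    # "unable" is the candidate merged unit

def log_p(token):
    return math.log(count[token] / N)

# per-occurrence change in corpus log-likelihood when merging "un" + "able" -> "unable"
gain = log_p("unable") - (log_p("un") + log_p("able"))
print(gain)                                     # > 0 means the merge increases the likelihood
```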

Industrial Application

Google Neural Machine Translation System

Concepts applied: Additive/concat attention, Residual connection, Vanilla dropout
Resources: [Paper][Illustrative Intro][TF2 Implementation][Torch Implementation]

BERT: Bidirectional Encoder Representations from Transformers

Resources: [Paper]

Probabilistic Graph

Conditional Random Field

Resources: [Introduction to CRF][CRF vs MRF][CRF for Multi-label Classification][Tensorflow CRF]

Bi-LSTM CRF

Resources: [Paper][TF1.0 Implementation by Scofield]

Label Attention Network

Resources: [Paper][Torch Implementation by Author]

Modeling Tricks

Transformer Training

Pre-Layer Normalization Transformer: [Paper]
Training Tips for Transformer: [Paper]
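
A minimal PyTorch sketch of the pre-layer-normalization block discussed in the first paper above, with the post-LN ordering noted in a comment for contrast; module choices and hyperparameters are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Pre-LN: LayerNorm is applied *before* each sub-layer, inside the residual branch."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Pre-LN: x + SubLayer(LN(x)); the original post-LN order is LN(x + SubLayer(x))
        h = self.ln1(x)
        x = x + self.drop(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.drop(self.ffn(self.ln2(x)))
        return x

y = PreLNTransformerBlock()(torch.randn(2, 16, 512))   # (batch, sequence, d_model)
```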

Recurrent Neural Network Normalization

Resources: [Methodology Overview][Layer Normalization]
Experience: use BatchNormalization or LayerNormalization after each RNN layer
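
A minimal Keras sketch of the experience note above, inserting LayerNormalization after each recurrent layer; vocabulary size, sequence length, and layer widths are placeholder values.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,), dtype="int32"),                # sequence of token ids
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.LayerNormalization(),                               # or layers.BatchNormalization()
    layers.Bidirectional(layers.LSTM(64)),
    layers.LayerNormalization(),
    layers.Dense(2, activation="softmax"),
])
```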

Recurrent Neural Network Dropout

Resources: [Methodology Overview][Vanilla Dropout][Variational Dropout][Recurrent Dropout]
Experience: set dropout ratio between 0.1 and 0.3, begin with vanilla dropout
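
A minimal Keras sketch of the experience note above: start with plain input dropout in the 0.1–0.3 range and add recurrent dropout only if overfitting persists. Layer sizes are placeholder values, and the mapping of Keras arguments onto the dropout variants in the linked resources is approximate.

```python
import tensorflow as tf
from tensorflow.keras import layers

# `dropout` masks the non-recurrent input connections of the LSTM;
# `recurrent_dropout` masks the recurrent (state-to-state) connections.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,), dtype="int32"),
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.LSTM(64, dropout=0.2),                              # start with input dropout only
    # layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),     # add recurrent dropout if needed
    layers.Dense(2, activation="softmax"),
])
```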

License: MIT
