ywu94 / NLP-Notes

NLP learning notes covering classic papers, algorithm implementations, and modeling tricks.

NLP Notes

Attention

Additive/concat Attention

Multiplicative Attention

Multi-head Self Attention / Transformer
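
Below is a minimal NumPy sketch (illustrative, not taken from this repo's note files) contrasting the additive/concat and multiplicative score functions listed above; multi-head self-attention in the Transformer runs the scaled multiplicative form in several parallel heads. All dimensions and weights are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 8                               # key/query dimension, additive hidden dimension

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_score(q, k, W_q, W_k, v):
    # Bahdanau-style (additive/concat): score = v^T tanh(W_q q + W_k k)
    return v @ np.tanh(W_q @ q + W_k @ k)

def multiplicative_score(q, k):
    # Luong-style (dot-product), scaled as in the Transformer: score = q^T k / sqrt(d)
    return (q @ k) / np.sqrt(len(q))

q = rng.normal(size=d)                    # one query
keys = rng.normal(size=(3, d))            # three keys
values = rng.normal(size=(3, d))          # three values

W_q, W_k, v = rng.normal(size=(h, d)), rng.normal(size=(h, d)), rng.normal(size=h)
add_weights = softmax(np.array([additive_score(q, k, W_q, W_k, v) for k in keys]))
mul_weights = softmax(np.array([multiplicative_score(q, k) for k in keys]))

context = mul_weights @ values            # attention output: weighted sum of values
```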

Subword Tokenization

  • Summary: HuggingFace Tokenizer Summary
  • Implementation: HuggingFace Tokenizer, Google SentencePiece
  • Unigram Language Model (ULM)
    • assumes all subword occurrences are independent, so the probability of a subword sequence is the product of the individual subword occurrence probabilities
    • optimizes the whole-sentence likelihood; the most likely segmentation is found with the Viterbi algorithm (see the sketch after this list)
    • both WP and ULM leverage a language model to build the subword vocabulary
  • Byte Pair Encoding (BPE)
    • starts from the character level and repeatedly merges the most frequent adjacent pair into a new subword, until the desired vocabulary size is reached or the highest pair frequency drops to 1 (see the merge-loop sketch after this list)
    • used in GPT-2 and RoBERTa; see Git Issue for an implementation
    • tokenizers.CharBPETokenizer: OpenAIGPTTokenizerFast
    • tokenizers.ByteLevelBPETokenizer: GPT2TokenizerFast, RobertaTokenizerFast, LongformerTokenizerFast
  • WordPiece (WP)
    • similar to BPE, but "choose the new word unit out of all possible ones that increase the likelihood on the training data the most when added to the model"
      • define log P(sentence) = Σ log P(token_i); merging adjacent tokens x and y into z changes the likelihood by log P(token_z) − (log P(token_x) + log P(token_y)) (see the worked example after this list)
    • tokenizers.BertWordPieceTokenizer: BertTokenizerFast, DistilBertTokenizerFast, ElectraTokenizerFast, RetriBertTokenizerFast, MobileBertTokenizerFast
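
A minimal sketch of the ULM segmentation step referenced above: given a fixed subword vocabulary with unigram log-probabilities (made-up numbers, not from a trained model), Viterbi-style dynamic programming recovers the maximum-likelihood segmentation of a word.

```python
import math

# toy unigram log-probabilities for a tiny subword vocabulary (illustrative numbers)
log_probs = {"h": math.log(0.05), "u": math.log(0.05), "g": math.log(0.05),
             "s": math.log(0.07), "hu": math.log(0.04), "gs": math.log(0.04),
             "hug": math.log(0.10)}

def viterbi_segment(word, log_probs):
    """Return the segmentation of `word` that maximizes the sum of subword log-probs."""
    n = len(word)
    best = [0.0] + [float("-inf")] * n   # best[i] = best log-prob of word[:i]
    back = [0] * (n + 1)                 # back[i] = start index of the last subword
    for i in range(1, n + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in log_probs and best[j] + log_probs[piece] > best[i]:
                best[i] = best[j] + log_probs[piece]
                back[i] = j
    pieces, i = [], n                    # walk the back-pointers to recover the pieces
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces)), best[n]

print(viterbi_segment("hugs", log_probs))   # → (['hug', 's'], ≈ -4.96) under these toy numbers
```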
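
A minimal sketch of the BPE merge loop described in the bullet above (character-level start, merge the highest-frequency adjacent pair, stop at the target vocabulary size or when the best pair occurs only once); the toy word counts, `</w>` end-of-word marker, and merge budget are illustrative.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {word as space-joined symbols: frequency} map."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its concatenation (a new subword)."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# toy corpus: words split into characters with an end-of-word marker
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

num_merges = 10                            # in practice: merge until the target vocab size
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best_pair, freq = pairs.most_common(1)[0]
    if freq == 1:                          # stop when the best pair occurs only once
        break
    vocab = merge_pair(best_pair, vocab)
```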
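
A worked toy example of the WordPiece likelihood gain defined above. The corpus size and token counts are made up, and real WordPiece training re-estimates counts after every merge, which this snippet does not attempt.

```python
import math

N = 1000                                        # total token count in the toy corpus
count = {"un": 20, "able": 30, "unable": 15}    # "unable" is the candidate merged unit

def log_p(token):
    return math.log(count[token] / N)

# per-occurrence change in corpus log-likelihood when merging "un" + "able" -> "unable"
gain = log_p("unable") - (log_p("un") + log_p("able"))
print(gain)                                     # > 0 means the merge increases the likelihood
```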

Industrial Application

Google Neural Machine Translation System

Concepts applied: Additive/concat attention, Residual connection, Vanilla dropout
Resources: [Paper][Illustrative Intro][TF2 Implementation][Torch Implementation]

BERT: Bidirectional Encoder Representations from Transformers

Resources: [Paper]

Probabilistic Graph

Conditional Random Field

Resources: [Introduction to CRF][CRF vs MRF][CRF for Multi-label Classification][Tensorflow CRF]

Bi-LSTM CRF

Resources: [Paper][TF1.0 Implementation by Scofield]

Label Attention Network

Resources: [Paper][Torch Implementation by Author]

Modeling Tricks

Transformer Training

Pre-Layer Normalization Transformer: [Paper]
Training Tips for Transformer: [Paper]
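
A minimal PyTorch sketch of the pre-layer-normalization block discussed in the first paper above, with the post-LN ordering noted in a comment for contrast; module choices and hyperparameters are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PreLNTransformerBlock(nn.Module):
    """Pre-LN: LayerNorm is applied *before* each sub-layer, inside the residual branch."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # Pre-LN: x + SubLayer(LN(x)); the original post-LN order is LN(x + SubLayer(x))
        h = self.ln1(x)
        x = x + self.drop(self.attn(h, h, h, need_weights=False)[0])
        x = x + self.drop(self.ffn(self.ln2(x)))
        return x

y = PreLNTransformerBlock()(torch.randn(2, 16, 512))   # (batch, sequence, d_model)
```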

Recurrent Neural Network Normalization

Resources: [Methodology Overview][Layer Normalization]
Experience: use BatchNormalization or LayerNormalization after each RNN layer
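
A minimal Keras sketch of the experience note above, inserting LayerNormalization after each recurrent layer; vocabulary size, sequence length, and layer widths are placeholder values.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,), dtype="int32"),                # sequence of token ids
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.LayerNormalization(),                               # or layers.BatchNormalization()
    layers.Bidirectional(layers.LSTM(64)),
    layers.LayerNormalization(),
    layers.Dense(2, activation="softmax"),
])
```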

Recurrent Neural Network Dropout

Resources: [Methodology Overview][Vanilla Dropout][Variational Dropout][Recurrent Dropout]
Experience: set dropout ratio between 0.1 and 0.3, begin with vanilla dropout
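
A minimal Keras sketch of the experience note above: start with plain input dropout in the 0.1–0.3 range and add recurrent dropout only if overfitting persists. Layer sizes are placeholder values, and the mapping of Keras arguments onto the dropout variants in the linked resources is approximate.

```python
import tensorflow as tf
from tensorflow.keras import layers

# `dropout` masks the non-recurrent input connections of the LSTM;
# `recurrent_dropout` masks the recurrent (state-to-state) connections.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(50,), dtype="int32"),
    layers.Embedding(input_dim=10000, output_dim=128),
    layers.LSTM(64, dropout=0.2),                              # start with input dropout only
    # layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),     # add recurrent dropout if needed
    layers.Dense(2, activation="softmax"),
])
```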

License: MIT
