daviddwlee84 / Stanford-CS224n-NLP

The course notes about Stanford CS224n Natural Language Processing with Deep Learning Winter 2019 (using PyTorch)


Stanford CS224n Natural Language Processing with Deep Learning

The course notes about Stanford CS224n Winter 2019 (using PyTorch)

Some general notes I'll write in my Deep Learning Practice repository

Course Related Links

Schedule

| Week | Lectures | Assignments |
| --- | --- | --- |
| 2019/7/1~7/7 | Introduction and Word Vectors, Word Vectors 2 and Word Senses | Assignment 1 |
| 2019/7/8~7/14 | Word Window Classification, Neural Networks, and Matrix Calculus | - |
| 2019/7/15~7/21 | Backpropagation and Computation Graphs | Assignment 2 |
| 2019/10/21~10/27 | Linguistic Structure: Dependency Parsing | - |
| 2019/10/28~11/3 | Recurrent Neural Networks and Language Models | Assignment 3 |
| 2019/11/4~11/10 | Vanishing Gradients and Fancy RNNs, Machine Translation, Seq2Seq and Attention | Assignment 4 |
| 2019/11/11~11/17 | Transformers and Self-Attention For Generative Models, Modeling contexts of use: Contextual Representations and Pretraining | - |
| 2019/11/18~11/24 | Practical Tips for Projects, Question Answering, ConvNets for NLP, Subword Models | Assignment 5 |
| 2019/11/25~12/1 | [Project: Question Answering], Natural Language Generation | - |
| 2019/12/2~12/8 | [Project: Question Answering] | - |
| 2019/12/9~12/15 | Reference in Language and Coreference Resolution | - |
| 2020/1/13~1/19 | Multitask Learning: A general model for NLP? | - |

Lecture

  1. Introduction and Word Vectors
  2. Word Vectors 2 and Word Senses
  3. Word Window Classification, Neural Networks, and Matrix Calculus
  4. Backpropagation and Computation Graphs
  5. Linguistic Structure: Dependency Parsing
  6. The probability of a sentence? Recurrent Neural Networks and Language Models
  7. Vanishing Gradients and Fancy RNNs
  8. Machine Translation, Seq2Seq and Attention
  9. Practical Tips for Final Projects - Default Final Project
  10. Question Answering and the Default Final Project - Default Final Project
  11. ConvNets for NLP
  12. Information from parts of words: Subword Models - Assignment 5
  13. Modeling contexts of use: Contextual Representations and Pretraining - ELMo, BERT
  14. Transformers and Self-Attention For Generative Models - Self-attention, Transformer
  15. Natural Language Generation
  16. Reference in Language and Coreference Resolution
  17. Multitask Learning: A general model for NLP?
  18. Constituency Parsing and Tree Recursive Neural Networks - TODO
  19. Safety, Bias, and Fairness
  20. Future of NLP + Deep Learning

Assignment

  1. Exploring Word Vectors
  2. word2vec
    1. code
    2. written
  3. Dependency Parsing
    1. code
    2. written
  4. Neural Machine Translation
    1. code
    2. written
  5. Character-based Neural Machine Translation
    1. code
    2. written - TODO

Project

  1. Question Answering (Default)
  2. Summarization

Paper reading

  • word2vec
  • negative sampling
  • GloVe
  • improving distributional similarity
  • embedding evaluation methods
  • Transformer
  • ELMo
  • BERT
  • fastText

Derivation

  • backprop

Lectures

Lecture 1: Introduction and Word Vectors

Outline

  • Introduction to Word2vec
    • objective function
    • prediction function
    • how to train it
  • Optimization: Gradient Descent & Chain Rule
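
A minimal NumPy sketch of the word2vec prediction function mentioned above, P(o|c) = softmax(u_o · v_c); the variable names and toy sizes are mine, purely for illustration:

```python
import numpy as np

def predict_context_probs(center_vec, outside_vecs):
    """Skip-gram prediction: P(o | c) = softmax(U @ v_c).

    center_vec:   (d,)   word vector v_c of the center word
    outside_vecs: (V, d) matrix U of outside ("context") word vectors
    """
    scores = outside_vecs @ center_vec      # (V,) dot products u_o . v_c
    scores -= scores.max()                  # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()    # probability of each word appearing in the context

# toy usage: vocabulary of 5 words, 3-dimensional vectors
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))
v_c = rng.normal(size=3)
print(predict_context_probs(v_c, U))        # probabilities sum to 1
```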

Lecture 2: Word Vectors 2 and Word Senses

Outline

  • More detail to Word2vec
    • Skip-grams (SG)
    • Continuous Bag of Words (CBOW)
  • Similarity visualization
  • Co-occurrence matrix + SVD (LSA) vs. Embedding
  • Evaluation on word vectors
    • Intrinsic
    • Extrinsic

CS 168 The Modern Algorithmic Toolbox - for SVD
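
A minimal sketch of the co-occurrence matrix + SVD (LSA-style) approach contrasted above; the toy counts are made up for illustration:

```python
import numpy as np

# toy word-word co-occurrence counts (rows and columns share the same vocabulary)
vocab = ["deep", "learning", "nlp", "rocks"]
M = np.array([[0, 3, 1, 0],
              [3, 0, 2, 1],
              [1, 2, 0, 1],
              [0, 1, 1, 0]], dtype=float)

# truncated SVD: keep only the top-k singular directions as dense word vectors
U, S, Vt = np.linalg.svd(M)
k = 2
word_vectors = U[:, :k] * S[:k]   # each row is a k-dimensional embedding
for word, vec in zip(vocab, word_vectors):
    print(word, vec)
```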

Lecture 3: Word Window Classification, Neural Networks, and Matrix Calculus

Outline

  • Some basic idea of NLP tasks
  • Matrix Calculus
    • Jacobian Matrix
    • Shape convention
  • Loss
    • Softmax
    • Cross-entropy
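
A small PyTorch sketch of the softmax / cross-entropy loss listed above (the logits and target are toy values):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # unnormalized scores for 3 classes
target = torch.tensor([0])                  # index of the true class

# cross_entropy = softmax + negative log likelihood in one call
loss = F.cross_entropy(logits, target)

# equivalent manual computation for comparison
manual = -F.log_softmax(logits, dim=-1)[0, target]
print(loss.item(), manual.item())           # the two values match
```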

Lecture 4: Backpropagation and Computation Graphs

Outline

  • Computational Graph
  • Backprop & Forwardprop
  • Introducing regularization to prevent overfitting
  • Non-linearity: activation functions
  • Practical Tips
    • Parameter Initialization
    • Optimizers
      • plain SGD
      • more sophisticated adaptive optimizers
    • Learning Rates
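
A short PyTorch sketch of the practical tips above: plain SGD vs. a more sophisticated adaptive optimizer, with L2 regularization via weight_decay and a simple learning-rate schedule (all hyperparameter values are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))

# plain SGD with a hand-tuned learning rate and L2 regularization
sgd = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)

# an adaptive optimizer with per-parameter learning rates
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# a typical schedule: decay the learning rate as training progresses
scheduler = torch.optim.lr_scheduler.StepLR(sgd, step_size=10, gamma=0.5)
```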

Lecture 5: Linguistic Structure: Dependency Parsing

Outline

  • Methods of Dependency Parsing
    • Dynamic Programming
      • complexity O(n³)
    • Graph Algorithm
      • create a minimum spanning tree for a sentence
    • Constraint Satisfaction
      • edges are eliminated that don't satisfy hard constraints
    • Transition-based Parsing / Deterministic Dependency Parsing
      • greedy choice of attachments guided by machine learning classifier
      • complexity O(n)
  • Operations of the Shift-reduce Parser
    • Shift
    • Left-Arc
    • Right-Arc
  • Attachment Errors
    • Prepositional Phrase Attachment Errors
    • Verb Phrase Attachment Errors
    • Modifier Attachment Errors
    • Coordination Attachment Errors

mentioned CS103, CS228
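
A minimal Python sketch of the three shift-reduce operations listed above; the data structures are deliberately simplified (Assignment 3 implements the full version):

```python
class PartialParse:
    """Toy transition-based parser state: a stack, a buffer, and the arcs found so far."""
    def __init__(self, sentence):
        self.stack = ["ROOT"]
        self.buffer = list(sentence)
        self.arcs = []                      # (head, dependent) pairs

    def apply(self, transition):
        if transition == "S":               # Shift: move the next buffer word onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":            # Left-Arc: second-from-top depends on top
            dependent = self.stack.pop(-2)
            self.arcs.append((self.stack[-1], dependent))
        elif transition == "RA":            # Right-Arc: top depends on second-from-top
            dependent = self.stack.pop(-1)
            self.arcs.append((self.stack[-1], dependent))

pp = PartialParse(["I", "ate", "fish"])
for t in ["S", "S", "LA", "S", "RA", "RA"]:
    pp.apply(t)
print(pp.arcs)   # [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```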

Lecture 6: The probability of a sentence? Recurrent Neural Networks and Language Models

  • N-gram Language Model
  • Fixed-window Neural Language Model
  • vanilla RNN
  • Language Modeling: the task of predicting the next word, given the words so far
  • Language Model: a system that produces the probability distribution for the next candidate word
  • Conditional Language Modeling: the task of predicting the next word, given the words so far, and also some other input x
    • Machine Translation (x=source sentence, y=target sentence)
    • Summarization (x=input text, y=summarized text)
    • Dialogue (x=dialogue history, y=next utterance)
    • ...
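
A tiny PyTorch sketch of a vanilla RNN language model along the lines described above (vocabulary size and dimensions are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Vanilla RNN language model: predict a distribution over the next word at each position."""
    def __init__(self, vocab_size=1000, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids):                  # (batch, seq_len)
        hidden_states, _ = self.rnn(self.embed(word_ids))
        return self.proj(hidden_states)           # next-word logits at every position

model = RNNLM()
logits = model(torch.randint(0, 1000, (2, 5)))    # batch of 2 sequences of length 5
print(logits.shape)                               # torch.Size([2, 5, 1000])
```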

Lecture 7: Vanishing Gradients and Fancy RNNs

Vanishing gradient =>

  • LSTM and GRU

Lecture 8: Machine Translation, Seq2Seq and Attention

  • Training method: Teacher Forcing
    • During training, we feed the gold (aka reference) target sentence into the decoder, regardless of what the decoder predicts.
  • During testing (decoding): Beam Search vs. Greedy Decoding
    • Decoding Algorithm: an algorithm you use to generate text from your language model
      • Greedy Decoding => lack of backtracking
        • on each step take the most probable word (i.e. argmax)
        • use that as the next word, and feed it as input on the next step
        • keep going until you produce <END> or reach some max length
      • Beam Search: aims to find a high-probability sequence by tracking multiple possible sequences at once
        • on each step of the decoder, keep track of the k (beam size) most probable partial sequences (hypotheses)
        • after reaching some stopping criterion (e.g. n complete hypotheses, where each hypothesis ends when it produces <END> or hits the max depth), choose the sequence with the highest probability (with score normalization)
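
A minimal sketch of greedy decoding as described above; `step` (a function returning next-word log-probabilities given the prefix) is a stand-in for a real decoder, not part of the assignment code:

```python
import torch

def greedy_decode(step, bos_id, eos_id, max_len=20):
    """Greedy decoding: at each step take the argmax word and feed it back in."""
    output = [bos_id]
    for _ in range(max_len):
        log_probs = step(output)            # (vocab_size,) log P(next word | prefix)
        next_word = int(torch.argmax(log_probs))
        output.append(next_word)
        if next_word == eos_id:             # stop once <END> is produced
            break
    return output

# toy "model": prefers token 3 for a while, then <END> (id 1)
def toy_step(prefix):
    scores = torch.full((10,), -5.0)
    scores[3] = 0.0 if len(prefix) < 3 else -6.0
    scores[1] = -1.0
    return scores

print(greedy_decode(toy_step, bos_id=0, eos_id=1))   # [0, 3, 3, 1]
```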

Lecture 13: Modeling contexts of use: Contextual Representations and Pretraining

ELMo, BERT

Lecture 14: Transformers and Self-Attention For Generative Models

guest lecture

Self-attention, Transformer

Lecture 9: Practical Tips for Final Projects

Vanishing Gradient, LSTM, GRU (again)

Lecture 10: Question Answering and the Default Final Project

some more Attention, mentioned CS 276: Information Retrieval and Web Search

Quick notes about QA:

  • QA types
    • Factoid QA: answer is a named entity (something with a clear semantic type)
    • Extractive QA: answer must be a span (a sub-sequence of words) in the passage
      • e.g. SQuAD 1.X
      • defect: all questions have an answer in the paragraph => it turns into a kind of ranking task
    • Extractive QA + NoAnswer: some question might have no answer in the paragraph
      • e.g. SQuAD 2.0
      • limitation:
        • only span-based answers (no yes/no, counting, implicit why)
        • ...
    • Open-domain QA

Lecture 11: ConvNets for NLP

mentioned CS231n: Convolutional Neural Networks for Visual Recognition

Lots of common techniques (nowadays)

  • Model Comparison
    • Bag of Vectors: take the word vectors and average them
      • good baseline
      • better if followed by a few ReLU layers
    • Window Model
      • good for single word classification (for problems that don't need wide context e.g. POS, NER)
    • CNNs
      • good for classification
      • need zero padding for shorter phrases
      • easy to parallelize
    • RNNs
      • cognitively plausible (reading from left to right)
      • not the best for classification (if you just use the last state)
      • much slower than CNNs
      • good for sequence tagging
      • great for language models and can be amazing with attention mechanism
  • Dropout
    • for regularization => prevent overfitting
    • gives 2~4% accuracy improvement
  • Gated units used vertically: shortcut connections (needed for very deep networks to work)
    • Residual block
    • Highway block
  • BatchNorm
    • Z-transform: zero mean and unit variance
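
A small PyTorch sketch of the shortcut-connection idea above: a residual block and a highway block (layer sizes are placeholders, not the lecture's exact architecture):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x : the identity shortcut lets gradients flow through deep stacks."""
    def __init__(self, dim=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.f(x) + x

class HighwayBlock(nn.Module):
    """y = t * F(x) + (1 - t) * x : a learned gate t decides how much to transform."""
    def __init__(self, dim=128):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))
        return t * torch.relu(self.transform(x)) + (1 - t) * x

x = torch.randn(4, 128)
print(ResidualBlock()(x).shape, HighwayBlock()(x).shape)
```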

Lecture 12: Information from parts of words: Subword Models

fastText

Lecture 15: Natural Language Generation

Outline

  • Decoding methods
    • Greedy decoding
    • Beam search
    • Sampling-based decoding: good for open-ended/creative generation (poetry, stories)
      • Pure sampling: like greedy decoding, but sample instead of taking the argmax
      • Top-n sampling: like pure sampling, but first truncate the probability distribution to the n most probable words

Softmax temperature: another way to control diversity
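
A short sketch of temperature-scaled sampling with top-n truncation as described above (the logits are toy values; the function name is mine):

```python
import torch

def sample_next_word(logits, temperature=1.0, top_n=None):
    """Sampling-based decoding: higher temperature => flatter distribution => more diversity;
    top_n (if set) first truncates the distribution to the n most probable words."""
    scaled = logits / temperature
    if top_n is not None:
        cutoff = torch.topk(scaled, top_n).values[-1]
        scaled = scaled.masked_fill(scaled < cutoff, float("-inf"))
    probs = torch.softmax(scaled, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

logits = torch.tensor([2.0, 1.5, 0.3, -1.0, -2.0])
print(sample_next_word(logits, temperature=0.7, top_n=3))
```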

  • NLG Tasks
    • Machine Translation
    • (Abstractive) Summarization
      • Evaluation: ROUGE
    • Dialogue
      • chit-chat
      • task-based
    • Creative writing
      • Storytelling
      • Poetry-generation
    • Freeform Question Answering
    • Image captioning
    • ...
  • NLG Evaluation Metrics
    • Word overlap based metrics
      • BLEU
      • ROUGE
      • METEOR
      • F1
      • ...
    • (Perplexity) doesn't tell you anything about generation
    • Word embedding based metrics
    • Human evaluation

Lecture 16: Reference in Language and Coreference Resolution

Outline

  • Coreference Resolution: identify all mentions that refer to the same real world entity
    • Application
      • Full text understanding
      • Machine translation
      • Dialogue systems
    • Step (Pipelined system)
      1. Detect the mentions => using other NLP system
      2. Cluster the mentions
    • End-to-end system
    • Model
      • Rule-based (pronominal anaphora resolution)
        • can't solve sentences which have identical syntactic structure
      • Mention Pair
        • binary classifier: coreferent or not (for every pair of mentions)
        • clustering
          1. pick a threshold and add coreference links when above
          2. take the transitive closure to get the clustering
      • Mention Ranking
        1. assign each mention its highest scoring candidate antecedent
        2. add a dummy NA mention at the front (to allow declining to link)
      • Clustering
        • Agglomerative clustering
          1. start with each mention in its own singleton cluster
          2. merge a pair of clusters at each step
  • Mention: span of text referring to some entity
    1. pronouns
      • captured using a part-of-speech tagger
    2. named entities
      • captured using an NER system
    3. noun phrases
      • captured using a parser (especially a constituency parser)
  • Linguistics stuff
    • Coreference: two mentions refer to the same entity in the world
    • Anaphora: when a term (anaphor) refers to another term (antecedent)
      • Pronominal Anaphora (Coreferential one)
      • Bridging Anaphora (Not Coreferential)
    • Cataphora: when the antecedent comes after the anaphor (instead of before it, as is usual)
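
A tiny sketch of the mention-pair clustering step described above: threshold the pairwise coreference scores, then take the transitive closure of the resulting links (union-find is used here for the closure; the scores are toy values):

```python
def cluster_mentions(num_mentions, pair_scores, threshold=0.5):
    """pair_scores: {(i, j): score} coreference scores for mention pairs (i < j)."""
    parent = list(range(num_mentions))

    def find(x):                       # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in pair_scores.items():
        if score > threshold:          # add a coreference link, merging the two clusters
            parent[find(i)] = find(j)

    clusters = {}
    for m in range(num_mentions):
        clusters.setdefault(find(m), []).append(m)
    return list(clusters.values())

# toy example: mentions 0~1 and 1~2 are likely coreferent, 2~3 is not
print(cluster_mentions(4, {(0, 1): 0.9, (1, 2): 0.8, (2, 3): 0.1}))
# [[0, 1, 2], [3]]
```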

Lecture 17: Multitask Learning: A general model for NLP?

Outline

  • Natural Language Decathlon (decaNLP)
  • 3 equivalent supertasks of NLP
    • Language Modeling
      • predict next word
      • embedding...
    • Question Answering Formalism (Multitask Learning as QA) => train a single question answering model for multiple NLP tasks (posed as questions)
      • Question Answering
      • Machine Translation
      • Summarization
      • Natural Language Inference
      • Sentiment Classification
      • Semantic Role Labeling
      • Relation Extraction
      • Dialogue
      • Semantic Parsing
      • Commonsense Reasoning
    • Dialogue
  • Framework for tackling
    • more general language understanding
    • multitask learning
    • domain adaptation
    • transfer learning
    • weight sharing, pre-training, fine-tuning (towards ImageNet-CNN of NLP)
    • zero-shot learning

Assignments

Assignment 1: Exploring Word Vectors

Outline

  • co-occurrence matrix + Truncated SVD
  • pre-trained word2vec

Assignment 2: word2vec

  • handout
  • directory
    • written
    • code
      • python3 word2vec.py check the correctness of word2vec
      • python3 sgd.py check the correctness of SGD
      • ./get_datasets.sh; python3 run.py - training took 9480 seconds

Outline

  • Train word2vec with skip-gram model and negative sampling using stochastic gradient descent
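
A minimal NumPy sketch of the negative-sampling loss for one (center, outside) pair, along the lines of what this assignment implements (variable names are mine, not the handout's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_vec, outside_vec, neg_vecs):
    """J = -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)

    center_vec:  (d,)   v_c for the center word
    outside_vec: (d,)   u_o for the true outside word
    neg_vecs:    (K, d) u_k for K sampled negative words
    """
    pos = -np.log(sigmoid(outside_vec @ center_vec))
    neg = -np.sum(np.log(sigmoid(-neg_vecs @ center_vec)))
    return pos + neg

rng = np.random.default_rng(0)
print(neg_sampling_loss(rng.normal(size=5), rng.normal(size=5), rng.normal(size=(10, 5))))
```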

Related

Others' Answer

Assignment 3: Dependency Parsing

A Fast and Accurate Dependency Parser using Neural Networks

  • handout
  • directory
    • written
    • code
      • python3 parser_transitions.py part_c check the correctness of transition mechanics
      • python3 parser_transitions.py part_d check the correctness of minibatch parse
      • python3 run.py
        • set debug=True to test the process (debug_out.log)
        • set debug=False to train on the entire dataset (train_out.log)
          • best UAS on the dev set: 88.79 (epoch 9/10)
          • best UAS on the test set: 89.27

Outline

  • Adam Optimizer
  • Dropout
  • Neural Transition-based Dependency Parser (a shift-reduce parser)

Others' Answer

Assignment 4: Neural Machine Translation

  • handout
  • Azure Guide (Google Drive), Practical Guide to VMs (Google Drive)
  • directory
    • written - BLEU Verify
    • code
      • python3 sanity_check.py 1d check the correctness of encode procedure (including utils.pad_sents)
      • python3 sanity_check.py 1e check the correctness of decode procedure (including step function)
      • Preprocess the training data by sh run.sh vocab to get the necessary vocabulary
      • Test the functionality on CPU: train sh run.sh train_local; test sh run.sh test_local
        • (speed about 100 words/sec on Macbook Air 1.8GHz i5 CPU)
      • Train and Test with GPU: train sh run.sh train; test sh run.sh test
        • (speed about 5000 words/sec on Nvidia GeForce GTX 1080 GPU)
        • (this will generate model image model.bin and optimizers' state model.bin.optim)
        • early stop on epoch 13, iter 86000, cum. loss 28.94, cum. ppl 5.13 cum. examples 64000 => Corpus BLEU: 22.36579929869114
      • Compare output with references vim -dO outputs/test_outputs.txt en_es_data/test.en
      • Open three of them at the same time vim -o outputs/test_outputs.txt en_es_data/test.en en_es_data/test.es
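
For reference, a minimal sketch of what a sentence-padding helper like utils.pad_sents does; the exact signature is assumed from the sanity-check description above, not copied from the handout:

```python
def pad_sents(sents, pad_token):
    """Pad a batch of tokenized sentences to the length of the longest one."""
    max_len = max(len(s) for s in sents)
    return [s + [pad_token] * (max_len - len(s)) for s in sents]

print(pad_sents([["I", "ate"], ["I", "ate", "fish"]], "<pad>"))
# [['I', 'ate', '<pad>'], ['I', 'ate', 'fish']]
```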

Others' Answer

Assignment 5: Character-based Neural Machine Translation

build a character level ConvNet

  • handout
  • directory
    • written
    • code
      • Create the correct vocab files sh run.sh vocab
        • vocab_tiny_q1.json: generated vocabulary, source 132 words, target 132 words
          • source: number of word types: 128, number of word types w/ frequency >= 1: 128
          • target: number of word types: 130, number of word types w/ frequency >= 1: 130
        • vocab_tiny_q2.json: generated vocabulary, source 26 words, target 32 words
          • source: number of word types: 128, number of word types w/ frequency >= 2: 22
          • target: number of word types: 130, number of word types w/ frequency >= 2: 30
        • vocab.json: generated vocabulary, source 50004 words, target 50002 words
          • source: number of word types: 172418, number of word types w/ frequency >= 2: 80623
          • target: number of word types: 128873, number of word types w/ frequency >= 2: 64215
      • Sanity Checks python3 sanity_check.py [part]
        • pre-defined: (1e, 1f, 1j, 2a, 2b, 2c, 2d)
        • customized: (1g, 1h, 1i, 1j)
      • Test the first part code at local
        • sh run.sh train_local_q1 - this will run 100 epochs
          • epoch 100, iter 500, cum. loss 0.31, cum. ppl 1.02 cum. examples 200
          • validation: iter 500, dev. ppl 1.003381
        • sh run.sh test_local_q1 - the model should overfit => Corpus BLEU: 99.29792465574434 (> 99)
          • this will generate outputs/test_outputs_local_q1.txt
      • Test the second part code at local
        • sh run.sh train_local_q2
          • epoch 200, iter 1000, cum. loss 0.26, cum. ppl 1.01 cum. examples 200
          • validation: iter 1000, dev. ppl 1.003469
        • sh run.sh test_local_q2 - the model should overfit => Corpus BLEU: 99.29792465574434
          • this will generate outputs/test_outputs_local_q2.txt
      • Train the model with sh run.sh train and test the performance with sh run.sh test
        • epoch 29, iter 196330, avg. loss 90.37, avg. ppl 147.15 cum. examples 10537, speed 3512.25 words/sec, time elapsed 29845.45 sec
        • reached maximum number of epochs! => Corpus BLEU: 24.20035238301319
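
A hedged sketch of the character-level CNN + highway combination this assignment builds (layer sizes and kernel width are placeholders, not the handout's exact specification):

```python
import torch
import torch.nn as nn

class CharCNNEmbed(nn.Module):
    """Character CNN + max-pool + highway: build a word embedding from its characters."""
    def __init__(self, num_chars=96, char_dim=50, word_dim=256, kernel_size=5):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size, padding=1)
        self.gate = nn.Linear(word_dim, word_dim)
        self.proj = nn.Linear(word_dim, word_dim)

    def forward(self, char_ids):                       # (batch, max_word_len)
        x = self.char_embed(char_ids).transpose(1, 2)  # (batch, char_dim, max_word_len)
        x = torch.relu(self.conv(x)).max(dim=2).values # max-pool over character positions
        t = torch.sigmoid(self.gate(x))                # highway gate
        return t * torch.relu(self.proj(x)) + (1 - t) * x

emb = CharCNNEmbed()
print(emb(torch.randint(0, 96, (8, 21))).shape)        # torch.Size([8, 256])
```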

TODO:

  • Enrich the sanity check of the Highway
  • Enrich the sanity check of the CNN
  • Compare the output with Assignment 4 (especially the <unk> words)
  • Written part

Projects

Question Answering on SQuAD

SQuAD is NOT a Natural Language Generation task, since the answer is extracted from the text.

Default final project

Summarization

  • Dataset
  • Metrics
    • ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
    • with small-scale human eval
  • Baseline
    • Simplest model
      • Logistic Regression on unigrams and bigrams
      • Averaging word vectors
    • Lede-3 baseline
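
A minimal sketch of the Lede-3 baseline mentioned above: simply take the first three sentences of the article as the summary (the sentence splitter here is a naive stand-in):

```python
def lede_3(article_text):
    """Lede-3 baseline: the first three sentences of the article serve as the summary."""
    sentences = [s.strip() for s in article_text.split(".") if s.strip()]
    return ". ".join(sentences[:3]) + "."

doc = "First sentence. Second sentence. Third sentence. Fourth sentence."
print(lede_3(doc))   # "First sentence. Second sentence. Third sentence."
```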

Book

O'Reilly Natural Language Processing with PyTorch

Recommended in Lecture 11


PyTorch notes


