Course notes for Stanford CS224n, Winter 2019 (using PyTorch)
Some more general notes I write in my Deep Learning Practice repository
Course Related Links
- Course Main Page: Winter 2019 (latest)
- Lecture Videos
- Stanford Online Hub - CS224n
Lecture
- Introduction and Word Vectors
- Word Vectors 2 and Word Senses
- Word Window Classification, Neural Networks, and Matrix Calculus
- Backpropagation and Computation Graphs
- Linguistic Structure: Dependency Parsing
- The probability of a sentence? Recurrent Neural Networks and Language Models
- Vanishing Gradients and Fancy RNNs
- Machine Translation, Seq2Seq and Attention
- Practical Tips for Final Projects - Default Final Project
- Question Answering and the Default Final Project - Default Final Project
- ConvNets for NLP
- Information from parts of words: Subword Models - Assignment 5
- Modeling contexts of use: Contextual Representations and Pretraining - ELMo, BERT
- Transformers and Self-Attention For Generative Models - Self-attention, Transformer
- Natural Language Generation
- Reference in Language and Coreference Resolution
- Multitask Learning: A general model for NLP?
- Constituency Parsing and Tree Recursive Neural Networks - TODO
- Safety, Bias, and Fairness
- Future of NLP + Deep Learning
Assignment
- Exploring Word Vectors
- word2vec
- code
- written
- Dependency Parsing
- code
- written
- Neural Machine Translation
- code
- written
- Character-based Neural Machine Translation
- code
- written - TODO
Project
- Question Answering (Default)
- Summarization
Paper reading
- word2vec
- negative sampling
- GloVe
- improving distributional similarity
- embedding evaluation methods
- Transformer
- ELMo
- BERT
- fastText
Derivation
- backprop
- slides
- notes
- readings
- Gensim example
- preparing embedding: download this zip file and unzip the `glove.6B.*d.txt` files into the `embedding/GloVe` directory
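A minimal loading sketch for the Gensim example, assuming gensim >= 4.0 and the `embedding/GloVe` layout above (the 100d file is an arbitrary choice):

```python
from gensim.models import KeyedVectors

# Load the unzipped GloVe vectors; no_header=True tells gensim the file is in
# GloVe's plain text format (parameter available in gensim >= 4.0).
glove = KeyedVectors.load_word2vec_format(
    "embedding/GloVe/glove.6B.100d.txt", binary=False, no_header=True
)

print(glove["king"].shape)              # (100,)
print(glove.most_similar("king", topn=3))
```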
Outline
- Introduction to Word2vec
- objective function
- prediction function
- how to train it
- Optimization: Gradient Descent & Chain Rule
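A minimal numpy sketch of the prediction function and naive-softmax objective for one (center, outside) word pair, plus the gradients SGD would use; the variable names are mine, not the lecture's code:

```python
import numpy as np

def naive_softmax_loss(center_vec, outside_idx, U):
    """center_vec: (d,) vector v_c; U: (V, d) matrix of outside vectors u_w."""
    scores = U @ center_vec                          # (V,) dot products u_w . v_c
    scores -= scores.max()                           # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()    # prediction function P(w | c)
    loss = -np.log(probs[outside_idx])               # objective: -log P(o | c)
    # gradients for SGD (chain rule): dJ/dv_c and dJ/dU
    y = np.zeros_like(probs)
    y[outside_idx] = 1.0
    grad_center = U.T @ (probs - y)                  # (d,)
    grad_U = np.outer(probs - y, center_vec)         # (V, d)
    return loss, grad_center, grad_U
```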
Outline
- More detail to Word2vec
- Skip-grams (SG)
- Continuous Bag of Words (CBOW)
- Similarity visualization
- Co-occurrence matrix + SVD (LSA) vs. Embedding
- Evaluation on word vectors
- Intrinsic
- Extrinsic
CS 168 The Modern Algorithmic Toolbox - for SVD
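For intrinsic evaluation, a quick analogy/similarity check with gensim, assuming the GloVe vectors were loaded as `glove` in the Gensim example above:

```python
# Intrinsic evaluation: word analogies via vector arithmetic, e.g. king - man + woman ≈ queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# Cosine similarity between two word vectors.
print(glove.similarity("coffee", "tea"))
```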
- slides
- matrix calculus
- notes
- readings
- additional readings
Outline
- Some basic idea of NLP tasks
- Matrix Calculus
- Jacobian Matrix
- Shape convention
- Loss
- Softmax
- Cross-entropy
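A small numpy sketch of the softmax and cross-entropy pieces above, with the gradient written in the shape convention (the gradient has the same shape as the parameter); the numbers are toy values:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_hat, y):
    """y_hat: predicted distribution, y: one-hot true label."""
    return -np.sum(y * np.log(y_hat))

theta = np.array([2.0, 1.0, 0.1])   # toy 3-class scores
y = np.array([1.0, 0.0, 0.0])
y_hat = softmax(theta)
loss = cross_entropy(y_hat, y)
grad_theta = y_hat - y              # d(loss)/d(theta): same shape as theta (shape convention)
```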
Outline
- Computational Graph
- Backprop & Forwardprop
- Introducing regularization to prevent overfitting
- Non-linearity: activation functions
- Practical Tips
- Parameter Initialization
- Optimizers
- plain SGD
- more sophisticated adaptive optimizers
- Learning Rates
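A tiny PyTorch sketch of the pieces in this outline: the forward pass builds the computation graph, `backward()` runs backprop, and the optimizer applies the update (sizes are arbitrary; `weight_decay` stands in for L2 regularization):

```python
import torch
import torch.nn as nn

# toy 1-hidden-layer network: the computation graph is built during the forward pass
model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 3))
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)  # plain SGD + L2 regularization

x = torch.randn(4, 10)                              # batch of 4 examples
y = torch.tensor([0, 2, 1, 0])
loss = nn.functional.cross_entropy(model(x), y)     # forward prop
opt.zero_grad()
loss.backward()                                     # backprop through the graph
opt.step()                                          # parameter update (swap in torch.optim.Adam, etc.)
```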
- slides
- notes
- readings
- Incrementality in Deterministic Dependency Parsing
- A Fast and Accurate Dependency Parser using Neural Networks
- Dependency Parsing
- Globally Normalized Transition-Based Neural Networks
- Universal Stanford Dependencies: A cross-linguistic typology
- Universal Dependencies website
Outline
- Methods of Dependency Parsing
- Dynamic Programming
- complexity O(n³)
- Graph Algorithm
- create a minimum spanning tree for a sentence
- Constraint Satisfaction
- edges are eliminated that don't satisfy hard constraints
- Transition-based Parsing / Deterministic Dependency Parsing
- greedy choice of attachments guided by machine learning classifier
- complexity O(n)
- Operations of the Shift-reduce Parser
- Shift
- Left-Arc
- Right-Arc
- Attachment Errors
- Prepositional Phrase Attachment Errors
- Verb Phrase Attachment Errors
- Modifier Attachment Errors
- Coordination Attachment Errors
mentioned CS103, CS228
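A minimal sketch of the Shift / Left-Arc / Right-Arc transitions listed above, in the spirit of the assignment's `PartialParse` (my own simplification, with the transition sequence hand-picked instead of predicted by a classifier):

```python
class PartialParse:
    def __init__(self, sentence):
        self.stack = ["ROOT"]
        self.buffer = list(sentence)      # words not yet processed
        self.dependencies = []            # (head, dependent) arcs

    def parse_step(self, transition):
        if transition == "S":             # Shift: move the first buffer word onto the stack
            self.stack.append(self.buffer.pop(0))
        elif transition == "LA":          # Left-Arc: second-from-top depends on the top
            dependent = self.stack.pop(-2)
            self.dependencies.append((self.stack[-1], dependent))
        elif transition == "RA":          # Right-Arc: top depends on the second-from-top
            dependent = self.stack.pop()
            self.dependencies.append((self.stack[-1], dependent))

pp = PartialParse(["I", "ate", "fish"])
for t in ["S", "S", "LA", "S", "RA", "RA"]:   # in practice a classifier makes these greedy choices
    pp.parse_step(t)
print(pp.dependencies)   # [('ate', 'I'), ('ate', 'fish'), ('ROOT', 'ate')]
```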
- slides
- notes
- readings
- N-gram Language Models (textbook chapter)
- The Unreasonable Effectiveness of Recurrent Neural Networks (blog post overview)
- Sequence Modeling: Recurrent and Recursive Neural Nets (Sections 10.1 and 10.2)
- On Chomsky and the Two Cultures of Statistical Learning
- N-gram Language Model
- Fixed-window Neural Language Model
- vanilla RNN
- Language Modeling: the task of predicting the next word, given the words so far
- Language Model: a system that produces the probability distribution for the next candidate word
- Conditional Language Modeling: the task of predicting the next word, given the words so far, and also some other input x
- Machine Translation (x=source sentence, y=target sentence)
- Summarization (x=input text, y=summarized text)
- Dialogue (x=dialogue history, y=next utterance)
- ...
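A tiny PyTorch sketch of a vanilla RNN used as a language model, i.e. predicting the next word given the words so far (shapes and vocabulary size are placeholders):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, embed_dim)
rnn_cell = nn.RNNCell(embed_dim, hidden_dim)      # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
out = nn.Linear(hidden_dim, vocab_size)           # scores over the vocabulary

words = torch.tensor([[11, 42, 7]])               # "the words so far" (batch of 1)
h = torch.zeros(1, hidden_dim)
for t in range(words.size(1)):
    h = rnn_cell(embed(words[:, t]), h)
next_word_probs = torch.softmax(out(h), dim=-1)   # distribution over the next word
```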
- slides
- notes - same as lecture 6
- readings
- Sequence Modeling: Recurrent and Recursive Neural Nets - (textbook sections 10.3, 10.5, 10.7-10.12)
- Learning long-term dependencies with gradient descent is difficult (one of the original vanishing gradient papers)
- On the difficulty of training Recurrent Neural Networks (proof of vanishing gradient problem)
- Vanishing Gradients Jupyter Notebook (demo for feedforward networks)
- Understanding LSTM Networks (blog post overview)
Vanishing gradient =>
- LSTM and GRU
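A short PyTorch sketch: an LSTM over a batch of sequences plus gradient clipping, the usual fix for exploding (as opposed to vanishing) gradients; all dimensions are arbitrary:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
x = torch.randn(8, 20, 64)              # (batch, seq_len, features)
output, (h_n, c_n) = lstm(x)            # gating lets gradients flow through the cell state

loss = output.sum()                     # stand-in for a real loss
loss.backward()
torch.nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=5.0)   # clip exploding gradients
```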
- slides
- notes
- readings
- Statistical Machine Translation slides, CS224n 2015 (lectures 2/3/4)
- Statistical Machine Translation (book by Philipp Koehn)
- BLEU (a Method for Automatic Evaluation of Machine Translation) (original paper)
- Sequence to Sequence Learning with Neural Networks (original seq2seq NMT paper)
- Sequence Transduction with Recurrent Neural Networks (early seq2seq speech recognition paper)
- Neural Machine Translation by Jointly Learning to Align and Translate (original seq2seq+attention paper)
- Attention and Augmented Recurrent Neural Networks (blog post overview)
- Massive Exploration of Neural Machine Translation Architectures (practical advice for hyperparameter choices)
- Training method: Teacher Forcing
- During training, we feed the gold (aka reference) target sentence into the decoder, regardless of what the decoder predicts.
- During testing (decoding): Beam Search vs. Greedy Decoding
- Decoding Algorithm: an algorithm you use to generate text from your language model
- Greedy Decoding => lack of backtracking
- on each step take the most probable word (i.e. argmax)
- use that as the next word, and feed it as input on the next step
- keep going until you produce `<END>` or reach some max length
- Beam Search: aims to find a high-probability sequence by tracking multiple possible sequences at once
- on each step of the decoder, keep track of the k (beam size) most probable partial sequences (hypotheses)
- after you reach some stopping criterion (collect n complete hypotheses, where each hypothesis stops when it produces `<END>` or reaches the max depth), choose the sequence with the highest probability (with score normalization)
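A compact sketch of greedy decoding against a hypothetical `decoder_step(prev_word, state)` that returns next-word log-probabilities and a new state; beam search would instead keep the k best partial hypotheses at each step:

```python
import torch

def greedy_decode(decoder_step, init_state, start_id, end_id, max_len=50):
    """decoder_step is assumed to return (log_probs over the vocab, new_state)."""
    words, state = [start_id], init_state
    for _ in range(max_len):
        log_probs, state = decoder_step(words[-1], state)
        next_word = int(torch.argmax(log_probs))   # most probable word, no backtracking
        words.append(next_word)
        if next_word == end_id:                    # stop when <END> is produced
            break
    return words
```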
- slides
- readings
ELMo, BERT
guest lecture
- slides
- readings
Self-attention, Transformer
- slides
- notes - Good notes about finding existing research, datasets and tasks
- readings
- Practical Methodology (Deep Learning book chapter)
Vanishing Gradient, LSTM, GRU (again)
some more Attention, mentioned CS 276: Information Retrieval and Web Search
Quick notes about QA:
- QA types
- Factoid QA: the answer is a named entity (something with a clear semantic type)
- Extractive QA: answer must be a span (a sub-sequence of words) in the passage
- e.g. SQuAD 1.X
- defect: all questions have an answer in the paragraph => turned into a kind of a ranking task
- Extractive QA + NoAnswer: some question might have no answer in the paragraph
- e.g. SQuAD 2.0
- limitation:
- only span-based answers (no yes/no, counting, implicit why)
- ...
- Open-domain QA
mentioned CS231n: Convolutional Neural Networks for Visual Recognition
Lots of common techniques (nowadays)
- Model Comparison
- Bag of Vectors: take the word vectors and average them
- good baseline
- even better if followed by a few ReLU layers
- Window Model
- good for single word classification (for problems that don't need wide context e.g. POS, NER)
- CNNs
- good for classification
- need zero padding for shorter phrases
- easy to parallelize
- RNNs
- cognitively plausible (reading from left to right)
- not best for classification (if you just use the last state)
- much slower than CNNs
- good for sequence tagging
- great for language models and can be amazing with attention mechanism
- Dropout
- for regularization => prevent overfitting
- gives 2~4% accuracy improvement
- Gated units used vertically: shortcut connections (needed for very deep networks to work)
- Residual block
- Highway block
- BatchNorm
- Z-transform: zero mean and unit variance
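A minimal PyTorch sketch of the CNN-for-classification recipe above (embedding → Conv1d with zero padding → max-over-time pooling → dropout → linear); all sizes are placeholders:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, n_filters=64, kernel_size=3, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size, padding=1)  # zero padding for short phrases
        self.dropout = nn.Dropout(0.5)                # regularization => prevent overfitting
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)        # (batch, embed_dim, seq_len) for Conv1d
        x = torch.relu(self.conv(x))
        x = x.max(dim=2).values                       # max-over-time pooling
        return self.fc(self.dropout(x))

logits = TextCNN()(torch.randint(0, 5000, (4, 12)))  # (4, 2)
```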
- slides
- readings
fastText
Outline
- Decoding methods
- Greedy decoding
- Beam search
- Sampling-based decoding: good for open-ended/creative generation (poetry, stories)
- Pure sampling: like greedy decoding, but sample instead of argmax
- Top-n sampling: like pure sampling, but truncate the probability distribution
Softmax temperature: another way to control diversity (see the sampling sketch after this outline)
- NLG Tasks
- Machine Translation
- (Abstractive) Summarization
- Evaluation: ROUGE
- Dialogue
- chit-chat
- task-based
- Creative writing
- Storytelling
- Poetry-generation
- Freeform Question Answering
- Image captioning
- ...
- NLG Evaluation Metrics
- Word overlap based metrics
- BLEU
- ROUGE
- METEOR
- F1
- ...
- (Perplexity) doesn't tell you anything about generation
- Word embedding based metrics
- Human evaluation
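A small PyTorch sketch of sampling-based decoding with top-n truncation and softmax temperature, as described in the outline above (the logits here are random stand-ins):

```python
import torch

def sample_next(logits, temperature=1.0, top_n=None):
    """Sample the next token id from logits (shape: vocab_size)."""
    logits = logits / temperature                 # temperature < 1 => peakier (less diverse), > 1 => flatter
    if top_n is not None:                         # top-n sampling: keep only the n most probable tokens
        top = torch.topk(logits, top_n)
        mask = torch.full_like(logits, float("-inf"))
        logits = mask.scatter(0, top.indices, top.values)
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

next_id = sample_next(torch.randn(1000), temperature=0.8, top_n=40)
```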
Outline
- Coreference Resolution: identify all mentions that refer to the same real world entity
- Application
- Full text understanding
- Machine translation
- Dialogue systems
- Steps (pipelined system)
- Detect the mentions => using other NLP systems
- Cluster the mentions
- End-to-end system
- Model
- Rule-based (pronominal anaphora resolution)
- can't solve sentences which have identical syntactic structure
- Mention Pair
- binary classifier: coreferent or not (for every pair of mentions)
- clustering
- pick a threshold and add coreference links when above it
- take the transitive closure to get the clustering (see the sketch after this outline)
- Mention Ranking
- assign each mention its highest scoring candidate antecedent
- add a dummy NA mention at the front (so a mention can decline to link)
- Clustering
- Agglomerative clustering
- start with each mention in its own singleton cluster
- merge a pair of clusters at each step
- Mention: span of text referring to some entity
- pronouns
- captured using a part-of-speech tagger
- named entities
- captured using an NER system
- noun phrases
- captured using a parser (especially a constituency parser)
- Linguistics stuff
- Coreference: two mentions refer to the same entity in the world
- Anaphora: when a term (anaphor) refers to another term (antecedent)
- Pronominal Anaphora (Coreferential one)
- Bridging Anaphora (Not Coreferential)
- Cataphora: when the antecedent comes after the anaphor (usually it comes before)
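The mention-pair clustering step referenced above, as a small union-find sketch: threshold the pairwise scores, add links, and take the transitive closure (the mentions and scores are made up):

```python
def cluster_mentions(mentions, pair_scores, threshold=0.5):
    """pair_scores: dict mapping (i, j) mention-index pairs to coreference probabilities."""
    parent = list(range(len(mentions)))

    def find(i):                      # union-find gives the transitive closure of the links
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), score in pair_scores.items():
        if score > threshold:         # add a coreference link when above the threshold
            parent[find(i)] = find(j)

    clusters = {}
    for idx, m in enumerate(mentions):
        clusters.setdefault(find(idx), []).append(m)
    return list(clusters.values())

mentions = ["Sally", "she", "the dog", "her"]
scores = {(0, 1): 0.9, (0, 3): 0.7, (1, 2): 0.1}
print(cluster_mentions(mentions, scores))   # [['Sally', 'she', 'her'], ['the dog']]
```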
Outline
- Natural Language Decathlon (decaNLP)
- => reduce subtasks to a more general task => transfer knowledge from other tasks => maybe then we can do Zero-shot Learning / Transfer Learning
- salesforce/decaNLP: The Natural Language Decathlon: A Multitask Challenge for NLP
- 3 equivalent supertasks of NLP
- Language Modeling
- predict next word
- embedding...
- Question Answering Formalism (Multitask Learning as QA) => Training single question answering model for multiple NLP tasks (aka. questions)
- Question Answering
- Machine Translation
- Summarization
- Natural Language Inference
- Sentiment Classification
- Semantic Role Labeling
- Relation Extraction
- Dialogue
- Semantic Parsing
- Commonsense Reasoning
- Dialogue
- Framework for tackling
- more general language understanding
- multitask learning
- domain adaptation
- transfer learning
- weight sharing, pre-training, fine-tuning (towards ImageNet-CNN of NLP)
- zero-shot learning
Outline
- co-occurrence matrix + Truncated SVD
- pre-trained word2vec
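A toy sketch of the count-based pipeline using numpy and scikit-learn's TruncatedSVD (the corpus and window size here are made up):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = [["all", "that", "glitters", "is", "not", "gold"],
          ["all", "is", "well", "that", "ends", "well"]]
words = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(words)}

# Build a symmetric co-occurrence matrix with a fixed window size.
window = 2
M = np.zeros((len(words), len(words)))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                M[idx[w], idx[sent[j]]] += 1

# Reduce to k dimensions with truncated SVD (LSA-style embeddings).
embeddings = TruncatedSVD(n_components=2, n_iter=10).fit_transform(M)
print(embeddings.shape)   # (vocab_size, 2)
```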
Outline
- Train word2vec with skip-gram model and negative sampling using stochastic gradient descent
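A numpy sketch of the skip-gram negative-sampling objective for a single (center, outside) pair with K sampled negatives; the variable names are mine and the assignment organizes this differently:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center_vec, outside_vec, neg_vecs):
    """J = -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)"""
    pos = sigmoid(outside_vec @ center_vec)
    negs = sigmoid(-neg_vecs @ center_vec)          # (K,) one score per negative sample
    loss = -np.log(pos) - np.sum(np.log(negs))
    # gradients used by stochastic gradient descent
    grad_center = (pos - 1.0) * outside_vec + (1.0 - negs) @ neg_vecs
    grad_outside = (pos - 1.0) * center_vec
    grad_negs = np.outer(1.0 - negs, center_vec)    # (K, d)
    return loss, grad_center, grad_outside, grad_negs
```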
Related
Others' Answer
- handout
- directory
- written
- code
- `python3 parser_transitions.py part_c`: check the correctness of the transition mechanics
- `python3 parser_transitions.py part_d`: check the correctness of the minibatch parse
- `python3 run.py`
    - set `debug=True` to test the process (`debug_out.log`)
    - set `debug=False` to train on the entire dataset (`train_out.log`)
    - best UAS on the dev set: 88.79 (epoch 9/10)
    - best UAS on the test set: 89.27
Outline
- Adam Optimizer
- Dropout
- Neural Transition-based Dependency Parser (a shift-reduce parser)
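A small numpy sketch of one Adam update (the optimizer listed in this outline): a moving average of gradients for momentum and a moving average of squared gradients for adaptive scaling; bias correction is omitted for brevity and the names are mine:

```python
import numpy as np

def adam_update(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (bias correction omitted for brevity)."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # adaptive scale: moving average of squared gradients
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v
```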
Others' Answer
- handout
- Azure Guide (Google Drive), Practical Guide to VMs (Google Drive)
- directory
- written - BLEU Verify
- A Gentle Introduction to Calculating the BLEU Score for Text in Python - `nltk.translate.bleu_score`
- Tilde Interactive BLEU score evaluator - input txt
- code
- `python3 sanity_check.py 1d`: check the correctness of the encode procedure (including `utils.pad_sents`)
- `python3 sanity_check.py 1e`: check the correctness of the decode procedure (including the step function)
- Preprocess the training data with `sh run.sh vocab` to get the necessary vocabulary
- Test the functionality on CPU: train with `sh run.sh train_local`; test with `sh run.sh test_local`
    - (speed about 100 words/sec on a MacBook Air 1.8GHz i5 CPU)
- Train and test with GPU: train with `sh run.sh train`; test with `sh run.sh test`
    - (speed about 5000 words/sec on an Nvidia GeForce GTX 1080 GPU)
    - (this will generate the model image `model.bin` and the optimizer's state `model.bin.optim`)
    - early stop on `epoch 13, iter 86000, cum. loss 28.94, cum. ppl 5.13 cum. examples 64000` => Corpus BLEU: 22.36579929869114
- Compare output with references: `vim -dO outputs/test_outputs.txt en_es_data/test.en`
- Open three of them at the same time: `vim -o outputs/test_outputs.txt en_es_data/test.en en_es_data/test.es`
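To double-check the reported Corpus BLEU with `nltk.translate.bleu_score` (paths follow the files above; tokenization is a naive whitespace split, so the number may differ slightly from the assignment's own script):

```python
from nltk.translate.bleu_score import corpus_bleu

# Each reference entry is a list of token lists (one reference translation per source sentence here).
with open("en_es_data/test.en") as f:
    references = [[line.split()] for line in f]
with open("outputs/test_outputs.txt") as f:
    hypotheses = [line.split() for line in f]

print(corpus_bleu(references, hypotheses) * 100)   # should be close to the reported Corpus BLEU
```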
Others' Answer
Build a character-level ConvNet
- handout
- directory
- written
- code
- Create the correct vocab files with `sh run.sh vocab`
    - `vocab_tiny_q1.json`: generated vocabulary, source 132 words, target 132 words
        - source: number of word types: 128, number of word types w/ frequency >= 1: 128
        - target: number of word types: 130, number of word types w/ frequency >= 1: 130
    - `vocab_tiny_q2.json`: generated vocabulary, source 26 words, target 32 words
        - source: number of word types: 128, number of word types w/ frequency >= 2: 22
        - target: number of word types: 130, number of word types w/ frequency >= 2: 30
    - `vocab.json`: generated vocabulary, source 50004 words, target 50002 words
        - source: number of word types: 172418, number of word types w/ frequency >= 2: 80623
        - target: number of word types: 128873, number of word types w/ frequency >= 2: 64215
- Sanity Checks: `python3 sanity_check.py [part]`
- pre-defined: (1e, 1f, 1j, 2a, 2b, 2c, 2d)
- customized: (1g, 1h, 1i, 1j)
- Test the first part of the code locally
    - `sh run.sh train_local_q1`: this will run 100 epochs
        - `epoch 100, iter 500, cum. loss 0.31, cum. ppl 1.02 cum. examples 200`
        - `validation: iter 500, dev. ppl 1.003381`
    - `sh run.sh test_local_q1`: the model should overfit => Corpus BLEU: 99.29792465574434 (> 99)
        - this will generate `outputs/test_outputs_local_q1.txt`
- Test the second part of the code locally
    - `sh run.sh train_local_q2`
        - `epoch 200, iter 1000, cum. loss 0.26, cum. ppl 1.01 cum. examples 200`
        - `validation: iter 1000, dev. ppl 1.003469`
    - `sh run.sh test_local_q2`: the model should overfit => Corpus BLEU: 99.29792465574434
        - this will generate `outputs/test_outputs_local_q2.txt`
- Train the model with `sh run.sh train` and test the performance with `sh run.sh test`
    - `epoch 29, iter 196330, avg. loss 90.37, avg. ppl 147.15 cum. examples 10537, speed 3512.25 words/sec, time elapsed 29845.45 sec`
    - `reached maximum number of epochs!`
    - => Corpus BLEU: 24.20035238301319
TODO:
- Enrich the sanity check of the Highway
- Enrich the sanity check of the CNN
- Compare the output with Assignment 4 (especially the `<unk>` words)
- Written part
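For the Highway sanity check, a minimal highway layer of the form gate * proj(x) + (1 - gate) * x (my own sketch; the assignment's module may differ in details such as initialization):

```python
import torch
import torch.nn as nn

class Highway(nn.Module):
    """x_highway = gate * proj(x) + (1 - gate) * x, with proj = ReLU(Linear) and gate = sigmoid(Linear)."""
    def __init__(self, embed_size):
        super().__init__()
        self.proj = nn.Linear(embed_size, embed_size)
        self.gate = nn.Linear(embed_size, embed_size)

    def forward(self, x):
        x_proj = torch.relu(self.proj(x))
        x_gate = torch.sigmoid(self.gate(x))
        return x_gate * x_proj + (1.0 - x_gate) * x

# sanity check: the output shape should match the input shape
out = Highway(embed_size=256)(torch.randn(4, 256))
assert out.shape == (4, 256)
```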
SQuAD is NOT a Natural Language Generation task (since the answer is extracted from the text).
Default final project
- Dataset
- Metrics
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- with small scale human eval
- Baseline
- Simplest model
- Logistic Regression on unigrams and bigrams
- Averaging word vectors
- Lede-3 baseline
Recommend in Lecture 11
- joosthub/PyTorchNLPBook: Code and data accompanying Natural Language Processing with PyTorch, published by O'Reilly Media
- Course contents backup
- Software - The Stanford Natural Language Processing Group
- Others' answer
- Luvata/CS224N-2019 (almost finish all the written part as well)
- ZacBi/CS224n-2019-solutions (didn't finish the written part)
- youngmihuang/cs224n_exercise (only 2019 a1~a4 coding part)
- Observerspy/CS224n (not fully 2019)
- caijie12138/CS224n-2019 (not quite the assignment)
- ZeyadZanaty/cs224n-assignments (just coding part assignment 2, 3)
PyTorch notes
- Element-wise Product: `A * B`, `torch.mul(A, B)`, `A.mul(B)`
- Matrix Multiplication: `A @ B`, `torch.matmul(A, B)`, `torch.mm`, `torch.bmm`, ...
- RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
    - `.view()` => error (only on CPU, because `tensor.cuda()` automatically makes the tensor contiguous)
    - `.contiguous().view()` => okay
    - `.reshape()` => okay
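A tiny repro of the view/contiguous behavior above (shapes are arbitrary):

```python
import torch

x = torch.randn(2, 3, 4).transpose(1, 2)   # transpose makes the tensor non-contiguous
# x.view(2, 12)                  # would raise the RuntimeError above on a non-contiguous tensor
y1 = x.contiguous().view(2, 12)  # okay: make it contiguous first
y2 = x.reshape(2, 12)            # okay: reshape copies if needed
assert torch.equal(y1, y2)
```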