Natural Language Processing with Deep Learning
Notes on CS224n: Natural Language Processing with Deep Learning, taught by Christopher Manning at Stanford University.
Why is NLP hard?
Complexity in representing, learning, and using linguistic/situational/world/visual knowledge
Human languages are ambiguous (unlike programming and other formal languages)
Human language interpretation depends on real world, common sense, and contextual knowledge
Main idea of word2vec
Two algorithms
Skip-gram (SG): predict context words given the target word (position independent)
Continuous Bag of Words (CBOW): predict target word from bag-of-words context
Two (moderately efficient) training methods
Hierarchical softmax
Negative sampling
Skip-gram objective (average negative log likelihood over a corpus of $T$ words, with window size $m$): $J(\theta) = -\frac{1}{T}\sum_{t=1}^T \sum_{-m \le j \le m,\, j \neq 0} \log p(w_{t+j} \mid w_{t})$
with the softmax prediction probability $p(o \mid c) = \frac{\exp (u_o^T v_c)}{\sum_{w=1}^V \exp (u_w^T v_c)}$, where $v_c$ is the center-word vector and $u_o$ the outside-word vector
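As a rough sketch of what this softmax computes, here is a small numpy example; the matrices `U` (outside-word vectors) and `V` (center-word vectors), their shapes, and the toy usage are illustrative assumptions, not the course's implementation.

```python
import numpy as np

def softmax_prob(o, c, U, V):
    """p(o | c) under the skip-gram softmax.

    U: (vocab_size, d) outside-word vectors u_w (one per row)
    V: (vocab_size, d) center-word vectors v_w (one per row)
    o, c: integer indices of the outside and center words
    """
    scores = U @ V[c]                       # u_w^T v_c for every word w in the vocabulary
    scores = scores - scores.max()          # shift for numerical stability (softmax is invariant)
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

# Toy usage with random vectors (vocabulary of 10 words, dimension 4)
rng = np.random.default_rng(0)
U = rng.normal(size=(10, 4))
V = rng.normal(size=(10, 4))
print(softmax_prob(o=3, c=7, U=U, V=V))
```

The denominator runs over the whole vocabulary, which is exactly the cost that hierarchical softmax and negative sampling are designed to avoid.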
Negative sampling: train binary logistic regressions for a true pair (center word and word in its context window) versus a couple of noise pairs (the center word paired with a random word)
The skip-gram model with negative sampling
$J(\theta) = \frac{1}{T} \sum_{t=1}^T J_t(\theta)$
$J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)}\left[\log \sigma(-u_j^T v_c)\right]$ with $k$ negative words sampled from a noise distribution $P(w)$: maximize the probability that the real outside word appears, minimize the probability that random words appear around the center word
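A minimal numpy sketch of this per-position objective, assuming one true (center, outside) pair and $k$ pre-sampled negative indices; the variable names and the uniform sampling in the toy usage are assumptions (word2vec itself samples negatives from the unigram distribution raised to the 3/4 power).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(c, o, neg_ids, U, V):
    """J_t for one position under skip-gram with negative sampling.

    c: center word index, o: true outside word index
    neg_ids: indices of k words drawn from the noise distribution P(w)
    U, V: outside-word and center-word embedding matrices (rows are vectors)
    """
    pos_term = np.log(sigmoid(U[o] @ V[c]))                 # reward the real (center, outside) pair
    neg_term = np.log(sigmoid(-(U[neg_ids] @ V[c]))).sum()  # penalize the k noise pairs
    return pos_term + neg_term                              # this is the quantity being maximized

# Toy usage: one true outside word and k = 5 negatives, sampled uniformly here for simplicity
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(100, 8))
V = rng.normal(scale=0.1, size=(100, 8))
neg_ids = rng.integers(0, 100, size=5)
print(sgns_objective(c=1, o=2, neg_ids=neg_ids, U=U, V=V))
```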
GloVe: $J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^T v_j - \log P_{ij})^2$, where $P_{ij}$ is the word co-occurrence count and $f$ is a weighting function that caps the influence of very frequent pairs; GloVe combines the strengths of count-based methods and direct prediction methods
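A toy numpy sketch of this weighted least-squares objective; the weighting function here uses the clipped power form with $x_{max}=100$ and $\alpha=0.75$ from the GloVe paper, and the matrices are random placeholders rather than real co-occurrence counts.

```python
import numpy as np

def glove_loss(U, V, P, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective over co-occurrence counts P.

    U, V: (W, d) matrices of word vectors u_i and v_j
    P:    (W, W) co-occurrence count matrix; only nonzero entries contribute
    """
    loss = 0.0
    num_words = P.shape[0]
    for i in range(num_words):
        for j in range(num_words):
            if P[i, j] > 0:
                f = min(1.0, (P[i, j] / x_max) ** alpha)   # weighting f(P_ij), capped at 1
                diff = U[i] @ V[j] - np.log(P[i, j])
                loss += 0.5 * f * diff ** 2
    return loss

# Toy usage on a tiny random "co-occurrence" matrix
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(6, 4))
V = rng.normal(scale=0.1, size=(6, 4))
P = rng.integers(0, 5, size=(6, 6)).astype(float)
print(glove_loss(U, V, P))
```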
If you only have a small training dataset, don’t fine-tune the pre-trained word vectors (keep them fixed).
If you have a very large dataset, it may work better to train (fine-tune) the word vectors to the task.
The max-margin loss: $J = \max (0, 1 - s + s_c)$, where $s$ is the score of the true window and $s_c$ the score of a corrupt window
Idea for the training objective: make the true window’s score larger and the corrupt window’s score lower (until they’re good enough)
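A small numpy sketch of this idea, pairing the hinge loss with a hypothetical one-hidden-layer window scorer $s = u^T \tanh(Wx + b)$; the scorer and all names are illustrative assumptions.

```python
import numpy as np

def window_score(x, W, u, b):
    """Illustrative one-hidden-layer scorer: s = u^T tanh(W x + b) on a concatenated window x."""
    return float(u @ np.tanh(W @ x + b))

def max_margin_loss(s_true, s_corrupt):
    """J = max(0, 1 - s + s_c): zero once the true window outscores the corrupt one by margin 1."""
    return max(0.0, 1.0 - s_true + s_corrupt)

# Toy usage: 5-word window, 4-dimensional word vectors, corrupt window = center word replaced at random
rng = np.random.default_rng(0)
d, window = 4, 5
W = rng.normal(size=(8, d * window))
u = rng.normal(size=8)
b = rng.normal(size=8)
x_true = rng.normal(size=d * window)
x_corrupt = x_true.copy()
x_corrupt[2 * d:3 * d] = rng.normal(size=d)    # swap in a random center-word vector
print(max_margin_loss(window_score(x_true, W, u, b), window_score(x_corrupt, W, u, b)))
```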
Backpropagation is just the chain rule. Nothing fancy!
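For example, here is a hand-derived backward pass through the same kind of scorer, checked against a numerical gradient; this is an illustrative sketch, not code from the course.

```python
import numpy as np

# Forward pass: s = u^T a,  a = tanh(z),  z = W x + b
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)
u = rng.normal(size=2)
z = W @ x + b
a = np.tanh(z)
s = u @ a

# Backward pass: chain rule, step by step
delta = u * (1.0 - a ** 2)       # ds/dz = ds/da * da/dz = u * tanh'(z)
grad_W = np.outer(delta, x)      # ds/dW = delta x^T
grad_b = delta                   # ds/db = delta
grad_x = W.T @ delta             # ds/dx, the gradient passed down to the word vectors

# Sanity check of one entry against a numerical derivative
eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
numeric = (u @ np.tanh(W_pert @ x + b) - s) / eps
print(grad_W[0, 1], numeric)     # the two values should agree closely
```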
TensorFlow = Tensor + Flow: a computation graph whose nodes are operations and whose edges carry tensors (multidimensional arrays)
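A minimal sketch of that idea, assuming the TensorFlow 2 eager API: tensors flow forward through operation nodes, and gradients flow back along the same graph.

```python
import tensorflow as tf

# Tensors (multidimensional arrays) flow through a graph of operations
x = tf.constant([[1.0, 2.0]])            # a 1x2 input tensor
W = tf.Variable([[0.5], [-0.3]])         # a 2x1 weight tensor
with tf.GradientTape() as tape:
    y = tf.matmul(x, W)                  # an op node: y = x W
    loss = tf.reduce_sum(tf.square(y))   # further ops: loss = sum(y^2)
grad = tape.gradient(loss, W)            # gradients flow back along the same graph
print(y.numpy(), grad.numpy())
```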