Natural Language Processing with Deep Learning


Notes on CS224n: Natural Language Processing with Deep Learning, taught by Christopher Manning at Stanford University.


Why is NLP hard?

  • Complexity in representing, learning and using linguistic/situational/world/visual knowledge

  • Human languages are ambiguous (unlike programming and other formal languages)

  • Human language interpretation depends on real world, common sense, and contextual knowledge



Main idea of word2vec

Two algorithms

  • skip-grams (SG): predict context words given target (position independent)

  • Continuous Bag of Words (CBOW): predict target word from bag-of-words context
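To make the two setups concrete, here is a small sketch (not the course's code) of the training pairs each one generates from a toy sentence with window size $m = 2$; the corpus and window size are made up for illustration.

```python
# Build (input, target) training pairs for skip-gram vs. CBOW from a toy corpus.
corpus = "the quick brown fox jumps over the lazy dog".split()
m = 2  # context window radius

sg_pairs, cbow_pairs = [], []
for t, center in enumerate(corpus):
    context = [corpus[t + j] for j in range(-m, m + 1)
               if j != 0 and 0 <= t + j < len(corpus)]
    # Skip-gram: predict each context word given the center (target) word
    sg_pairs.extend((center, c) for c in context)
    # CBOW: predict the center (target) word from the bag of context words
    cbow_pairs.append((context, center))

print(sg_pairs[:3])   # [('the', 'quick'), ('the', 'brown'), ('quick', 'the')]
print(cbow_pairs[1])  # (['the', 'brown', 'fox'], 'quick')
```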

Two (moderately efficient) training methods

  • Hierarchical softmax

  • Negative sampling

$J(\theta) = -\frac{1}{T}\sum_{t=1}^T \sum_{-m \le j \le m, j \neq 0} \log p(w_{t+j} \mid w_{t})$

$p(o \mid c) = \frac{\exp (u_o^T v_c)}{\sum_{w=1}^V \exp (u_w^T v_c)}$
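Both formulas above can be written down directly in a few lines of NumPy; the embedding matrices `U` (outside vectors $u_w$) and `V` (center vectors $v_w$) below are random stand-ins rather than trained parameters, and the vocabulary size, dimension, and word indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 10, 4
U = rng.normal(size=(vocab_size, d))   # outside vectors u_w, one row per word
V = rng.normal(size=(vocab_size, d))   # center vectors v_w, one row per word

def p_o_given_c(o, c):
    """Naive softmax p(o | c) = exp(u_o^T v_c) / sum_w exp(u_w^T v_c)."""
    scores = U @ V[c]                  # u_w^T v_c for every word w
    scores -= scores.max()             # for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

# One term of J(theta): the negative log-probability of an observed
# (center, outside) pair, here with made-up word indices.
loss_term = -np.log(p_o_given_c(o=3, c=7))
```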



Negative sampling: train binary logistic regressions for a true pair (center word and word in its context window) versus a couple of noise pairs (the center word paired with a random word)

The skip-gram model with negative sampling

$J(\theta) = \frac{1}{T} \sum_{t=1}^T J_t(\theta)$

$J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^{k} \mathbb{E}_{j \sim P(w)} [\log \sigma(-u_j^T v_c)]$ : maximize the probability that the real outside word appears, minimize the probability that $k$ random words appear around the center word
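As a rough NumPy sketch of one $J_t$ term (the vectors and the number of negative samples $k$ are placeholders; in practice the noise words are drawn from the unigram distribution raised to the 3/4 power):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 4, 5
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

v_c = rng.normal(size=d)          # center word vector
u_o = rng.normal(size=d)          # true outside word vector
U_neg = rng.normal(size=(k, d))   # k noise words sampled from P(w)

# Reward the real (center, outside) pair, penalize the k noise pairs.
J_t = np.log(sigmoid(u_o @ v_c)) + np.sum(np.log(sigmoid(-U_neg @ v_c)))
```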

GloVe combines the strengths of count-based and direct-prediction methods: $J(\theta) = \frac{1}{2} \sum_{i,j=1}^W f(P_{ij})(u_i^Tv_j - \log P_{ij})^2$
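A sketch of evaluating the GloVe objective on a toy co-occurrence matrix; the weighting function `f` below uses the commonly cited defaults ($x_{max} = 100$, $\alpha = 3/4$), which are assumptions rather than values stated in these notes.

```python
import numpy as np

rng = np.random.default_rng(2)
W, d = 6, 4
P = rng.integers(1, 50, size=(W, W)).astype(float)  # toy co-occurrence counts
U = rng.normal(size=(W, d))                          # u_i vectors
V = rng.normal(size=(W, d))                          # v_j vectors

def f(x, x_max=100.0, alpha=0.75):
    """Weighting that caps the influence of very frequent co-occurrences."""
    return np.minimum((x / x_max) ** alpha, 1.0)

# J(theta) = 1/2 * sum_ij f(P_ij) * (u_i^T v_j - log P_ij)^2
J = 0.5 * np.sum(f(P) * (U @ V.T - np.log(P)) ** 2)
```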



If you only have a small training dataset, don’t fine-tune the pre-trained word vectors (keep them fixed).

If you have a very large dataset, it may work better to fine-tune the word vectors to the task.
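In Keras terms this advice amounts to toggling whether the embedding layer is trainable; `pretrained` below is a random stand-in for word2vec/GloVe vectors, and the vocabulary size and dimension are made up.

```python
import numpy as np
import tensorflow as tf

vocab_size, dim = 10000, 300
pretrained = np.random.randn(vocab_size, dim).astype("float32")

# Small training set: keep the pre-trained word vectors frozen.
frozen = tf.keras.layers.Embedding(
    vocab_size, dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=False)

# Very large training set: let the vectors be fine-tuned to the task.
tuned = tf.keras.layers.Embedding(
    vocab_size, dim,
    embeddings_initializer=tf.keras.initializers.Constant(pretrained),
    trainable=True)
```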

The max-margin loss: $J = \max (0, 1-s+s_c)$, where $s$ is the score of the true window and $s_c$ the score of a corrupt window.

Idea for the training objective: make the score of the true window larger and the corrupt window’s score lower (until they’re good enough).
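For instance (the scores here are made-up numbers), the loss drops to zero once the true window beats the corrupt one by at least the margin of 1:

```python
def max_margin(s, s_c, margin=1.0):
    """J = max(0, margin - s + s_c) for a true score s and corrupt score s_c."""
    return max(0.0, margin - s + s_c)

print(max_margin(s=2.3, s_c=0.4))  # margin satisfied -> 0.0 ("good enough")
print(max_margin(s=0.8, s_c=0.5))  # margin violated  -> positive loss (~0.7)
```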



Backpropagation is just the chain rule, nothing fancy!
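A toy example of what that means: for $f(x) = \sigma(wx + b)$ the gradient with respect to each parameter is just the product of local derivatives along the path, computed here by hand (the particular function and numbers are made up).

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, x = 0.5, -1.0, 2.0
z = w * x + b           # forward pass
f = sigma(z)

df_dz = f * (1.0 - f)   # local derivative of the sigmoid
df_dw = df_dz * x       # chain rule: df/dw = df/dz * dz/dw
df_db = df_dz * 1.0     # chain rule: df/db = df/dz * dz/db
```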





TensorFlow = Tensor + Flow: tensors flowing through a graph of operations.
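A minimal TensorFlow 2-style sketch of that idea, with the chain rule applied automatically by the framework (the shapes and values are arbitrary):

```python
import tensorflow as tf

x = tf.constant([[1.0, 2.0]])      # input tensor, shape (1, 2)
W = tf.Variable([[0.1], [0.2]])    # parameter tensor, shape (2, 1)

with tf.GradientTape() as tape:
    y = tf.matmul(x, W)            # tensors flow through the matmul op
    loss = tf.reduce_sum(tf.square(y))

grad = tape.gradient(loss, W)      # backprop = the chain rule through the graph
```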