Motivation

Vectors are one way of representing words (or sentences, or smaller tokens). They are convenient for us, because we can use the tools of linear algebra to manipulate them. Most importantly, we can compare them to other vectors using dot product and cosine similarity.
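For a concrete (made-up) illustration, here is a tiny numpy sketch comparing two toy word vectors with the dot product and cosine similarity; the vectors and their values are invented for the example, not taken from any real embedding.

```python
import numpy as np

# Two toy "word vectors" (made-up values, tiny dimensionality for readability).
cat = np.array([0.8, 0.1, 0.3])
dog = np.array([0.7, 0.2, 0.4])

# Dot product: an unnormalized measure of similarity.
dot = cat @ dog

# Cosine similarity: the dot product of the length-normalized vectors,
# always in [-1, 1], so vectors of different magnitudes are comparable.
cosine = dot / (np.linalg.norm(cat) * np.linalg.norm(dog))

print(dot, cosine)
```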

Word vectors of relatively small dimensionality (\(x \in \mathbb{R}^{300}\)) are still the norm in NLP.

WordNet

Before word vectors, we had WordNet, a knowledge hierarchy. WordNet was made by humans to track, among other things, the many different senses of words (‘word sense disambiguation’). It proved rather brittle and could not scale to the level we require for modern language tasks like translation.

One-Hot Vectors

Imagine that each vector has the length of some fixed ‘vocabulary’ size, which might be in the hundreds of thousands (conservatively). Each word is represented by a single element of the vector set to 1; all other elements are zero (hence only ‘one’ element is ‘hot’). One-hot vectors are all orthogonal to each other (the dot product of any two distinct one-hot vectors is zero), which makes them useless for dot product comparison.
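A quick sketch of why that orthogonality is a problem; the toy vocabulary and words are made up:

```python
import numpy as np

vocab = ["cat", "dog", "pizza"]          # toy vocabulary of size 3
V = len(vocab)

def one_hot(word):
    """Return the one-hot vector for a word: all zeros except a single 1."""
    vec = np.zeros(V)
    vec[vocab.index(word)] = 1.0
    return vec

# Distinct one-hot vectors are orthogonal: their dot product is 0, so the
# representation says nothing about how similar "cat" and "dog" actually are.
print(one_hot("cat") @ one_hot("dog"))   # 0.0
print(one_hot("cat") @ one_hot("cat"))   # 1.0
```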

Pretrained Word Vectors

It was common, for a time (2012-2017?), to pretrain word vectors and use these representations as input for some ‘downstream’ task, like sentiment analysis. More recently, large language models train their own word vectors as part of their language modeling task.

Training Word Vectors

Distributional Semantics

The premise that words which frequently appear in the same contexts tend to have similar meanings.

“You shall know a word by the company it keeps.” – JR Firth (1957)

Word2vec (Mikolov et al., 2013)

An early (and successful) attempt to train word vectors based on the distributional hypothesis.

definitions

This model stores two vectors for each word: \(v\) for when the word is the center word, and \(u\) for when it appears in the outer context. \(\theta\), the model parameters, is a vector that contains all of these individual vectors. If each word vector has size \(d\) (for example, 300), then \(\theta \in \mathbb{R}^{2dV}\), where \(V\) is the number of words (the size of the vocabulary).
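A sketch of the parameter count; the matrix names and sizes below are chosen only for illustration:

```python
import numpy as np

V, d = 10_000, 300      # illustrative vocabulary size and vector dimensionality

# One row per word: U holds the outside (context) vectors u, W holds the
# center vectors v. Small random initialization, as is typical.
U = np.random.randn(V, d) * 0.01
W = np.random.randn(V, d) * 0.01

# theta is just every individual vector strung together: 2 * d * V numbers.
theta = np.concatenate([U.ravel(), W.ravel()])
print(theta.shape)      # (6000000,) == (2 * d * V,)
```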

objective

Word2vec is meant to produce word vectors that have a high dot product when the corresponding words frequently appear in context together. If an outside word \(o\) appears in the context of a center word \(c\), then \(u_o^T v_c\) ought to be relatively high. If the two words are not found together in the corpus, the similarity ought to be low.

What is the “bag of words” assumption?

A model like this allows us to assign a “likelihood” to any sentence:

\[L(\theta) = \prod_{t=1}^T \prod_{-m \leq j \leq m, j \neq 0} P(w_{t+j}|w_{t},\theta)\]

We use the model’s word vectors to calculate this likelihood, so we say that the likelihood is “under the model”.

It is easier to work in log space with log probabilities: sums are easier to work with than products, and we won’t have to worry about floating point underflow from multiplying many small probabilities.

\[\log L(\theta) = \sum_{t=1}^T \sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j}|w_{t},\theta)\]

To use backpropagation like all machine learners must, we recast our definition of likelihood into an “objective function”, which we usually call \(J(\theta)\), the loss of our model.

\[J(\theta) = -\frac{1}{T} \sum_{t=1}^T \sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j}|w_{t},\theta)\]

This may look a bit hairy, but take a moment to try out the equation with some sample values. If the loss function gives a low value, that means the model is performing well. \(\log(1) = 0\), so we want our model to assign a high probability to words that do appear together in context.

The \(\frac{1}{T}\) term is there to normalize over the number of words in the sentence, which will vary from sentence to sentence.
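To try the equation with some sample values (the probabilities below are invented): suppose one center position has four context words, and compare a model that assigns each of them probability 0.9 against one that assigns 0.01.

```python
import numpy as np

# One center position (T = 1) with 4 context words in its window.
probs_good = np.full(4, 0.9)    # model assigns each true context word p = 0.9
probs_bad = np.full(4, 0.01)    # model assigns each true context word p = 0.01

T = 1
loss_good = -np.sum(np.log(probs_good)) / T   # ~0.42  (low loss: good model)
loss_bad = -np.sum(np.log(probs_bad)) / T     # ~18.4  (high loss: bad model)
print(loss_good, loss_bad)
```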

softmax

If we want to find \(P(o|c)\), the probability of an outer word given the center word, we need to return a value between 0 and 1. Happily, we have the “softmax” function:

\[P(o|c) = \frac{\exp(u_o^T v_c)}{\sum_{w=1}^{V} \exp(u_w^T v_c)}\]

The \(\exp\) is there to ensure that every term is positive (the dot products themselves can be negative); moreover it amplifies higher values in the distribution without completely squashing lower values.

In the softmax, we’re just taking the exponentiated dot product of our two vectors of interest, normalized by the sum of the exponentiated dot products of every vocabulary word with our center word.

The softmax returns a distribution in which all the values sum to 1. It preserves the ordering of the input, too – the largest values in the input will be the largest values in the output.
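A minimal numpy sketch of the softmax computation of \(P(o|c)\), with random vectors standing in for trained ones:

```python
import numpy as np

V, d = 5, 4                       # toy vocabulary size and dimensionality
rng = np.random.default_rng(0)
U = rng.normal(size=(V, d))       # outside vectors u_w, one row per word
v_c = rng.normal(size=d)          # center vector v_c for some word c

scores = U @ v_c                  # u_w^T v_c for every word w in the vocabulary

# Softmax: exponentiate, then normalize so everything sums to 1.
# Subtracting the max first is a standard numerical-stability trick and
# does not change the result.
exp_scores = np.exp(scores - scores.max())
p = exp_scores / exp_scores.sum()

o = 2                             # index of some outside word
print(p[o], p.sum())              # P(o|c), and the full distribution sums to 1.0
```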

optimization

Trained through stochastic gradient descent using a constant (small) step size, \(\alpha\).

“Simply” take the gradient of the objective function and you’ll get an indication of which parts of \(\theta\) are causing you trouble, and in which direction they need to be tuned.

Gradient Descent: \(\theta_1 = \theta_0 - \alpha \nabla_\theta J(\theta_0)\)
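A sketch of the update rule on a made-up one-dimensional objective (so the gradient is easy to verify by hand); the real gradient comes from differentiating \(J(\theta)\) above:

```python
def J(theta):
    """Stand-in objective, invented for illustration: minimized at theta = 3."""
    return (theta - 3.0) ** 2

def grad_J(theta):
    """Gradient of the stand-in objective."""
    return 2.0 * (theta - 3.0)

alpha = 0.1        # constant, small step size
theta = 0.0        # initial parameter value
for _ in range(100):
    theta = theta - alpha * grad_J(theta)    # step against the gradient

print(theta)       # close to 3.0, the minimizer
```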

What if the step size is too small? Too large?

What are “stochastic” and “batch” gradient descent (two different things), and why do we prefer them to normal gradient descent?

There might be an analogy between the “noisy” updates in stochastic gradient descent, and dithering.

negative sampling

Why do this as opposed to “naive softmax”? (The softmax denominator sums over the entire vocabulary, which is expensive to compute at every training step; negative sampling replaces it with a handful of sampled comparisons.)

\[J_t(\theta) = \log \sigma(u_o^T v_c) + \sum_{i=1}^k \mathbb{E}_{j \sim P(w)} [\log \sigma(-u_j^T v_c)]\]

The objective function above is maximized, not minimized (the negative log likelihood loss, by contrast, was minimized).

sigmoid (logistic) function

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
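Putting the pieces together, a hedged numpy sketch of the per-window negative-sampling objective: all vectors are random stand-ins, the negative indices are sampled uniformly here (word2vec actually samples from a unigram distribution raised to the 3/4 power), and the expectation is approximated by the \(k\) sampled words.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5
U = rng.normal(scale=0.1, size=(V, d))   # outside vectors u_w
W = rng.normal(scale=0.1, size=(V, d))   # center vectors v_w

c, o = 7, 42                             # center word and a true outside word
neg = rng.integers(0, V, size=k)         # k "negative" words drawn from P(w)
                                         # (uniform here, for simplicity)

v_c, u_o = W[c], U[o]

# Maximize: push the true pair's dot product up and the sampled pairs' down.
J_t = np.log(sigmoid(u_o @ v_c)) + np.sum(np.log(sigmoid(-(U[neg] @ v_c))))
print(J_t)
```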

Co-occurrence Matrices

How are they built? What is SVD? How should the counts be scaled?
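One possible starting point for these questions, sketched with a toy corpus; the window size and the log scaling of the counts are illustrative choices, not the only options:

```python
import numpy as np

corpus = ["i like deep learning", "i like nlp", "i enjoy flying"]
window = 1                                      # symmetric context window size

# Build the vocabulary and a word -> index map.
vocab = sorted({w for sent in corpus for w in sent.split()})
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# Count how often each pair of words appears within `window` of each other.
X = np.zeros((V, V))
for sent in corpus:
    words = sent.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                X[idx[w], idx[words[j]]] += 1

# Scale the raw counts; log(1 + count) is one common choice.
X = np.log1p(X)

# Truncated SVD: keep the top-k singular directions as k-dimensional vectors.
U_svd, S, _ = np.linalg.svd(X)
k = 2
word_vectors = U_svd[:, :k] * S[:k]
print(word_vectors.shape)                       # (V, k)
```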