Lecture 2 – Word Vectors and Word Senses



1. Review: Main idea of word2vec


Word2vec parameters and computations


Word2vec maximizes objective function by putting similar words nearby in space


2. Optimization: Gradient Descent


Gradient Descent


Stochastic Gradient Descent
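The full objective J(θ) is a sum over every window in the corpus, so computing an exact gradient before each update is far too expensive; instead we repeatedly update on small sampled batches of windows. A minimal sketch of one such update, assuming a user-supplied gradient callback (the names grad_fn and batch and the learning-rate value are illustrative, not from the slides):

def sgd_step(theta, grad_fn, batch, lr=0.05):
    # One stochastic gradient step.
    #   theta   : current parameters (all word vectors stacked into one array)
    #   grad_fn : callable returning a gradient estimate from a small batch of windows
    #   lr      : learning rate alpha
    grad = grad_fn(theta, batch)   # noisy estimate of dJ/dtheta from the sampled windows
    return theta - lr * grad       # theta <- theta - alpha * grad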


Stochastic gradients with word vectors!


1b. Word2vec: More details

So far, we have looked at two main classes of methods for learning word embeddings. The first is count-based and relies on matrix factorization (e.g., LSA, HAL). While these methods effectively leverage global statistical information, they are primarily used to capture word similarities and do poorly on tasks such as word analogy, indicating a sub-optimal vector space structure. The second is shallow and window-based (e.g., the skip-gram and CBOW models): these learn word embeddings by making predictions in local context windows. Such models can capture complex linguistic patterns beyond word similarity, but fail to make use of global co-occurrence statistics.


The skip-gram model with negative sampling (HW2)
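For reference, the per-example loss this slide names can be written out directly. A small NumPy sketch for one (center, outside) pair with K sampled negatives; the variable names are mine, but the formula J = -log σ(u_oᵀ v_c) - Σ_k log σ(-u_kᵀ v_c) is the standard negative-sampling objective:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_c, u_o, U_neg):
    # v_c   : center word vector, shape (d,)
    # u_o   : true outside word vector, shape (d,)
    # U_neg : K sampled negative outside vectors, shape (K, d)
    pos = np.log(sigmoid(u_o @ v_c))            # reward the observed (center, outside) pair
    neg = np.log(sigmoid(-U_neg @ v_c)).sum()   # penalize the K sampled noise words
    return -(pos + neg)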


In comparison, GloVe is a weighted least squares model trained on global word-word co-occurrence counts, and thus makes efficient use of corpus statistics. The model produces a word vector space with meaningful sub-structure: it achieves state-of-the-art performance on the word analogy task and outperforms other contemporary methods on several word similarity tasks.
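As a concrete reference for the objective described above, a sketch of the GloVe loss and its weighting function; x_max = 100 and alpha = 0.75 are the defaults reported in the GloVe paper, and the parameter names here are illustrative:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # f(X_ij): down-weights rare pairs and caps the influence of very frequent ones
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    # J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    # summed over pairs with nonzero co-occurrence count X_ij
    i, j = np.nonzero(X)
    dots = np.einsum('nd,nd->n', W[i], W_tilde[j])   # w_i . w~_j for each nonzero pair
    diff = dots + b[i] + b_tilde[j] - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * diff ** 2)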



3. But why not capture co-occurrence counts directly?


Example: Window-based co-occurrence matrix


Window-based co-occurrence matrix
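To make the construction concrete, a small sketch of building a symmetric, window-based co-occurrence matrix; the window size of 1 and the three toy sentences mirror the example corpus used on the slides:

import numpy as np

def cooccurrence_matrix(sentences, window=1):
    # Count how often each word pair appears within `window` positions of each other.
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: k for k, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    X[idx[w], idx[s[j]]] += 1
    return vocab, X

corpus = [["I", "like", "deep", "learning", "."],
          ["I", "like", "NLP", "."],
          ["I", "enjoy", "flying", "."]]
vocab, X = cooccurrence_matrix(corpus, window=1)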


Problems with simple co-occurrence vectors


Solution: Low dimensional vectors


Method 1: Dimensionality Reduction on X (HW1)


Simple SVD word vectors in Python
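The slide named above shows a short Python example; a hedged reconstruction of the idea, reusing the X and vocab built in the previous sketch: run an SVD on the co-occurrence matrix and keep the first k singular dimensions as word vectors.

import numpy as np

# X is the |V| x |V| co-occurrence matrix built above
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                      # keep the top-k singular dimensions
word_vectors = U[:, :k]    # row i is the k-dimensional vector for vocab[i]

for word, vec in zip(vocab, word_vectors):
    print(f"{word:10s} {vec}")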


Hacks to X (several used in Rohde et al. 2005)


Interesting syntactic patterns emerge in the vectors


Count-based vs. direct prediction


How to evaluate word vectors?


Intrinsic word vector evaluation


GloVe Visualizations


GloVe Visualizations: Company - CEO


GloVe Visualizations: Superlatives


Details of intrinsic word vector evaluation
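The standard analogy test asks, for a : b :: c : ?, for the word d whose vector maximizes cosine similarity to (x_b - x_a + x_c), excluding the three query words. A minimal sketch, assuming an embedding matrix E plus word2id / id2word lookups (these names are illustrative):

import numpy as np

def analogy(a, b, c, E, word2id, id2word):
    # Return d maximizing cos(x_b - x_a + x_c, x_d), excluding a, b, c themselves.
    target = E[word2id[b]] - E[word2id[a]] + E[word2id[c]]
    target = target / np.linalg.norm(target)
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    scores = E_norm @ target
    for w in (a, b, c):                  # the query words are not allowed as answers
        scores[word2id[w]] = -np.inf
    return id2word[int(np.argmax(scores))]

# e.g. analogy("man", "king", "woman", E, word2id, id2word) should return "queen"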


Analogy evaluation and hyperparameters



Another intrinsic word vector evaluation


Closest words to “Sweden” (cosine similarity)
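The list on this slide comes from ranking the whole vocabulary by cosine similarity to the query vector. A minimal nearest-neighbor sketch, under the same assumed E / word2id / id2word inputs as above:

import numpy as np

def nearest(query, E, word2id, id2word, k=10):
    # Top-k vocabulary words by cosine similarity to `query`, excluding the query itself.
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[word2id[query]]
    sims[word2id[query]] = -np.inf
    top = np.argsort(-sims)[:k]
    return [(id2word[int(i)], float(sims[i])) for i in top]

# nearest("Sweden", E, word2id, id2word) should rank words like Norway and Denmark highly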


Correlation evaluation
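This evaluation compares model similarities against human judgments (for example, WordSim-353 scores on a 0-10 scale), typically via a rank correlation. A sketch using SciPy's Spearman correlation; the pairs input of (word1, word2, human_score) triples is an assumed format:

import numpy as np
from scipy.stats import spearmanr

def similarity_correlation(pairs, E, word2id):
    # Spearman correlation between human similarity scores and model cosine similarities.
    human, model = [], []
    for w1, w2, score in pairs:
        v1, v2 = E[word2id[w1]], E[word2id[w2]]
        human.append(score)
        model.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
    rho, _ = spearmanr(human, model)
    return rho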


Word senses and word sense ambiguity


pike


Improving Word Representations Via Global Context And Multiple Word Prototypes (Huang et al. 2012)


Linear Algebraic Structure of Word Senses, with Applications to Polysemy


Extrinsic word vector evaluation


Course plan: coming weeks


Office Hours / Help sessions
