Word Embeddings

$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
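
The analogy is plain vector arithmetic followed by a nearest-neighbour lookup. A minimal sketch, assuming hand-picked, hypothetical 2-D vectors (the two axes stand for "royalty" and "gender"; real embeddings are learned from data and have many more dimensions) and cosine similarity as the distance measure:

```python
import math

# Hypothetical toy embeddings: (royalty, gender). Invented for illustration,
# not learned from any corpus.
toy = {
    "king":  (0.9,  0.8),
    "queen": (0.9, -0.8),
    "man":   (0.1,  0.8),
    "woman": (0.1, -0.8),
    "dog":   (-0.5, 0.1),
}

def cosine(a, b):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, vectors):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = tuple(x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

result = analogy("king", "man", "woman", toy)
print(result)
```

With these toy values the lookup lands on queen because the vectors were constructed that way; with learned embeddings the match is only approximate, hence the ≈.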

Notes:

Word Embeddings

Represent words with one-hot vectors
Train neural network to predict next word
Use large text corpus like Wikipedia

Similar vectors ≈ Related words ≈ Occur in similar contexts

Notes:

Words as vectors

Corpus: The quick brown fox jumps over the lazy dog.
Vocabulary (stop words dropped, words reduced to their base form, sorted alphabetically): [brown, dog, fox, jump, lazy, quick].
Vector for brown (one-hot encoding):

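| Vocabulary word | Component of the vector for brown |
|-----------------|-----------------------------------|
| brown           | 1                                 |
| dog             | 0                                 |
| fox             | 0                                 |
| jump            | 0                                 |
| lazy            | 0                                 |
| quick           | 0                                 |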
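
The encoding can be sketched in a few lines. The helper name `one_hot` is an assumption made for this example; the vocabulary is the one from the slide:

```python
vocabulary = ["brown", "dog", "fox", "jump", "lazy", "quick"]

def one_hot(word, vocab):
    """Return a list with a 1 at the word's index and 0 everywhere else."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

brown = one_hot("brown", vocabulary)
quick = one_hot("quick", vocabulary)
print(brown)  # [1, 0, 0, 0, 0, 0]
print(quick)  # [0, 0, 0, 0, 0, 1]
```

Each vector is as long as the vocabulary, so this representation grows with every new word and carries no notion of similarity by itself.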

Notes: What does the vector for brown look like?

Train similarity

$$
\begin{aligned}
sim(\text{quick}, \text{brown}) &> sim(\text{quick}, \text{dog})\\
sim\left(\begin{pmatrix}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1\end{pmatrix}, \begin{pmatrix}1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0\end{pmatrix}\right) &> sim\left(\begin{pmatrix}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1\end{pmatrix}, \begin{pmatrix}0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0\end{pmatrix}\right)
\end{aligned}
$$
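
One-hot vectors of two different words are orthogonal, so the inequality can only hold after training has replaced them with dense vectors. A minimal sketch, assuming cosine similarity for `sim` (the slide does not name the measure) and hypothetical dense vectors invented only to show the desired ordering:

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# One-hot vectors in the vocabulary order [brown, dog, fox, jump, lazy, quick].
quick = [0, 0, 0, 0, 0, 1]
brown = [1, 0, 0, 0, 0, 0]
dog   = [0, 1, 0, 0, 0, 0]

onehot_quick_brown = cosine(quick, brown)
onehot_quick_dog = cosine(quick, dog)
print(onehot_quick_brown, onehot_quick_dog)  # both are 0.0

# Hypothetical dense vectors (assumed values, not the result of any training).
dense = {"quick": [0.8, 0.1], "brown": [0.7, 0.3], "dog": [-0.2, 0.9]}
dense_quick_brown = cosine(dense["quick"], dense["brown"])
dense_quick_dog = cosine(dense["quick"], dense["dog"])
print(dense_quick_brown > dense_quick_dog)  # True
```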

Notes:

Training data

The quick brown fox jumps over the lazy dog.

Slide window over corpus:

The quick brown fox jumps over the lazy dog
- [the → quick]
The quick brown fox jumps over the lazy dog
- [quick → brown]
The quick brown fox jumps over the lazy dog
- [brown → fox]

When the model sees brown, it should predict fox.
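
The sliding window can be sketched as pairing every token with its successor. Lowercasing and whitespace tokenization are assumptions made for this example:

```python
corpus = "The quick brown fox jumps over the lazy dog"
tokens = corpus.lower().split()

# Pair each word with the word that follows it: (input, target).
pairs = list(zip(tokens, tokens[1:]))

for source, target in pairs[:3]:
    print(f"[{source} → {target}]")
```

A sentence of nine tokens yields eight such training pairs.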

Notes:

Skipgram

Predict context from word.

The quick brown fox jumps over the lazy dog: [quick → the, brown]
The quick brown fox jumps over the lazy dog: [brown → quick, fox]
The quick brown fox jumps over the lazy dog: [fox → brown, jumps]

When the model sees quick, it should predict the or brown.

Window can be larger (a common default is 5 words on each side).

Skipgram is well suited to small data sets and represents rare words well.
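
A minimal sketch of how Skipgram training pairs could be generated. The function name `skipgram_pairs` and the whitespace tokenization are assumptions made for this example; `window` counts words on each side of the center word:

```python
def skipgram_pairs(tokens, window=1):
    """Yield (center, context) pairs for every context word within `window`."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is not its own context
                yield center, tokens[j]

tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = list(skipgram_pairs(tokens, window=1))
print(pairs[:5])
```

Each center word produces one pair per context word, so a larger window produces more training pairs from the same text.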

Notes:

Continuous Bag of Words (CBOW)

Predict word from context.

The quick brown fox jumps over the lazy dog: [the, brown → quick]
The quick brown fox jumps over the lazy dog: [quick, fox → brown]
The quick brown fox jumps over the lazy dog: [brown, jumps → fox]

When the model sees the and brown, it should predict quick.

CBOW trains faster and is slightly more accurate for frequent words.
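
CBOW examples mirror the Skipgram ones: the whole context forms one input and the center word is the target. A minimal sketch, with `cbow_examples` as an assumed helper name and whitespace tokenization:

```python
def cbow_examples(tokens, window=1):
    """Return (context_words, target) examples; the context excludes the target."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

tokens = "the quick brown fox jumps over the lazy dog".split()
examples = cbow_examples(tokens, window=1)
for context, target in examples[1:4]:
    print(f"[{', '.join(context)} → {target}]")
```

Because all context words are combined into a single example per position, CBOW sees fewer training examples than Skipgram for the same text.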

Notes:

Word Embedding visualized

Word Embedding Visual Inspector

Notes:

Word Embedding play time

View Word Embedding Notebook

Download Word Embedding Notebook and simple-wikipedia.zip
unzip -q simple-wikipedia.zip
Run Jupyter:

docker run -p 8888:8888 -e GRANT_SUDO=yes -u root -v "$PWD:/home/jovyan/work" jupyterhub/singleuser

Open http://localhost:8888/lab with the token from the console

Notes:

Word2Vec Alternatives

GloVe

More suitable for document-level tasks
E.g. document-document similarity, topic modeling.
Pre-trained for many languages

fastText

Represents each word by its character n-grams instead of a single word vector
Can build vectors for unknown words from the n-grams they share with known words
Can also be used for text classification
Pre-trained for many languages
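
Why n-grams help with unknown words can be sketched by extracting them by hand. The function `char_ngrams` is an assumption made for this example, not fastText's API; the `<` and `>` boundary markers and the 3 to 6 character range follow the scheme described for fastText, but are plain parameters here:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of '<word>' for n in [n_min, n_max]."""
    marked = f"<{word}>"  # mark word boundaries so prefixes and suffixes differ
    return {
        marked[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(marked) - n + 1)
    }

known = char_ngrams("jumps")
unknown = char_ngrams("jumped")  # pretend this word never occurred in training
shared = known & unknown
print(sorted(shared))
```

The unseen word shares several n-grams with the known one, so a vector assembled from n-gram vectors places it near related words.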

Notes:
