Word Embeddings

$$\text{king} - \text{man} + \text{woman} \approx \text{queen}$$
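
A minimal sketch of the analogy arithmetic, assuming cosine similarity as the closeness measure. The vectors in `toy_vectors` are hand-made for illustration, and the helper `analogy` is a hypothetical name; real embeddings are learned from a corpus and have hundreds of dimensions.

```python
import math

# Hand-made toy vectors, invented for illustration (not learned from data).
# Axes, by assumption: [royalty, maleness, femaleness].
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.5, 0.5],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

The input words are excluded from the candidates because the result vector usually stays close to one of them.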

Notes:


Word Embeddings

  • Represent words with one-hot vectors
  • Train neural network to predict next word
  • Use large text corpus like Wikipedia

Similar vectors ≈ Related words ≈ Occur in similar contexts

Notes:


Words as vectors

  • Corpus: The quick brown fox jumps over the lazy dog.
  • Vocabulary (stop words removed, words reduced to base form, sorted alphabetically): [brown, dog, fox, jump, lazy, quick].
  • Vector for brown (one-hot encoding):
| Word  | Vector |
|-------|--------|
| brown | 1      |
| dog   | 0      |
| fox   | 0      |
| jump  | 0      |
| lazy  | 0      |
| quick | 0      |
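
The one-hot encoding can be sketched in a few lines. The function name `one_hot` is hypothetical; the vocabulary is the one from this slide.

```python
def one_hot(word, vocabulary):
    """Return a list with 1 at the word's position and 0 everywhere else."""
    if word not in vocabulary:
        raise KeyError(f"{word!r} is not in the vocabulary")
    return [1 if entry == word else 0 for entry in vocabulary]

vocabulary = ["brown", "dog", "fox", "jump", "lazy", "quick"]

print(one_hot("brown", vocabulary))  # [1, 0, 0, 0, 0, 0]
print(one_hot("quick", vocabulary))  # [0, 0, 0, 0, 0, 1]
```

The vector length equals the vocabulary size, so a realistic corpus yields very long and very sparse vectors.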

Notes: What does the vector for brown look like?


Train similarity

$$\begin{aligned}
sim(\text{quick}, \text{brown}) &> sim(\text{quick}, \text{dog})\\
sim\left(\begin{pmatrix}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1\end{pmatrix}, \begin{pmatrix}1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0\end{pmatrix}\right) &> sim\left(\begin{pmatrix}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1\end{pmatrix}, \begin{pmatrix}0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0\end{pmatrix}\right)
\end{aligned}$$
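
The slide does not define `sim`; the sketch below assumes cosine similarity. It shows why the inequality has to be trained: on raw one-hot vectors every pair of different words has similarity 0, so the relation can only hold for the dense vectors the network learns.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# One-hot vectors over the vocabulary [brown, dog, fox, jump, lazy, quick].
quick = [0, 0, 0, 0, 0, 1]
brown = [1, 0, 0, 0, 0, 0]
dog = [0, 1, 0, 0, 0, 0]

print(cosine(quick, brown))  # 0.0
print(cosine(quick, dog))    # 0.0
print(cosine(quick, quick))  # 1.0
```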

Notes:


Training data

The quick brown fox jumps over the lazy dog.

Slide window over corpus:

  1. **The quick** brown fox jumps over the lazy dog
    • [the, quick]
  2. The **quick brown** fox jumps over the lazy dog
    • [quick, brown]
  3. The quick **brown fox** jumps over the lazy dog
    • [brown, fox]
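
The sliding window can be sketched as follows, assuming lowercasing and whitespace tokenization. The function name `next_word_pairs` is hypothetical.

```python
def next_word_pairs(text):
    """Slide a two-word window over the text and return (input, target) pairs."""
    tokens = text.lower().rstrip(".").split()
    return list(zip(tokens, tokens[1:]))

pairs = next_word_pairs("The quick brown fox jumps over the lazy dog.")
print(pairs[:3])  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```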

When the model sees brown, it should predict fox.

Notes:


Skipgram

Predict context from word.

  1. **The quick brown** fox jumps over the lazy dog: [quick → the, brown]
  2. The **quick brown fox** jumps over the lazy dog: [brown → quick, fox]
  3. The quick **brown fox jumps** over the lazy dog: [fox → brown, jumps]

When the model sees quick, it should predict the or brown.

Window can be larger (recommended: 5).

Skip-gram is well suited to small data sets and represents rare words well.
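
A minimal sketch of how skip-gram training examples can be generated, assuming a symmetric window of `window` words on each side. The function name `skipgram_pairs` is hypothetical and not taken from any library.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context_words) for every position in the token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((center, context))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(tokens)[1:4]:
    print(center, "->", context)
# quick -> ['the', 'brown']
# brown -> ['quick', 'fox']
# fox -> ['brown', 'jumps']
```

Calling `skipgram_pairs(tokens, window=5)` gives the larger window mentioned on the slide.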

Notes:


Continuous Bag of Words (CBOW)

Predict word from context.

  1. **The quick brown** fox jumps over the lazy dog: [the, brown → quick]
  2. The **quick brown fox** jumps over the lazy dog: [quick, fox → brown]
  3. The quick **brown fox jumps** over the lazy dog: [brown, jumps → fox]

When the model sees the and brown, it should predict quick.

CBOW trains faster and is more accurate for frequent words.
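
CBOW examples are the mirror image of skip-gram: the context is the input and the center word is the target. A minimal sketch, again assuming a symmetric window; `cbow_pairs` is a hypothetical name.

```python
def cbow_pairs(tokens, window=1):
    """Return (context_words, target) for every position in the token list."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

tokens = "the quick brown fox jumps over the lazy dog".split()
for context, target in cbow_pairs(tokens)[1:4]:
    print(context, "->", target)
# ['the', 'brown'] -> quick
# ['quick', 'fox'] -> brown
# ['brown', 'jumps'] -> fox
```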

Notes:


Word Embedding visualized

Word Embedding Visual Inspector

Notes:


Word Embedding play time

View Word Embedding Notebook

  1. Download Word Embedding Notebook and simple-wikipedia.zip
  2. unzip -q simple-wikipedia.zip
  3. Run Jupyter:
docker run -p 8888:8888 -e GRANT_SUDO=yes -u root -v "$PWD:/home/jovyan/work" jupyterhub/singleuser
  4. Open http://localhost:8888/lab with the token from the console

Notes:


Word2Vec Alternatives

GloVe

  • More suitable for document-level tasks
  • E.g. document-document similarity, topic modeling.
  • Pre-trained for many languages

fastText

  • Represents each word by its character n-grams rather than as a single unit
  • Can build vectors for unknown words from the n-grams they share with known words
  • Can also be used for text classification
  • Pre-trained for many languages
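
A minimal sketch of character n-gram extraction in the style fastText describes, with `<` and `>` marking word boundaries and n ranging from 3 to 6. The function name `char_ngrams` is hypothetical; this is not the fastText API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the word, including boundary markers."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']

# An unseen word shares n-grams with known words, which is what lets
# a vector be assembled for it.
shared = set(char_ngrams("jumps")) & set(char_ngrams("jumping"))
print(sorted(shared))
```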

Notes: