Notes:
- Represent words with one-hot vectors
- Train neural network to predict next word
- Use large text corpus like Wikipedia
Similar vectors ≈ Related words ≈ Occur in similar contexts
Notes:
- Corpus: The quick brown fox jumps over the lazy dog.
- Vocabulary (stop words removed, "jumps" reduced to "jump", sorted alphabetically): [brown, dog, fox, jump, lazy, quick].
- Vector for brown (one-hot encoding):
Word | Vector |
---|---|
brown | 1 |
dog | 0 |
fox | 0 |
jump | 0 |
lazy | 0 |
quick | 0 |
Notes: What does the vector for brown look like?
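The one-hot encoding in the table can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `one_hot` is hypothetical, and the vocabulary order is taken from the notes above.

```python
# Vocabulary from the example above, in the same (alphabetical) order.
vocabulary = ["brown", "dog", "fox", "jump", "lazy", "quick"]


def one_hot(word, vocab):
    """Return a list with 1 at the word's vocabulary index and 0 elsewhere."""
    if word not in vocab:
        raise KeyError(f"{word!r} is not in the vocabulary")
    return [1 if entry == word else 0 for entry in vocab]


brown_vector = one_hot("brown", vocabulary)
print(brown_vector)  # [1, 0, 0, 0, 0, 0]
```

Each vector is as long as the vocabulary, so real corpora produce very long and very sparse vectors.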
\begin{aligned}
sim(\text{quick}, \text{brown}) &> sim(\text{quick}, \text{dog})\\
sim(\begin{pmatrix}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1\end{pmatrix}, \begin{pmatrix}1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0\end{pmatrix}) &> sim(\begin{pmatrix}0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1\end{pmatrix}, \begin{pmatrix}0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0\end{pmatrix})
\end{aligned}
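The inequality is what we would like to hold, but one-hot vectors cannot express it. The sketch below assumes that `sim` is cosine similarity (the notes do not say which measure is meant) and uses the vocabulary order [brown, dog, fox, jump, lazy, quick] from the earlier example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# One-hot vectors in the vocabulary order [brown, dog, fox, jump, lazy, quick].
quick = [0, 0, 0, 0, 0, 1]
brown = [1, 0, 0, 0, 0, 0]
dog = [0, 1, 0, 0, 0, 0]

sim_quick_brown = cosine_similarity(quick, brown)
sim_quick_dog = cosine_similarity(quick, dog)
print(sim_quick_brown, sim_quick_dog)  # 0.0 0.0
```

Any two distinct one-hot vectors are orthogonal, so every pair of different words gets the same similarity of 0. Learned dense vectors are meant to fix this.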
Notes:
The quick brown fox jumps over the lazy dog.
Slide window over corpus:
- The quick brown fox jumps over the lazy dog: [the → quick]
- The quick brown fox jumps over the lazy dog: [quick → brown]
- The quick brown fox jumps over the lazy dog: [brown → fox]
When the machine sees brown, it should predict fox.
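The sliding window can be sketched as follows. This is an illustrative helper (`next_word_pairs` is a hypothetical name); it assumes lowercasing and whitespace tokenization, which the notes do not specify.

```python
corpus = "The quick brown fox jumps over the lazy dog"
tokens = corpus.lower().split()


def next_word_pairs(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))


pairs = next_word_pairs(tokens)
for current, following in pairs[:3]:
    print(f"[{current} → {following}]")
# [the → quick]
# [quick → brown]
# [brown → fox]
```

Each pair is one training example: the first word is the input and the second is the word to predict.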
Notes:
Predict context from word.
- The quick brown fox jumps over the lazy dog: [quick → the, brown]
- The quick brown fox jumps over the lazy dog: [brown → quick, fox]
- The quick brown fox jumps over the lazy dog: [fox → brown, jumps]
When the machine sees quick, it should predict the or brown.
Window can be larger (recommended: 5).
Skip-gram is well suited to small data sets and represents rare words well.
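A minimal sketch of how skip-gram training pairs could be generated, assuming whitespace tokenization and a symmetric window. The function name `skipgram_pairs` is hypothetical; `window=1` matches the examples above.

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[1:3])  # [('quick', 'the'), ('quick', 'brown')]
```

Raising `window` to 5 adds pairs with more distant neighbours, so each center word produces more training examples.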
Notes:
Predict word from context.
- The quick brown fox jumps over the lazy dog: [the, brown → quick]
- The quick brown fox jumps over the lazy dog: [quick, fox → brown]
- The quick brown fox jumps over the lazy dog: [brown, jumps → fox]
When the machine sees the and brown, it should predict quick.
CBOW trains faster and is slightly more accurate for frequent words.
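CBOW reverses the direction: the context words are the input and the center word is the target. The following is a sketch under the same assumptions as before (whitespace tokens, symmetric window); `cbow_examples` is a hypothetical name.

```python
def cbow_examples(tokens, window=1):
    """Return (context_words, target) examples, one per token."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples


tokens = "the quick brown fox jumps over the lazy dog".split()
examples = cbow_examples(tokens, window=1)
for context, target in examples[1:4]:
    print(f"[{', '.join(context)} → {target}]")
# [the, brown → quick]
# [quick, fox → brown]
# [brown, jumps → fox]
```

CBOW produces one example per position rather than one per context word, which is one intuition for why it trains faster than skip-gram.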
Notes:
Word Embedding Visual Inspector
Notes:
- Download Word Embedding Notebook and simple-wikipedia.zip
unzip -q simple-wikipedia.zip
- Run Jupyter:
docker run -p 8888:8888 -e GRANT_SUDO=yes -u root -v "$PWD:/home/jovyan/work" jupyterhub/singleuser
- Open http://localhost:8888/lab with the token from the console
Notes:
- More suitable for document-level tasks
- E.g. document-document similarity, topic modeling.
- Pre-trained for many languages
- Uses n-grams instead of words
- Can match unknown words by matching n-grams
- Can also be used for text classification
- Pre-trained for many languages
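The n-gram idea can be illustrated with a small sketch. It assumes character n-grams with `<` and `>` as word-boundary markers; the helper name `char_ngrams` and the chosen lengths are illustrative. A real implementation also learns a vector per n-gram and combines them, which is not shown here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[i:i + n])
    return ngrams


# A word seen in training and a word that was never seen.
known = set(char_ngrams("jumps", 3, 3))
unknown = set(char_ngrams("jumping", 3, 3))
shared = sorted(known & unknown)
print(shared)  # ['<ju', 'jum', 'ump']
```

Because the unknown word shares n-grams with a known one, a vector for it can still be assembled from the n-gram vectors.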
Notes: