From e8a5d38d3b91a8195e020a8a06f81701e2d4fbc4 Mon Sep 17 00:00:00 2001 From: Farshid Balaneji Date: Tue, 16 Jul 2019 08:08:39 +0200 Subject: [PATCH] Created using Colaboratory --- word_embeddings_tutorial.ipynb | 454 +++++++++++++++++++++++++++++++++ 1 file changed, 454 insertions(+) create mode 100644 word_embeddings_tutorial.ipynb diff --git a/word_embeddings_tutorial.ipynb b/word_embeddings_tutorial.ipynb new file mode 100644 index 0000000..45ba157 --- /dev/null +++ b/word_embeddings_tutorial.ipynb @@ -0,0 +1,454 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "word_embeddings_tutorial.ipynb", + "version": "0.3.2", + "provenance": [], + "include_colab_link": true + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VX3Ymo-FTa_g", + "colab_type": "code", + "colab": {} + }, + "source": [ + "%matplotlib inline" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1jJ0d6SDTa_l", + "colab_type": "text" + }, + "source": [ + "\n", + "Word Embeddings: Encoding Lexical Semantics\n", + "===========================================\n", + "\n", + "Word embeddings are dense vectors of real numbers, one per word in your\n", + "vocabulary. In NLP, it is almost always the case that your features are\n", + "words! But how should you represent a word in a computer? You could\n", + "store its ascii character representation, but that only tells you what\n", + "the word *is*, it doesn't say much about what it *means* (you might be\n", + "able to derive its part of speech from its affixes, or properties from\n", + "its capitalization, but not much). Even more, in what sense could you\n", + "combine these representations? We often want dense outputs from our\n", + "neural networks, where the inputs are $|V|$ dimensional, where\n", + "$V$ is our vocabulary, but often the outputs are only a few\n", + "dimensional (if we are only predicting a handful of labels, for\n", + "instance). How do we get from a massive dimensional space to a smaller\n", + "dimensional space?\n", + "\n", + "How about instead of ascii representations, we use a one-hot encoding?\n", + "That is, we represent the word $w$ by\n", + "\n", + "\\begin{align}\\overbrace{\\left[ 0, 0, \\dots, 1, \\dots, 0, 0 \\right]}^\\text{|V| elements}\\end{align}\n", + "\n", + "where the 1 is in a location unique to $w$. Any other word will\n", + "have a 1 in some other location, and a 0 everywhere else.\n", + "\n", + "There is an enormous drawback to this representation, besides just how\n", + "huge it is. It basically treats all words as independent entities with\n", + "no relation to each other. What we really want is some notion of\n", + "*similarity* between words. Why? Let's see an example.\n", + "\n", + "Suppose we are building a language model. Suppose we have seen the\n", + "sentences\n", + "\n", + "* The mathematician ran to the store.\n", + "* The physicist ran to the store.\n", + "* The mathematician solved the open problem.\n", + "\n", + "in our training data. 
Now suppose we get a new sentence never before\n", + "seen in our training data:\n", + "\n", + "* The physicist solved the open problem.\n", + "\n", + "Our language model might do OK on this sentence, but wouldn't it be much\n", + "better if we could use the following two facts:\n", + "\n", + "* We have seen mathematician and physicist in the same role in a sentence. Somehow they\n", + " have a semantic relation.\n", + "* We have seen mathematician in the same role in this new unseen sentence\n", + " as we are now seeing physicist.\n", + "\n", + "and then infer that physicist is actually a good fit in the new unseen\n", + "sentence? This is what we mean by a notion of similarity: we mean\n", + "*semantic similarity*, not simply having similar orthographic\n", + "representations. It is a technique to combat the sparsity of linguistic\n", + "data, by connecting the dots between what we have seen and what we\n", + "haven't. This example of course relies on a fundamental linguistic\n", + "assumption: that words appearing in similar contexts are related to each\n", + "other semantically. This is called the `distributional\n", + "hypothesis <https://en.wikipedia.org/wiki/Distributional_semantics>`__.\n", + "\n", + "\n", + "Getting Dense Word Embeddings\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "How can we solve this problem? That is, how could we actually encode\n", + "semantic similarity in words? Maybe we think up some semantic\n", + "attributes. For example, we see that both mathematicians and physicists\n", + "can run, so maybe we give these words a high score for the \"is able to\n", + "run\" semantic attribute. Think of some other attributes, and imagine\n", + "what you might score some common words on those attributes.\n", + "\n", + "If each attribute is a dimension, then we might give each word a vector,\n", + "like this:\n", + "\n", + "\\begin{align}q_\\text{mathematician} = \\left[ \\overbrace{2.3}^\\text{can run},\n", + " \\overbrace{9.4}^\\text{likes coffee}, \\overbrace{-5.5}^\\text{majored in Physics}, \\dots \\right]\\end{align}\n", + "\n", + "\\begin{align}q_\\text{physicist} = \\left[ \\overbrace{2.5}^\\text{can run},\n", + " \\overbrace{9.1}^\\text{likes coffee}, \\overbrace{6.4}^\\text{majored in Physics}, \\dots \\right]\\end{align}\n", + "\n", + "Then we can get a measure of similarity between these words by taking the dot product:\n", + "\n", + "\\begin{align}\\text{Similarity}(\\text{physicist}, \\text{mathematician}) = q_\\text{physicist} \\cdot q_\\text{mathematician}\\end{align}\n", + "\n", + "Although it is more common to normalize by the lengths:\n", + "\n", + "\\begin{align}\\text{Similarity}(\\text{physicist}, \\text{mathematician}) = \\frac{q_\\text{physicist} \\cdot q_\\text{mathematician}}\n", + " {\\| q_\\text{physicist} \\| \\| q_\\text{mathematician} \\|} = \\cos (\\phi)\\end{align}\n", + "\n", + "Where $\\phi$ is the angle between the two vectors. That way,\n", + "extremely similar words (words whose embeddings point in the same\n", + "direction) will have similarity 1. Extremely dissimilar words should\n", + "have similarity -1.\n", + "\n", + "\n", + "You can think of the sparse one-hot vectors from the beginning of this\n", + "section as a special case of these new vectors we have defined, where\n", + "each word basically has similarity 0, and we gave each word some unique\n", + "semantic attribute. 
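\n", + "\n", + "To make this concrete, here is a minimal sketch of that normalized similarity in plain PyTorch, using the made-up attribute scores from the hand-crafted vectors above::\n", + "\n", + "    import torch\n", + "\n", + "    q_mathematician = torch.tensor([2.3, 9.4, -5.5])  # hand-crafted attribute scores from above\n", + "    q_physicist = torch.tensor([2.5, 9.1, 6.4])\n", + "    cos_phi = torch.dot(q_physicist, q_mathematician) / (q_physicist.norm() * q_mathematician.norm())\n", + "    print(cos_phi)  # a single number in [-1, 1]; closer to 1 means more similar\n", + "\n", + "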
These new vectors are *dense*, which is to say their\n", + "entries are (typically) non-zero.\n", + "\n", + "But these new vectors are a big pain: you could think of thousands of\n", + "different semantic attributes that might be relevant to determining\n", + "similarity, and how on earth would you set the values of the different\n", + "attributes? Central to the idea of deep learning is that the neural\n", + "network learns representations of the features, rather than requiring\n", + "the programmer to design them herself. So why not just let the word\n", + "embeddings be parameters in our model, and then be updated during\n", + "training? This is exactly what we will do. We will have some *latent\n", + "semantic attributes* that the network can, in principle, learn. Note\n", + "that the word embeddings will probably not be interpretable. That is,\n", + "although with our hand-crafted vectors above we can see that\n", + "mathematicians and physicists are similar in that they both like coffee,\n", + "if we allow a neural network to learn the embeddings and see that both\n", + "mathematicians and physicists have a large value in the second\n", + "dimension, it is not clear what that means. They are similar in some\n", + "latent semantic dimension, but this probably has no interpretation to\n", + "us.\n", + "\n", + "\n", + "In summary, **word embeddings are a representation of the *semantics* of\n", + "a word, efficiently encoding semantic information that might be relevant\n", + "to the task at hand**. You can embed other things too: part of speech\n", + "tags, parse trees, anything! The idea of feature embeddings is central\n", + "to the field.\n", + "\n", + "\n", + "Word Embeddings in Pytorch\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "Before we get to a worked example and an exercise, a few quick notes\n", + "about how to use embeddings in Pytorch and in deep learning programming\n", + "in general. Similar to how we defined a unique index for each word when\n", + "making one-hot vectors, we also need to define an index for each word\n", + "when using embeddings. These will be keys into a lookup table. That is,\n", + "embeddings are stored as a $|V| \\times D$ matrix, where $D$\n", + "is the dimensionality of the embeddings, such that the word assigned\n", + "index $i$ has its embedding stored in the $i$'th row of the\n", + "matrix. 
In all of my code, the mapping from words to indices is a\n", + "dictionary named word\\_to\\_ix.\n", + "\n", + "The module that allows you to use embeddings is torch.nn.Embedding,\n", + "which takes two arguments: the vocabulary size, and the dimensionality\n", + "of the embeddings.\n", + "\n", + "To index into this table, you must use torch.LongTensor (since the\n", + "indices are integers, not floats).\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CRD8rg0zTa_m", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Author: Robert Guthrie\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "import torch.optim as optim\n", + "\n", + "torch.manual_seed(1)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "N3QuqvhtTa_s", + "colab_type": "code", + "colab": {} + }, + "source": [ + "word_to_ix = {\"hello\": 0, \"world\": 1}\n", + "embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings\n", + "lookup_tensor = torch.tensor([word_to_ix[\"hello\"]], dtype=torch.long)\n", + "hello_embed = embeds(lookup_tensor)\n", + "print(hello_embed)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oGyMnnTuTa_v", + "colab_type": "text" + }, + "source": [ + "An Example: N-Gram Language Modeling\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "Recall that in an n-gram language model, given a sequence of words\n", + "$w$, we want to compute\n", + "\n", + "\\begin{align}P(w_i | w_{i-1}, w_{i-2}, \\dots, w_{i-n+1} )\\end{align}\n", + "\n", + "Where $w_i$ is the ith word of the sequence.\n", + "\n", + "In this example, we will compute the loss function on some training\n", + "examples and update the parameters with backpropagation.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gxBdYklXTa_v", + "colab_type": "code", + "colab": {} + }, + "source": [ + "CONTEXT_SIZE = 2\n", + "EMBEDDING_DIM = 10\n", + "# We will use Shakespeare Sonnet 2\n", + "test_sentence = \"\"\"When forty winters shall besiege thy brow,\n", + "And dig deep trenches in thy beauty's field,\n", + "Thy youth's proud livery so gazed on now,\n", + "Will be a totter'd weed of small worth held:\n", + "Then being asked, where all thy beauty lies,\n", + "Where all the treasure of thy lusty days;\n", + "To say, within thine own deep sunken eyes,\n", + "Were an all-eating shame, and thriftless praise.\n", + "How much more praise deserv'd thy beauty's use,\n", + "If thou couldst answer 'This fair child of mine\n", + "Shall sum my count, and make my old excuse,'\n", + "Proving his beauty by succession thine!\n", + "This were to be new made when thou art old,\n", + "And see thy blood warm when thou feel'st it cold.\"\"\".split()\n", + "# we should tokenize the input, but we will ignore that for now\n", + "# build a list of tuples. 
Each tuple is ([ word_i-2, word_i-1 ], target word)\n", + "trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])\n", + " for i in range(len(test_sentence) - 2)]\n", + "# print the first 3, just so you can see what they look like\n", + "print(trigrams[:3])\n", + "\n", + "vocab = set(test_sentence)\n", + "word_to_ix = {word: i for i, word in enumerate(vocab)}\n", + "\n", + "\n", + "class NGramLanguageModeler(nn.Module):\n", + "\n", + " def __init__(self, vocab_size, embedding_dim, context_size):\n", + " super(NGramLanguageModeler, self).__init__()\n", + " self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n", + " self.linear1 = nn.Linear(context_size * embedding_dim, 128)\n", + " self.linear2 = nn.Linear(128, vocab_size)\n", + "\n", + " def forward(self, inputs):\n", + " embeds = self.embeddings(inputs).view((1, -1))\n", + " out = F.relu(self.linear1(embeds))\n", + " out = self.linear2(out)\n", + " log_probs = F.log_softmax(out, dim=1)\n", + " return log_probs\n", + "\n", + "\n", + "losses = []\n", + "loss_function = nn.NLLLoss()\n", + "model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)\n", + "optimizer = optim.SGD(model.parameters(), lr=0.001)\n", + "\n", + "for epoch in range(10):\n", + " total_loss = 0\n", + " for context, target in trigrams:\n", + "\n", + " # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words\n", + " # into integer indices and wrap them in tensors)\n", + " context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)\n", + "\n", + " # Step 2. Recall that torch *accumulates* gradients. Before passing in a\n", + " # new instance, you need to zero out the gradients from the old\n", + " # instance\n", + " model.zero_grad()\n", + "\n", + " # Step 3. Run the forward pass, getting log probabilities over next\n", + " # words\n", + " log_probs = model(context_idxs)\n", + "\n", + " # Step 4. Compute your loss function. (Again, Torch wants the target\n", + " # word wrapped in a tensor)\n", + " loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))\n", + "\n", + " # Step 5. Do the backward pass and update the parameters\n", + " loss.backward()\n", + " optimizer.step()\n", + "\n", + " # Get the Python number from a 1-element Tensor by calling tensor.item()\n", + " total_loss += loss.item()\n", + " losses.append(total_loss)\n", + "print(losses) # The loss decreased every iteration over the training data!" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u-n7T7UfTa_y", + "colab_type": "text" + }, + "source": [ + "Exercise: Computing Word Embeddings: Continuous Bag-of-Words\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep\n", + "learning. It is a model that tries to predict words given the context of\n", + "a few words before and a few words after the target word. This is\n", + "distinct from language modeling, since CBOW is not sequential and does\n", + "not have to be probabilistic. Typically, CBOW is used to quickly train\n", + "word embeddings, and these embeddings are used to initialize the\n", + "embeddings of some more complicated model. Usually, this is referred to\n", + "as *pretraining embeddings*. It almost always helps performance by a couple\n", + "of percent.\n", + "\n", + "The CBOW model is as follows. 
Given a target word $w_i$ and a\n", + "context window of $N$ words on each side, $w_{i-1}, \\dots, w_{i-N}$\n", + "and $w_{i+1}, \\dots, w_{i+N}$, referring to all context words\n", + "collectively as $C$, CBOW tries to minimize\n", + "\n", + "\\begin{align}-\\log p(w_i | C) = -\\log \\text{Softmax}(A(\\sum_{w \\in C} q_w) + b)\\end{align}\n", + "\n", + "where $q_w$ is the embedding for word $w$.\n", + "\n", + "Implement this model in Pytorch by filling in the class below. Some\n", + "tips:\n", + "\n", + "* Think about which parameters you need to define.\n", + "* Make sure you know what shape each operation expects. Use .view() if you need to\n", + " reshape.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6K1mXuSNTa_z", + "colab_type": "code", + "colab": {} + }, + "source": [ + "CONTEXT_SIZE = 2 # 2 words to the left, 2 to the right\n", + "raw_text = \"\"\"We are about to study the idea of a computational process.\n", + "Computational processes are abstract beings that inhabit computers.\n", + "As they evolve, processes manipulate other abstract things called data.\n", + "The evolution of a process is directed by a pattern of rules\n", + "called a program. People create programs to direct processes. In effect,\n", + "we conjure the spirits of the computer with our spells.\"\"\".split()\n", + "\n", + "# By deriving a set from `raw_text`, we deduplicate the array\n", + "vocab = set(raw_text)\n", + "vocab_size = len(vocab)\n", + "\n", + "word_to_ix = {word: i for i, word in enumerate(vocab)}\n", + "data = []\n", + "for i in range(2, len(raw_text) - 2):\n", + " context = [raw_text[i - 2], raw_text[i - 1],\n", + " raw_text[i + 1], raw_text[i + 2]]\n", + " target = raw_text[i]\n", + " data.append((context, target))\n", + "print(data[:5])\n", + "\n", + "\n", + "class CBOW(nn.Module):\n", + "\n", + " def __init__(self):\n", + " pass\n", + "\n", + " def forward(self, inputs):\n", + " pass\n", + "\n", + "# Create your model and train. Here are some functions to help you make\n", + "# the data ready for use by your module\n", + "\n", + "\n", + "def make_context_vector(context, word_to_ix):\n", + " idxs = [word_to_ix[w] for w in context]\n", + " return torch.tensor(idxs, dtype=torch.long)\n", + "\n", + "\n", + "make_context_vector(data[0][0], word_to_ix) # example" + ], + "execution_count": 0, + "outputs": [] + },
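+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One possible way to fill in the exercise above (a minimal sketch, not the only solution): sum the context word embeddings, feed the sum through a single affine map, and apply log softmax, mirroring the formula for $-\\log p(w_i | C)$. The sketch below reuses EMBEDDING_DIM from the n-gram example and the data, word_to_ix and make_context_vector objects defined in the previous cell.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# A sketch of one possible CBOW implementation for the exercise above.\n", + "# Assumes EMBEDDING_DIM, data, word_to_ix and make_context_vector from the earlier cells.\n", + "class CBOWSketch(nn.Module):\n", + "\n", + "    def __init__(self, vocab_size, embedding_dim):\n", + "        super(CBOWSketch, self).__init__()\n", + "        self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n", + "        self.linear = nn.Linear(embedding_dim, vocab_size)  # the affine map A(...) + b\n", + "\n", + "    def forward(self, inputs):\n", + "        # sum the context word embeddings, then map the sum to vocabulary scores\n", + "        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)\n", + "        return F.log_softmax(self.linear(embeds), dim=1)\n", + "\n", + "\n", + "cbow_model = CBOWSketch(vocab_size, EMBEDDING_DIM)\n", + "cbow_loss_function = nn.NLLLoss()\n", + "cbow_optimizer = optim.SGD(cbow_model.parameters(), lr=0.001)\n", + "\n", + "for epoch in range(10):\n", + "    total_loss = 0\n", + "    for context, target in data:\n", + "        context_vector = make_context_vector(context, word_to_ix)\n", + "        cbow_model.zero_grad()\n", + "        log_probs = cbow_model(context_vector)\n", + "        loss = cbow_loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))\n", + "        loss.backward()\n", + "        cbow_optimizer.step()\n", + "        total_loss += loss.item()\n", + "    print(total_loss)  # the total loss should shrink as the epochs go by\n" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file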