From e8a5d38d3b91a8195e020a8a06f81701e2d4fbc4 Mon Sep 17 00:00:00 2001 From: Farshid Balaneji Date: Tue, 16 Jul 2019 08:08:39 +0200 Subject: [PATCH] Created using Colaboratory --- word_embeddings_tutorial.ipynb | 454 +++++++++++++++++++++++++++++++++ 1 file changed, 454 insertions(+) create mode 100644 word_embeddings_tutorial.ipynb diff --git a/word_embeddings_tutorial.ipynb b/word_embeddings_tutorial.ipynb new file mode 100644 index 0000000..45ba157 --- /dev/null +++ b/word_embeddings_tutorial.ipynb @@ -0,0 +1,454 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "word_embeddings_tutorial.ipynb", + "version": "0.3.2", + "provenance": [], + "include_colab_link": true + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.6" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + } + }, + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "view-in-github", + "colab_type": "text" + }, + "source": [ + "\"Open" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "VX3Ymo-FTa_g", + "colab_type": "code", + "colab": {} + }, + "source": [ + "%matplotlib inline" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1jJ0d6SDTa_l", + "colab_type": "text" + }, + "source": [ + "\n", + "Word Embeddings: Encoding Lexical Semantics\n", + "===========================================\n", + "\n", + "Word embeddings are dense vectors of real numbers, one per word in your\n", + "vocabulary. In NLP, it is almost always the case that your features are\n", + "words! But how should you represent a word in a computer? You could\n", + "store its ascii character representation, but that only tells you what\n", + "the word *is*, it doesn't say much about what it *means* (you might be\n", + "able to derive its part of speech from its affixes, or properties from\n", + "its capitalization, but not much). Even more, in what sense could you\n", + "combine these representations? We often want dense outputs from our\n", + "neural networks, where the inputs are $|V|$ dimensional, where\n", + "$V$ is our vocabulary, but often the outputs are only a few\n", + "dimensional (if we are only predicting a handful of labels, for\n", + "instance). How do we get from a massive dimensional space to a smaller\n", + "dimensional space?\n", + "\n", + "How about instead of ascii representations, we use a one-hot encoding?\n", + "That is, we represent the word $w$ by\n", + "\n", + "\\begin{align}\\overbrace{\\left[ 0, 0, \\dots, 1, \\dots, 0, 0 \\right]}^\\text{|V| elements}\\end{align}\n", + "\n", + "where the 1 is in a location unique to $w$. Any other word will\n", + "have a 1 in some other location, and a 0 everywhere else.\n", + "\n", + "There is an enormous drawback to this representation, besides just how\n", + "huge it is. It basically treats all words as independent entities with\n", + "no relation to each other. What we really want is some notion of\n", + "*similarity* between words. Why? Let's see an example.\n", + "\n", + "Suppose we are building a language model. Suppose we have seen the\n", + "sentences\n", + "\n", + "* The mathematician ran to the store.\n", + "* The physicist ran to the store.\n", + "* The mathematician solved the open problem.\n", + "\n", + "in our training data. 
Now suppose we get a new sentence never before\n", + "seen in our training data:\n", + "\n", + "* The physicist solved the open problem.\n", + "\n", + "Our language model might do OK on this sentence, but wouldn't it be much\n", + "better if we could use the following two facts:\n", + "\n", + "* We have seen mathematician and physicist in the same role in a sentence. Somehow they\n", + " have a semantic relation.\n", + "* We have seen mathematician in the same role in this new unseen sentence\n", + " as we are now seeing physicist.\n", + "\n", + "and then infer that physicist is actually a good fit in the new unseen\n", + "sentence? This is what we mean by a notion of similarity: we mean\n", + "*semantic similarity*, not simply having similar orthographic\n", + "representations. It is a technique to combat the sparsity of linguistic\n", + "data, by connecting the dots between what we have seen and what we\n", + "haven't. This example of course relies on a fundamental linguistic\n", + "assumption: that words appearing in similar contexts are related to each\n", + "other semantically. This is called the `distributional\n", + "hypothesis <https://en.wikipedia.org/wiki/Distributional_semantics>`__.\n", + "\n", + "\n", + "Getting Dense Word Embeddings\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "How can we solve this problem? That is, how could we actually encode\n", + "semantic similarity in words? Maybe we think up some semantic\n", + "attributes. For example, we see that both mathematicians and physicists\n", + "can run, so maybe we give these words a high score for the \"is able to\n", + "run\" semantic attribute. Think of some other attributes, and imagine\n", + "what you might score some common words on those attributes.\n", + "\n", + "If each attribute is a dimension, then we might give each word a vector,\n", + "like this:\n", + "\n", + "\\begin{align}q_\\text{mathematician} = \\left[ \\overbrace{2.3}^\\text{can run},\n", + " \\overbrace{9.4}^\\text{likes coffee}, \\overbrace{-5.5}^\\text{majored in Physics}, \\dots \\right]\\end{align}\n", + "\n", + "\\begin{align}q_\\text{physicist} = \\left[ \\overbrace{2.5}^\\text{can run},\n", + " \\overbrace{9.1}^\\text{likes coffee}, \\overbrace{6.4}^\\text{majored in Physics}, \\dots \\right]\\end{align}\n", + "\n", + "Then we can get a measure of similarity between these words by taking the dot product:\n", + "\n", + "\\begin{align}\\text{Similarity}(\\text{physicist}, \\text{mathematician}) = q_\\text{physicist} \\cdot q_\\text{mathematician}\\end{align}\n", + "\n", + "Although it is more common to normalize by the lengths:\n", + "\n", + "\\begin{align}\\text{Similarity}(\\text{physicist}, \\text{mathematician}) = \\frac{q_\\text{physicist} \\cdot q_\\text{mathematician}}\n", + " {\\| q_\\text{physicist} \\| \\| q_\\text{mathematician} \\|} = \\cos (\\phi)\\end{align}\n", + "\n", + "Where $\\phi$ is the angle between the two vectors. That way,\n", + "extremely similar words (words whose embeddings point in the same\n", + "direction) will have similarity 1. Extremely dissimilar words should\n", + "have similarity -1.\n", + "\n", + "\n", + "You can think of the sparse one-hot vectors from the beginning of this\n", + "section as a special case of these new vectors we have defined, where\n", + "each word basically has similarity 0, and we gave each word some unique\n", + "semantic attribute. 
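\n", + "\n", + "To make this concrete, here is a minimal sketch of that normalized similarity in plain PyTorch, using the made-up attribute scores from the hand-crafted vectors above::\n", + "\n", + "    import torch\n", + "\n", + "    q_mathematician = torch.tensor([2.3, 9.4, -5.5])  # hand-crafted attribute scores from above\n", + "    q_physicist = torch.tensor([2.5, 9.1, 6.4])\n", + "    cos_phi = torch.dot(q_physicist, q_mathematician) / (q_physicist.norm() * q_mathematician.norm())\n", + "    print(cos_phi)  # a single number in [-1, 1]; closer to 1 means more similar\n", + "\n", + "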
These new vectors are *dense*, which is to say their\n", + "entries are (typically) non-zero.\n", + "\n", + "But these new vectors are a big pain: you could think of thousands of\n", + "different semantic attributes that might be relevant to determining\n", + "similarity, and how on earth would you set the values of the different\n", + "attributes? Central to the idea of deep learning is that the neural\n", + "network learns representations of the features, rather than requiring\n", + "the programmer to design them herself. So why not just let the word\n", + "embeddings be parameters in our model, and then be updated during\n", + "training? This is exactly what we will do. We will have some *latent\n", + "semantic attributes* that the network can, in principle, learn. Note\n", + "that the word embeddings will probably not be interpretable. That is,\n", + "although with our hand-crafted vectors above we can see that\n", + "mathematicians and physicists are similar in that they both like coffee,\n", + "if we allow a neural network to learn the embeddings and see that both\n", + "mathematicians and physicists have a large value in the second\n", + "dimension, it is not clear what that means. They are similar in some\n", + "latent semantic dimension, but this probably has no interpretation to\n", + "us.\n", + "\n", + "\n", + "In summary, **word embeddings are a representation of the *semantics* of\n", + "a word, efficiently encoding semantic information that might be relevant\n", + "to the task at hand**. You can embed other things too: part of speech\n", + "tags, parse trees, anything! The idea of feature embeddings is central\n", + "to the field.\n", + "\n", + "\n", + "Word Embeddings in Pytorch\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "Before we get to a worked example and an exercise, a few quick notes\n", + "about how to use embeddings in Pytorch and in deep learning programming\n", + "in general. Similar to how we defined a unique index for each word when\n", + "making one-hot vectors, we also need to define an index for each word\n", + "when using embeddings. These will be keys into a lookup table. That is,\n", + "embeddings are stored as a $|V| \\times D$ matrix, where $D$\n", + "is the dimensionality of the embeddings, such that the word assigned\n", + "index $i$ has its embedding stored in the $i$'th row of the\n", + "matrix. 
In all of my code, the mapping from words to indices is a\n", + "dictionary named word\\_to\\_ix.\n", + "\n", + "The module that allows you to use embeddings is torch.nn.Embedding,\n", + "which takes two arguments: the vocabulary size, and the dimensionality\n", + "of the embeddings.\n", + "\n", + "To index into this table, you must use torch.LongTensor (since the\n", + "indices are integers, not floats).\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "CRD8rg0zTa_m", + "colab_type": "code", + "colab": {} + }, + "source": [ + "# Author: Robert Guthrie\n", + "\n", + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F\n", + "import torch.optim as optim\n", + "\n", + "torch.manual_seed(1)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": { + "id": "N3QuqvhtTa_s", + "colab_type": "code", + "colab": {} + }, + "source": [ + "word_to_ix = {\"hello\": 0, \"world\": 1}\n", + "embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings\n", + "lookup_tensor = torch.tensor([word_to_ix[\"hello\"]], dtype=torch.long)\n", + "hello_embed = embeds(lookup_tensor)\n", + "print(hello_embed)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oGyMnnTuTa_v", + "colab_type": "text" + }, + "source": [ + "An Example: N-Gram Language Modeling\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "Recall that in an n-gram language model, given a sequence of words\n", + "$w$, we want to compute\n", + "\n", + "\\begin{align}P(w_i | w_{i-1}, w_{i-2}, \\dots, w_{i-n+1} )\\end{align}\n", + "\n", + "Where $w_i$ is the ith word of the sequence.\n", + "\n", + "In this example, we will compute the loss function on some training\n", + "examples and update the parameters with backpropagation.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "gxBdYklXTa_v", + "colab_type": "code", + "colab": {} + }, + "source": [ + "CONTEXT_SIZE = 2\n", + "EMBEDDING_DIM = 10\n", + "# We will use Shakespeare Sonnet 2\n", + "test_sentence = \"\"\"When forty winters shall besiege thy brow,\n", + "And dig deep trenches in thy beauty's field,\n", + "Thy youth's proud livery so gazed on now,\n", + "Will be a totter'd weed of small worth held:\n", + "Then being asked, where all thy beauty lies,\n", + "Where all the treasure of thy lusty days;\n", + "To say, within thine own deep sunken eyes,\n", + "Were an all-eating shame, and thriftless praise.\n", + "How much more praise deserv'd thy beauty's use,\n", + "If thou couldst answer 'This fair child of mine\n", + "Shall sum my count, and make my old excuse,'\n", + "Proving his beauty by succession thine!\n", + "This were to be new made when thou art old,\n", + "And see thy blood warm when thou feel'st it cold.\"\"\".split()\n", + "# we should tokenize the input, but we will ignore that for now\n", + "# build a list of tuples. 
Each tuple is ([ word_i-2, word_i-1 ], target word)\n", + "trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])\n", + " for i in range(len(test_sentence) - 2)]\n", + "# print the first 3, just so you can see what they look like\n", + "print(trigrams[:3])\n", + "\n", + "vocab = set(test_sentence)\n", + "word_to_ix = {word: i for i, word in enumerate(vocab)}\n", + "\n", + "\n", + "class NGramLanguageModeler(nn.Module):\n", + "\n", + " def __init__(self, vocab_size, embedding_dim, context_size):\n", + " super(NGramLanguageModeler, self).__init__()\n", + " self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n", + " self.linear1 = nn.Linear(context_size * embedding_dim, 128)\n", + " self.linear2 = nn.Linear(128, vocab_size)\n", + "\n", + " def forward(self, inputs):\n", + " embeds = self.embeddings(inputs).view((1, -1))\n", + " out = F.relu(self.linear1(embeds))\n", + " out = self.linear2(out)\n", + " log_probs = F.log_softmax(out, dim=1)\n", + " return log_probs\n", + "\n", + "\n", + "losses = []\n", + "loss_function = nn.NLLLoss()\n", + "model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)\n", + "optimizer = optim.SGD(model.parameters(), lr=0.001)\n", + "\n", + "for epoch in range(10):\n", + " total_loss = 0\n", + " for context, target in trigrams:\n", + "\n", + " # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words\n", + " # into integer indices and wrap them in tensors)\n", + " context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)\n", + "\n", + " # Step 2. Recall that torch *accumulates* gradients. Before passing in a\n", + " # new instance, you need to zero out the gradients from the old\n", + " # instance\n", + " model.zero_grad()\n", + "\n", + " # Step 3. Run the forward pass, getting log probabilities over next\n", + " # words\n", + " log_probs = model(context_idxs)\n", + "\n", + " # Step 4. Compute your loss function. (Again, Torch wants the target\n", + " # word wrapped in a tensor)\n", + " loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))\n", + "\n", + " # Step 5. Do the backward pass and update the parameters\n", + " loss.backward()\n", + " optimizer.step()\n", + "\n", + " # Get the Python number from a 1-element Tensor by calling tensor.item()\n", + " total_loss += loss.item()\n", + " losses.append(total_loss)\n", + "print(losses) # The loss decreased every iteration over the training data!" + ], + "execution_count": 0, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u-n7T7UfTa_y", + "colab_type": "text" + }, + "source": [ + "Exercise: Computing Word Embeddings: Continuous Bag-of-Words\n", + "~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n", + "\n", + "The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep\n", + "learning. It is a model that tries to predict words given the context of\n", + "a few words before and a few words after the target word. This is\n", + "distinct from language modeling, since CBOW is not sequential and does\n", + "not have to be probabilistic. Typically, CBOW is used to quickly train\n", + "word embeddings, and these embeddings are used to initialize the\n", + "embeddings of some more complicated model. Usually, this is referred to\n", + "as *pretraining embeddings*. It almost always helps performance by a couple\n", + "of percent.\n", + "\n", + "The CBOW model is as follows. 
Given a target word $w_i$ and a\n", + "context window of $N$ words on each side, $w_{i-1}, \\dots, w_{i-N}$\n", + "and $w_{i+1}, \\dots, w_{i+N}$, referring to all context words\n", + "collectively as $C$, CBOW tries to minimize\n", + "\n", + "\\begin{align}-\\log p(w_i | C) = -\\log \\text{Softmax}(A(\\sum_{w \\in C} q_w) + b)\\end{align}\n", + "\n", + "where $q_w$ is the embedding for word $w$.\n", + "\n", + "Implement this model in Pytorch by filling in the class below. Some\n", + "tips:\n", + "\n", + "* Think about which parameters you need to define.\n", + "* Make sure you know what shape each operation expects. Use .view() if you need to\n", + " reshape.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "metadata": { + "id": "6K1mXuSNTa_z", + "colab_type": "code", + "colab": {} + }, + "source": [ + "CONTEXT_SIZE = 2 # 2 words to the left, 2 to the right\n", + "raw_text = \"\"\"We are about to study the idea of a computational process.\n", + "Computational processes are abstract beings that inhabit computers.\n", + "As they evolve, processes manipulate other abstract things called data.\n", + "The evolution of a process is directed by a pattern of rules\n", + "called a program. People create programs to direct processes. In effect,\n", + "we conjure the spirits of the computer with our spells.\"\"\".split()\n", + "\n", + "# By deriving a set from `raw_text`, we deduplicate the array\n", + "vocab = set(raw_text)\n", + "vocab_size = len(vocab)\n", + "\n", + "word_to_ix = {word: i for i, word in enumerate(vocab)}\n", + "data = []\n", + "for i in range(2, len(raw_text) - 2):\n", + " context = [raw_text[i - 2], raw_text[i - 1],\n", + " raw_text[i + 1], raw_text[i + 2]]\n", + " target = raw_text[i]\n", + " data.append((context, target))\n", + "print(data[:5])\n", + "\n", + "\n", + "class CBOW(nn.Module):\n", + "\n", + " def __init__(self):\n", + " pass\n", + "\n", + " def forward(self, inputs):\n", + " pass\n", + "\n", + "# Create your model and train. Here are some functions to help you make\n", + "# the data ready for use by your module\n", + "\n", + "\n", + "def make_context_vector(context, word_to_ix):\n", + " idxs = [word_to_ix[w] for w in context]\n", + " return torch.tensor(idxs, dtype=torch.long)\n", + "\n", + "\n", + "make_context_vector(data[0][0], word_to_ix) # example" + ], + "execution_count": 0, + "outputs": [] + },
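+ { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "One possible way to fill in the exercise above (a minimal sketch, not the only solution): sum the context word embeddings, feed the sum through a single affine map, and apply log softmax, mirroring the formula for $-\\log p(w_i | C)$. The sketch below reuses EMBEDDING_DIM from the n-gram example and the data, word_to_ix and make_context_vector objects defined in the previous cell.\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# A sketch of one possible CBOW implementation for the exercise above.\n", + "# Assumes EMBEDDING_DIM, data, word_to_ix and make_context_vector from the earlier cells.\n", + "class CBOWSketch(nn.Module):\n", + "\n", + "    def __init__(self, vocab_size, embedding_dim):\n", + "        super(CBOWSketch, self).__init__()\n", + "        self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n", + "        self.linear = nn.Linear(embedding_dim, vocab_size)  # the affine map A(...) + b\n", + "\n", + "    def forward(self, inputs):\n", + "        # sum the context word embeddings, then map the sum to vocabulary scores\n", + "        embeds = self.embeddings(inputs).sum(dim=0).view(1, -1)\n", + "        return F.log_softmax(self.linear(embeds), dim=1)\n", + "\n", + "\n", + "cbow_model = CBOWSketch(vocab_size, EMBEDDING_DIM)\n", + "cbow_loss_function = nn.NLLLoss()\n", + "cbow_optimizer = optim.SGD(cbow_model.parameters(), lr=0.001)\n", + "\n", + "for epoch in range(10):\n", + "    total_loss = 0\n", + "    for context, target in data:\n", + "        context_vector = make_context_vector(context, word_to_ix)\n", + "        cbow_model.zero_grad()\n", + "        log_probs = cbow_model(context_vector)\n", + "        loss = cbow_loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))\n", + "        loss.backward()\n", + "        cbow_optimizer.step()\n", + "        total_loss += loss.item()\n", + "    print(total_loss)  # the total loss should shrink as the epochs go by\n" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file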