|
661 | 661 | "metadata": {},
|
662 | 662 | "source": [
|
663 | 663 | "# 5. Creating Network Components in Pytorch\n",
|
664 |
| - "Before we move on to our focus on NLP, lets do an annotated example of building a network in Pytorch using only affine maps and non-linearities. We will also see how to compute a loss function, using Pytorch's built in negative log likelihood.\n", |
| 664 | + "Before we move on to our focus on NLP, lets do an annotated example of building a network in Pytorch using only affine maps and non-linearities. We will also see how to compute a loss function, using Pytorch's built in negative log likelihood, and update parameters by backpropagation.\n", |
665 | 665 | "\n",
|
666 | 666 | "All network components should inherit from nn.Module and override the forward() method. That is about it, as far as the boilerplate is concerned. Inheriting from nn.Module provides functionality to your component. For example, it makes it keep track of its trainable parameters, you can swap it between CPU and GPU with the .cuda() or .cpu() functions, etc.\n",
|
667 | 667 | "\n",
|
668 | 668 | "Let's write an annotated example of a network that takes in a sparse bag-of-words representation and outputs a probability distribution over two labels: \"English\" and \"Spanish\".\n",
|
669 | 669 | "\n",
|
670 |
| - "Note: This is just for demonstration, so that we can build Pytorch components in later sections and you will know what is going on. Handing in a sparse bag-of-words representation is not how you would actually want to do things." |
| 670 | + "Note: This is just for demonstration, so that we can build Pytorch components in later sections and you will know what is going on. Handing in a sparse bag-of-words representation is not how you would actually want to do things. There are better ways to do text classification. I made up this model to be extremely simple and to not use word embeddings (which we don't introduce until the next section)." |
671 | 671 | ]
|
672 | 672 | },
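A minimal sketch of the component this cell describes, assuming a made-up vocabulary size and label set, and written with the current torch.tensor API rather than the older Variable-based code in the notebook:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoWClassifier(nn.Module):
    # Maps a sparse bag-of-words vector to log probabilities over 2 labels.

    def __init__(self, num_labels, vocab_size):
        # Calling nn.Module's init is what registers parameters, enables .cuda(), etc.
        super(BoWClassifier, self).__init__()
        # A single affine map: vocab_size inputs -> num_labels outputs.
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # Affine map followed by log softmax gives log probabilities.
        return F.log_softmax(self.linear(bow_vec), dim=1)

# Made-up sizes, just to show the shapes involved.
model = BoWClassifier(2, 10)            # 2 labels ("English"/"Spanish"), 10-word vocab
bow_vec = torch.zeros(1, 10)
bow_vec[0, 3] = 1.0                     # pretend the word with index 3 appeared once
log_probs = model(bow_vec)
loss = nn.NLLLoss()(log_probs, torch.tensor([0]))  # pretend the true label is 0
loss.backward()                         # backpropagation fills the .grad fields
```

Because BoWClassifier inherits from nn.Module, model.parameters() already knows about the affine map's weight and bias, which is exactly what an optimizer needs for the parameter update.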
|
673 | 673 | {
|
|
1040 | 1040 | "metadata": {},
|
1041 | 1041 | "source": [
|
1042 | 1042 | "How can we solve this problem? That is, how could we actually encode semantic similarity in words?\n",
|
1043 |
| - "Maybe we think up some lexical attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the \"is able to run\" semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.\n", |
| 1043 | + "Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the \"is able to run\" semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.\n", |
1044 | 1044 | "\n",
|
1045 | 1045 | "If each attribute is a dimension, then we might give each word a vector, like this:\n",
|
1046 | 1046 | "$$ q_\\text{mathematician} = \\left[ \\overbrace{2.3}^\\text{can run},\n",
|
|
1049 | 1049 | "\\overbrace{9.1}^\\text{likes coffee}, \\overbrace{6.4}^\\text{majored in Physics}, \\dots \\right] $$\n",
|
1050 | 1050 | "\n",
|
1051 | 1051 | "Then we can get a measure of similarity between these words by doing:\n",
|
1052 |
| - "$$ \\text{Similarity}(\\text{physicist}, \\text{mathematician}) = q_\\text{physicist} \\cdot q_\\text{mathematician} $$" |
| 1052 | + "$$ \\text{Similarity}(\\text{physicist}, \\text{mathematician}) = q_\\text{physicist} \\cdot q_\\text{mathematician} $$\n", |
| 1053 | + "\n", |
| 1054 | + "Although it is more common to normalize by the lengths:\n", |
| 1055 | + "$$ \\text{Similarity}(\\text{physicist}, \\text{mathematician}) = \\frac{q_\\text{physicist} \\cdot q_\\text{mathematician}}\n", |
| 1056 | + "{\\| q_\\text{\\physicist} \\| \\| q_\\text{mathematician} \\|} = \\cos (\\phi) $$\n", |
| 1057 | + "Where $\\phi$ is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1." |
1053 | 1058 | ]
|
1054 | 1059 | },
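As a quick check on the two formulas, here is how the dot-product and length-normalized similarities could be computed in Pytorch; the attribute vectors are made-up three-dimensional stand-ins for the hand-crafted vectors above:

```python
import torch

# Made-up "semantic attribute" vectors in the spirit of the ones above.
q_mathematician = torch.tensor([2.3, 9.4, 8.2])
q_physicist = torch.tensor([2.5, 9.1, 6.4])

# Unnormalized similarity: the dot product.
dot_sim = torch.dot(q_physicist, q_mathematician)

# Length-normalized similarity: the cosine of the angle phi, always in [-1, 1].
cos_phi = dot_sim / (q_physicist.norm() * q_mathematician.norm())

print(dot_sim.item(), cos_phi.item())
```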
|
1055 | 1060 | {
|
1056 | 1061 | "cell_type": "markdown",
|
1057 | 1062 | "metadata": {},
|
1058 | 1063 | "source": [
|
1059 |
| - "You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where each word basically has similarity 0.\n", |
| 1064 | + "You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where each word basically has similarity 0, and we gave each word some unique semantic attribute.\n", |
1060 | 1065 | "\n",
|
1061 |
| - "But these new vectors are a big pain: you could think of thousands of different lexical attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some *latent lexical attributes* that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicisits have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us." |
| 1066 | + "But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some *latent semantic attributes* that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicisits have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us." |
1062 | 1067 | ]
|
1063 | 1068 | },
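Letting the embeddings be learned parameters is what Pytorch's nn.Embedding module provides; a tiny sketch, with a made-up vocabulary and a made-up number of latent attributes:

```python
import torch
import torch.nn as nn

# Made-up sizes: a 2-word vocabulary with 5 latent semantic attributes per word.
word_to_ix = {"mathematician": 0, "physicist": 1}
embeds = nn.Embedding(2, 5)   # the embedding matrix is a trainable parameter

lookup = torch.tensor([word_to_ix["physicist"]])
print(embeds(lookup))               # the current (randomly initialized) word vector
print(list(embeds.parameters()))    # gradients flow into this matrix during training
```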
|
1064 | 1069 | {
|
|
1116 | 1121 | "$$ P(w_i | w_{i-1}, w_{i-2}, \\dots, w_{i-n+1} ) $$\n",
|
1117 | 1122 | "Where $w_i$ is the ith word of the sequence.\n",
|
1118 | 1123 | "\n",
|
1119 |
| - "In this example, we will compute the loss function on some training examples. We won't yet train the network. We will expand on this example soon." |
| 1124 | + "In this example, we will compute the loss function on some training examples and update the parameters with backpropagation." |
1120 | 1125 | ]
|
1121 | 1126 | },
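A compressed sketch, under made-up assumptions (toy five-word vocabulary, arbitrary embedding and hidden sizes), of the kind of training step this cell introduces: compute the log probability of the target word given its context, take the negative log likelihood loss, backpropagate, and update with SGD. It previews the shape of the example, not the notebook's exact code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

CONTEXT_SIZE = 2     # condition on the two previous words (a trigram model)
EMBEDDING_DIM = 10
vocab = ["the", "cat", "sat", "on", "mat"]          # toy vocabulary
word_to_ix = {w: i for i, w in enumerate(vocab)}

class NGramLM(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLM, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, context_ixs):
        embeds = self.embeddings(context_ixs).view(1, -1)
        out = F.relu(self.linear1(embeds))
        return F.log_softmax(self.linear2(out), dim=1)   # log P(w_i | context)

model = NGramLM(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.NLLLoss()

# One (context, target) training pair: predict "sat" from ["the", "cat"].
context = torch.tensor([word_to_ix["the"], word_to_ix["cat"]])
target = torch.tensor([word_to_ix["sat"]])

model.zero_grad()                  # clear old gradients
loss = loss_fn(model(context), target)
loss.backward()                    # backpropagation
optimizer.step()                   # parameter update
```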
|
1122 | 1127 | {
|
|
1689 | 1694 | "print tag_scores"
|
1690 | 1695 | ]
|
1691 | 1696 | },
|
| 1697 | + { |
| 1698 | + "cell_type": "markdown", |
| 1699 | + "metadata": {}, |
| 1700 | + "source": [ |
| 1701 | + "### Example: An LSTM Language Model\n", |
| 1702 | + "TODO" |
| 1703 | + ] |
| 1704 | + }, |
1692 | 1705 | {
|
1693 | 1706 | "cell_type": "markdown",
|
1694 | 1707 | "metadata": {},
|
|