
Commit 60544a3

BiLSTM CRF commenting and more examples.
1 parent e548bd6 commit 60544a3


2 files changed: +177 -29 lines


Deep Learning for Natural Language Processing with Pytorch.ipynb

Lines changed: 175 additions & 26 deletions
@@ -13,18 +13,18 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": 1,
 "metadata": {
 "collapsed": false
 },
 "outputs": [
 {
 "data": {
 "text/plain": [
-"<torch._C.Generator at 0x7f3085495af8>"
+"<torch._C.Generator at 0x10ee09738>"
 ]
 },
-"execution_count": 2,
+"execution_count": 1,
 "metadata": {},
 "output_type": "execute_result"
 }
@@ -591,6 +591,38 @@
 "$$ f(x) = Ax + b $$ for a matrix $A$ and vectors $x, b$. The parameters to be learned here are $A$ and $b$. Often, $b$ is referred to as the *bias* term."
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Pytorch and most other deep learning frameworks do things a little differently from traditional linear algebra: they map the rows of the input instead of the columns. That is, the $i$'th row of the output below is the mapping of the $i$'th row of the input under $A$, plus the bias term. Look at the example below."
+]
+},
+{
+"cell_type": "code",
+"execution_count": 4,
+"metadata": {
+"collapsed": false
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Variable containing:\n",
+"-0.0313 0.3256 0.5485\n",
+"-0.2189 -0.0064 -0.0617\n",
+"[torch.FloatTensor of size 2x3]\n",
+"\n"
+]
+}
+],
+"source": [
+"lin = nn.Linear(5, 3) # maps from R^5 to R^3, parameters A, b\n",
+"data = autograd.Variable( torch.randn(2, 5) ) # data is 2x5. A maps from 5 to 3... can we map \"data\" under A?\n",
+"print lin(data) # yes"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -613,6 +645,39 @@
 "A quick note: although you may have learned some neural networks in your intro to AI class where $\\sigma(x)$ was the default non-linearity, typically people shy away from it in practice. This is because the gradient *vanishes* very quickly as the absolute value of the argument grows. Small gradients mean it is hard to learn. Most people default to tanh or ReLU."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": 5,
+"metadata": {
+"collapsed": false
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Variable containing:\n",
+"-0.2067 1.0672\n",
+" 0.1732 -0.6873\n",
+"[torch.FloatTensor of size 2x2]\n",
+"\n",
+"Variable containing:\n",
+" 0.0000 1.0672\n",
+" 0.1732 0.0000\n",
+"[torch.FloatTensor of size 2x2]\n",
+"\n"
+]
+}
+],
+"source": [
+"# In pytorch, most non-linearities are in torch.nn.functional (we have it imported as F)\n",
+"# Note that non-linearities typically don't have parameters like affine maps do.\n",
+"# That is, they don't have weights that are updated during training.\n",
+"data = autograd.Variable( torch.randn(2, 2) )\n",
+"print data\n",
+"print F.relu(data)"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -625,6 +690,57 @@
 "You could also think of it as just applying an element-wise exponentiation operator to the input (to make everything non-negative) and then dividing by the normalization constant."
 ]
 },
+{
+"cell_type": "code",
+"execution_count": 7,
+"metadata": {
+"collapsed": false
+},
+"outputs": [
+{
+"name": "stdout",
+"output_type": "stream",
+"text": [
+"Variable containing:\n",
+"-0.2443\n",
+"-0.5850\n",
+" 2.0812\n",
+"-0.1186\n",
+" 0.4903\n",
+"[torch.FloatTensor of size 5]\n",
+"\n",
+"Variable containing:\n",
+" 0.0660\n",
+" 0.0469\n",
+" 0.6748\n",
+" 0.0748\n",
+" 0.1375\n",
+"[torch.FloatTensor of size 5]\n",
+"\n",
+"Variable containing:\n",
+" 1\n",
+"[torch.FloatTensor of size 1]\n",
+"\n",
+"Variable containing:\n",
+"-2.7188\n",
+"-3.0595\n",
+"-0.3933\n",
+"-2.5931\n",
+"-1.9841\n",
+"[torch.FloatTensor of size 5]\n",
+"\n"
+]
+}
+],
+"source": [
+"# Softmax is also in torch.nn.functional\n",
+"data = autograd.Variable( torch.randn(5) )\n",
+"print data\n",
+"print F.softmax(data)\n",
+"print F.softmax(data).sum() # Sums to 1 because it is a distribution!\n",
+"print F.log_softmax(data) # there's also log_softmax"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -1699,8 +1815,19 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### Example: An LSTM Language Model\n",
-"TODO"
+"### Exercise: Augmenting the LSTM part-of-speech tagger with character-level features\n",
+"In the example above, each word had an embedding, which served as the input to our sequence model.\n",
+"Let's augment the word embeddings with a representation derived from the characters of the word.\n",
+"We expect that this should help significantly, since character-level information like affixes has\n",
+"a large bearing on part-of-speech. For example, words with the affix *-ly* are almost always tagged as adverbs in English.\n",
+"\n",
+"To do this, let $c_w$ be the character-level representation of word $w$. Let $x_w$ be the word embedding as before.\n",
+"Then the input to our sequence model is the concatenation of $x_w$ and $c_w$. So if $x_w$ has dimension 5, and $c_w$ dimension 3, then our LSTM should accept an input of dimension 8.\n",
+"\n",
+"To get the character-level representation, run an LSTM over the characters of a word, and let $c_w$ be the final hidden state of this LSTM.\n",
+"Hints:\n",
+"* There are going to be two LSTMs in your new model: the original one that outputs POS tag scores, and a new one that outputs a character-level representation of each word.\n",
+"* To do a sequence model over characters, you will have to embed characters. The character embeddings will be the input to the character LSTM."
 ]
 },
 {
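The commit leaves this as an exercise, so the notebook does not include a solution. The sketch below is one possible skeleton, added here only to make the shape bookkeeping concrete; the class name `CharAugmentedTagger`, the argument names, and the default dimensions are all assumptions, chosen so that $x_w$ has dimension 5 and $c_w$ dimension 3, as in the exercise text:

```python
import torch
import torch.nn as nn
import torch.autograd as autograd

class CharAugmentedTagger(nn.Module):
    # Hypothetical skeleton for the exercise: the sentence-level LSTM consumes the
    # concatenation of the word embedding x_w and a character-derived representation c_w.
    def __init__(self, word_vocab_size, char_vocab_size, tagset_size,
                 word_dim=5, char_embed_dim=3, char_hidden_dim=3, hidden_dim=6):
        super(CharAugmentedTagger, self).__init__()
        self.word_embeds = nn.Embedding(word_vocab_size, word_dim)
        self.char_embeds = nn.Embedding(char_vocab_size, char_embed_dim)
        # LSTM #1: runs over the characters of one word; its final hidden state is c_w
        self.char_lstm = nn.LSTM(char_embed_dim, char_hidden_dim)
        # LSTM #2: runs over the sentence; input dimension is 5 + 3 = 8 with these defaults
        self.lstm = nn.LSTM(word_dim + char_hidden_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def _char_rep(self, char_ixs):
        # char_ixs: Variable wrapping a LongTensor of character indices for a single word
        embeds = self.char_embeds(char_ixs).view(len(char_ixs), 1, -1)
        _, (h, _) = self.char_lstm(embeds)
        return h.view(1, -1)  # c_w = final hidden state of the character LSTM

    def forward(self, word_ixs, chars_per_word):
        # word_ixs: Variable of word indices; chars_per_word: list of character-index Variables
        word_embeds = self.word_embeds(word_ixs)                             # len(sent) x 5
        char_reps = torch.cat([self._char_rep(c) for c in chars_per_word])   # len(sent) x 3
        inputs = torch.cat([word_embeds, char_reps], 1).view(len(word_ixs), 1, -1)
        lstm_out, _ = self.lstm(inputs)
        return self.hidden2tag(lstm_out.view(len(word_ixs), -1))  # unnormalized tag scores
```

Training would then mirror the earlier tagger (log-softmax over the scores with an NLL loss); the only new bookkeeping is preparing one character-index tensor per word.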
@@ -1745,12 +1872,25 @@
 "source": [
 "For this section, we will see a full, complicated example of a Bi-LSTM Conditional Random Field for named-entity recognition. Familiarity with CRF's is assumed. Although this name sounds scary, the model is just a CRF in which an LSTM provides the features. This is an advanced model though, far more complicated than any earlier model in this tutorial. If you want to skip it, that is fine.\n",
 "\n",
-"TODO explain BiLSTM CRF Here"
+"Recall that the CRF computes a conditional probability. Let $y$ be a tag sequence and $x$ an input sequence of words. Then we compute\n",
+"$$ P(y|x) = \\frac{\\exp(\\text{Score}(y))}{\\sum_{y'} \\exp(\\text{Score}(y'))} $$\n",
+"\n",
+"where the score is determined by defining some log potentials $\\log \\psi_i(y)$ such that\n",
+"$$ \\text{Score}(y) = \\sum_i \\log \\psi_i(y) $$\n",
+"To make the partition function tractable, the potentials must look only at local features.\n",
+"\n",
+"In the Bi-LSTM CRF, we define two classes of potentials: emission and transition. The emission potential for the word at index $i$ comes from the hidden state of the Bi-LSTM at timestep $i$. The transition scores are stored in a $|T| \\times |T|$ matrix $\\textbf{P}$, where $T$ is the tag set.\n",
+"\n",
+"If the above discussion was too brief, you can check out [this](http://www.cs.columbia.edu/%7Emcollins/crf.pdf) write-up from Michael Collins on CRFs.\n",
+"\n",
+"The example below implements the forward algorithm in log space to compute the partition function, and the Viterbi algorithm to decode. Backpropagation will compute the gradients automatically for us. We don't have to do anything by hand.\n",
+"\n",
+"The implementation is not optimized. If you understand what is going on, you'll quickly see that iterating over the next tag in the forward algorithm could probably be done in one big operation. I wanted the code to be more readable."
 ]
 },
 {
 "cell_type": "code",
-"execution_count": 67,
+"execution_count": 8,
 "metadata": {
 "collapsed": false
 },
@@ -1759,26 +1899,27 @@
 "data": {
 "text/plain": [
 "(Variable containing:\n",
-" 1.8765\n",
+" 3.1984\n",
 " [torch.FloatTensor of size 1], [2, 1, 2])"
 ]
 },
-"execution_count": 67,
+"execution_count": 8,
 "metadata": {},
 "output_type": "execute_result"
 }
 ],
 "source": [
-"# Work in progress. Needs extensive commenting but it runs.\n",
-"\n",
-"\n",
+"# Helper functions to make the code more readable.\n",
 "def to_scalar(var):\n",
+" # returns a python float\n",
 " return var.view(-1).data.tolist()[0]\n",
 "\n",
 "def argmax(vec):\n",
+" # return the argmax as a python int\n",
 " _, idx = torch.max(vec, 1)\n",
 " return to_scalar(idx)\n",
 "\n",
+"# Compute log sum exp in a numerically stable way for the forward algorithm\n",
 "def log_sum_exp(vec):\n",
 " max_score = vec[0][argmax(vec)]\n",
 " max_score_broadcast = max_score.expand(vec.size()[1])\n",
@@ -1797,7 +1938,11 @@
 " \n",
 " self.word_embeds = nn.Embedding(vocab_size, embedding_dim)\n",
 " self.lstm = nn.LSTM(embedding_dim, hidden_dim/2, num_layers=1, bidirectional=True)\n",
+" \n",
+" # Maps the output of the LSTM into tag space.\n",
 " self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)\n",
+" \n",
+" # Matrix of transition parameters. Entry i,j is the score of transitioning *to* i *from* j.\n",
 " self.transitions = nn.Parameter(torch.randn(self.tagset_size, self.tagset_size))\n",
 " \n",
 " self.hidden = self.init_hidden()\n",
@@ -1808,17 +1953,26 @@
 " \n",
 " \n",
 " def _forward_alg(self, feats):\n",
+" # Do the forward algorithm to compute the partition function\n",
 " init_alphas = torch.Tensor(1, self.tagset_size).fill_(-10000.)\n",
+" # START_TAG has all of the score.\n",
 " init_alphas[0][self.tag_to_ix[START_TAG]] = 0.\n",
 " \n",
+" # Wrap in a variable so that we will get automatic backprop\n",
 " forward_var = autograd.Variable(init_alphas)\n",
 " \n",
+" # Iterate through the sentence\n",
 " for feat in feats:\n",
-" alphas_t = []\n",
+" alphas_t = [] # The forward variables at this timestep\n",
 " for next_tag in xrange(self.tagset_size):\n",
+" # broadcast the emission score: it is the same regardless of the previous tag\n",
 " emit_score = feat[next_tag].expand(self.tagset_size)\n",
+" # the ith entry of trans_score is the score of transitioning to next_tag from i\n",
 " trans_score = self.transitions[next_tag]\n",
+" # The ith entry of next_tag_var is the value for the edge (i -> next_tag)\n",
+" # before we do log-sum-exp\n",
 " next_tag_var = forward_var + trans_score + emit_score\n",
+" # The forward variable for this tag is log-sum-exp of all the scores.\n",
 " alphas_t.append(log_sum_exp(next_tag_var))\n",
 " forward_var = torch.cat(alphas_t).view(1, -1)\n",
 " terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]\n",
@@ -1833,6 +1987,7 @@
 " return lstm_feats\n",
 " \n",
 " def _score_sentence(self, feats, tags):\n",
+" # Gives the score of a provided tag sequence\n",
 " score = autograd.Variable( torch.Tensor([0]) )\n",
 " tags = [self.tag_to_ix[START_TAG]] + tags\n",
 " for i, feat in enumerate(feats):\n",
@@ -1870,18 +2025,21 @@
 " best_path.reverse()\n",
 " return best_path, path_score\n",
 " \n",
-" def log_likelihood(self, sentence, tags):\n",
+" def neg_log_likelihood(self, sentence, tags):\n",
 " feats = self._get_lstm_features(sentence)\n",
 " forward_score = self._forward_alg(feats)\n",
 " gold_score = self._score_sentence(feats, tags)\n",
-" return gold_score - forward_score\n",
+" return -(gold_score - forward_score)\n",
 " \n",
-" def forward(self, sentence):\n",
+" def forward(self, sentence): # don't confuse this with _forward_alg above.\n",
 " embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)\n",
+" # Get the emission features from the LSTM\n",
 " lstm_out, self.hidden = self.lstm(embeds)\n",
 " lstm_out = lstm_out.view(len(sentence), self.hidden_dim)\n",
+" # Map into tag space\n",
 " lstm_feats = self.hidden2tag(lstm_out)\n",
 " \n",
+" # Find the best path, given the features.\n",
 " tag_seq, score = self._viterbi_decode(lstm_feats)\n",
 " return score, tag_seq\n",
 "\n",
@@ -1894,21 +2052,12 @@
 "model = BiLSTM_CRF(2, tag_to_ix, 4, 6)\n",
 "model(idxs)\n"
 ]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {
-"collapsed": true
-},
-"outputs": [],
-"source": []
 }
 ],
 "metadata": {
 "kernelspec": {
 "display_name": "Python 2",
-"language": "python2",
+"language": "python",
 "name": "python2"
 },
 "language_info": {

README.md

Lines changed: 2 additions & 3 deletions
@@ -18,11 +18,10 @@ There are plenty of other tutorials out there, but they all seem to have one of
 * Exercise: Continuous Bag-of-Words for learning word embeddings
 7. Sequence modeling and Long Short-Term Memory Networks
 * Example: An LSTM for Part-of-Speech Tagging
-* Exercise: LSTM Language Modeling
+* Exercise: Augmenting the LSTM tagger with character-level features
 8. Advanced: Making Dynamic Decisions
 * Example: Bi-LSTM Conditional Random Field for named-entity recognition
-
+
 # To do:
 * Add decoding to the LSTM POS tagger example
 * Comment the Bi LSTM CRF example and provide more discussion
-* Write the LSTM LM exercise (I might change this: there are tons of LSTM LM examples out there...)
