485 | 485 | "source": [
486 | 486 | "### Image feature extractor\n",
487 | 487 | "\n",
488 |     | - "You will use an image model (pretrained on imagenet) to extract the features from each image. The model was trained as an image classifier, but setting `include_top=False` returns the model without the final classification layer, so you can use the last layer of feature-maps: \n",
489 |     | - "\n",
490 |     | - "\n"
    | 488 | + "You will use an image model (pretrained on imagenet) to extract the features from each image. The model was trained as an image classifier, but setting `include_top=False` returns the model without the final classification layer, so you can use the last layer of feature-maps: \n"
491 | 489 | ]
492 | 490 | },
493 | 491 | {
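The `include_top=False` idea described in this cell can be sketched as follows. This is a minimal sketch, assuming TensorFlow/Keras; the hunk does not name the backbone, so `MobileNetV3Small` and the 224×224 input size are assumptions:

```python
import tensorflow as tf

# Sketch only: the diff above does not name the backbone, so MobileNetV3Small
# and the input size are assumptions.
IMAGE_SHAPE = (224, 224, 3)
feature_extractor = tf.keras.applications.MobileNetV3Small(
    input_shape=IMAGE_SHAPE,
    include_top=False,      # drop the final classification layer
    weights='imagenet')     # ImageNet-pretrained weights
feature_extractor.trainable = False

# The truncated model maps each image to a spatial grid of feature vectors
# (the "last layer of feature-maps") instead of class probabilities.
features = feature_extractor(tf.zeros([1, *IMAGE_SHAPE]))
```

Because the classification head is dropped, the output keeps its spatial layout, which is what the decoder later attends over.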

1052 | 1050 | "id": "qiRXWwIKNybB"
1053 | 1051 | },
1054 | 1052 | "source": [
1055 |      | - "\n",
1056 |      | - "\n",
1057 | 1053 | "The model will be implemented in three main parts: \n",
1058 | 1054 | "\n",
1059 | 1055 | "1. Input - The token embedding and positional encoding (`SeqEmbedding`).\n",

1163 | 1159 | "    attn = self.mha(query=x, value=x,\n",
1164 | 1160 | "                    use_causal_mask=True)\n",
1165 | 1161 | "    x = self.add([x, attn])\n",
1166 |      | - "    return self.layernorm(x)\n",
1167 |      | - "\n"
     | 1162 | + "    return self.layernorm(x)\n"
1168 | 1163 | ]
1169 | 1164 | },
1170 | 1165 | {
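The lines in this hunk are the tail of a causal self-attention block. Restored as a complete layer it might look like this; the `mha`/`add`/`layernorm` calls match the diff, while the class name and constructor arguments are assumptions:

```python
import tensorflow as tf

class CausalSelfAttention(tf.keras.layers.Layer):
  """Residual causal self-attention, reconstructed around the diff above."""
  def __init__(self, **kwargs):
    super().__init__()
    self.mha = tf.keras.layers.MultiHeadAttention(**kwargs)
    self.add = tf.keras.layers.Add()
    self.layernorm = tf.keras.layers.LayerNormalization()

  def call(self, x):
    # use_causal_mask keeps each position from attending to later tokens.
    attn = self.mha(query=x, value=x,
                    use_causal_mask=True)
    x = self.add([x, attn])   # residual connection
    return self.layernorm(x)
```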

1304 | 1299 | "id": "6WQD87efena5"
1305 | 1300 | },
1306 | 1301 | "source": [
1307 |      | - "\n",
1308 |      | - "\n",
1309 | 1302 | "But there are a few other features you can add to make this work a little better:\n",
1310 | 1303 | "1311 | 1304 | "1. **Handle bad tokens**: The model will be generating text. It should\n",
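One common way to handle bad tokens during generation is to forbid them before sampling. The helper below is hypothetical (the hunk is truncated before the notebook's own approach); it assumes you mask disallowed ids, such as padding or `[UNK]`, by setting their logits to negative infinity:

```python
import numpy as np

def mask_bad_tokens(logits, bad_token_ids=(0, 1)):
    """Hypothetical helper: push the logits of disallowed token ids to
    -inf so argmax/sampling can never select them. The ids for padding
    and [UNK] here are assumptions."""
    logits = np.array(logits, dtype=np.float32)
    logits[..., list(bad_token_ids)] = -np.inf
    return logits
```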

1483 | 1476 | "1. Flatten the extracted image features, so they can be input to the decoder layers.\n",
1484 | 1477 | "2. Look up the token embeddings.\n",
1485 | 1478 | "3. Run the stack of `DecoderLayer`s, on the image features and text embeddings.\n",
1486 |      | - "4. Run the output layer to predict the next token at each position.\n",
1487 |      | - "\n"
     | 1479 | + "4. Run the output layer to predict the next token at each position.\n"
1488 | 1480 | ]
1489 | 1481 | },
1490 | 1482 | {
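The four steps above can be sketched end to end. This is a compressed sketch: the `DecoderLayer` name follows the text, but its internals, the layer sizes, and the `Captioner` class name are assumptions:

```python
import tensorflow as tf

class DecoderLayer(tf.keras.layers.Layer):
  """Stand-in for the notebook's DecoderLayer (internals are assumptions)."""
  def __init__(self, depth, num_heads=2):
    super().__init__()
    self.self_attn = tf.keras.layers.MultiHeadAttention(num_heads, depth)
    self.cross_attn = tf.keras.layers.MultiHeadAttention(num_heads, depth)
    self.norm = tf.keras.layers.LayerNormalization()

  def call(self, txt, img):
    txt = txt + self.self_attn(query=txt, value=txt, use_causal_mask=True)
    txt = txt + self.cross_attn(query=txt, value=img)
    return self.norm(txt)

class Captioner(tf.keras.Model):
  def __init__(self, vocab_size=5000, depth=64, num_layers=2):
    super().__init__()
    self.img_proj = tf.keras.layers.Dense(depth)
    self.embed = tf.keras.layers.Embedding(vocab_size, depth)
    self.decoder_layers = [DecoderLayer(depth) for _ in range(num_layers)]
    self.output_layer = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs):
    image_features, tokens = inputs
    # 1. Flatten the (h, w, channels) feature grid into a sequence.
    img = tf.reshape(image_features,
                     [tf.shape(image_features)[0], -1,
                      image_features.shape[-1]])
    img = self.img_proj(img)
    # 2. Look up the token embeddings.
    txt = self.embed(tokens)
    # 3. Run the stack of DecoderLayers on image features and embeddings.
    for layer in self.decoder_layers:
      txt = layer(txt, img)
    # 4. Run the output layer to predict the next token at each position.
    return self.output_layer(txt)
```

Predicting a token at *every* position (rather than just the last) lets the whole caption be trained in one pass with teacher forcing.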

2143 | 2135 | "colab": {
2144 | 2136 | "collapsed_sections": [],
2145 | 2137 | "name": "image_captioning.ipynb",
2146 |      | - "private_outputs": true,
2147 |      | - "provenance": [],
2148 | 2138 | "toc_visible": true
2149 | 2139 | },
2150 | 2140 | "kernelspec": {