
Commit 5442dfb

compressive transformer links
1 parent 969df71 commit 5442dfb


8 files changed, +64 -9 lines changed

docs/index.html

Lines changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ <h4>✨ <a href="transformers/index.html">Transformers</a></h4>
  <li><a href="transformers/xl/relative_mha.html">Relative multi-headed attention</a></li>
  </ul>
  </li>
+ <li><a href="transformers/compressive/index.html">Compressive Transformer</a></li>
  <li><a href="transformers/gpt/index.html">GPT Architecture</a></li>
  <li><a href="transformers/glu_variants/simple.html">GLU Variants</a></li>
  <li><a href="transformers/knn/index.html">kNN-LM: Generalization through Memorization</a></li>

docs/transformers/index.html

Lines changed: 8 additions & 4 deletions
@@ -84,6 +84,10 @@ <h1>Transformers</h1>
  <h2><a href="xl/index.html">Transformer XL</a></h2>
  <p>This implements Transformer XL model using
  <a href="xl/relative_mha.html">relative multi-head attention</a></p>
+ <h2><a href="compressive/index.html">Compressive Transformer</a></h2>
+ <p>This is an implementation of the compressive transformer,
+ which extends <a href="xl/index.html">Transformer XL</a> by compressing
+ the oldest memories to give a longer attention span.</p>
  <h2><a href="gpt">GPT Architecture</a></h2>
  <p>This is an implementation of GPT-2 architecture.</p>
  <h2><a href="glu_variants/simple.html">GLU Variants</a></h2>
@@ -102,10 +106,10 @@ <h2><a href="switch">Switch Transformer</a></h2>
  It does single GPU training but we implement the concept of switching as described in the paper.</p>
  </div>
  <div class='code'>
- <div class="highlight"><pre>52  from .configs import TransformerConfigs
-   53  from .models import TransformerLayer, Encoder, Decoder, Generator, EncoderDecoder
-   54  from .mha import MultiHeadAttention
-   55  from labml_nn.transformers.xl.relative_mha import RelativeMultiHeadAttention</pre></div>
+ <div class="highlight"><pre>57  from .configs import TransformerConfigs
+   58  from .models import TransformerLayer, Encoder, Decoder, Generator, EncoderDecoder
+   59  from .mha import MultiHeadAttention
+   60  from labml_nn.transformers.xl.relative_mha import RelativeMultiHeadAttention</pre></div>
  </div>
  </div>
  </div>

docs/transformers/xl/readme.html

Lines changed: 2 additions & 2 deletions
@@ -87,8 +87,8 @@ <h1><a href="https://nn.labml.ai/transformers/xl/index.html">Transformer XL</a><
  the same positions as the current context.
  They introduce relative positional encoding, where the positional encodings
  are introduced at the attention calculation.</p>
- <p>Annotated implementation of relative multi-headed attention is in <a href="relative_mha.html"><code>relative_mha.py</code></a>.</p>
- <p>Here&rsquo;s <a href="experiment.html">the training code</a> and a notebook for training a transformer XL model on Tiny Shakespeare dataset.</p>
+ <p>Annotated implementation of relative multi-headed attention is in <a href="https://nn.labml.ai/transformers/xl/relative_mha.html"><code>relative_mha.py</code></a>.</p>
+ <p>Here&rsquo;s <a href="https://nn.labml.ai/transformers/xl/experiment.html">the training code</a> and a notebook for training a transformer XL model on Tiny Shakespeare dataset.</p>
  <p><a href="https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb"><img alt="Open In Colab" src="https://colab.research.google.com/assets/colab-badge.svg" /></a>
  <a href="https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002"><img alt="View Run" src="https://img.shields.io/badge/labml-experiment-brightgreen" /></a></p>
  </div>

labml_nn/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -19,6 +19,7 @@
  * [Transformer building blocks](transformers/models.html)
  * [Transformer XL](transformers/xl/index.html)
  * [Relative multi-headed attention](transformers/xl/relative_mha.html)
+ * [Compressive Transformer](transformers/compressive/index.html)
  * [GPT Architecture](transformers/gpt/index.html)
  * [GLU Variants](transformers/glu_variants/simple.html)
  * [kNN-LM: Generalization through Memorization](transformers/knn/index.html)

labml_nn/transformers/__init__.py

Lines changed: 6 additions & 1 deletion
@@ -21,6 +21,12 @@
  This implements Transformer XL model using
  [relative multi-head attention](xl/relative_mha.html)

+ ## [Compressive Transformer](compressive/index.html)
+
+ This is an implementation of the compressive transformer,
+ which extends [Transformer XL](xl/index.html) by compressing
+ the oldest memories to give a longer attention span.
+
  ## [GPT Architecture](gpt)

  This is an implementation of GPT-2 architecture.

@@ -30,7 +36,6 @@
  This is an implementation of the paper
  [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202).

-
  ## [kNN-LM](knn)

  This is an implementation of the paper

Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+ # [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)
+
+ This is an implementation of
+ [Compressive Transformers for Long-Range Sequence Modelling](https://arxiv.org/abs/1911.05507)
+ in [PyTorch](https://pytorch.org).
+
+ This is an extension of [Transformer XL](https://nn.labml.ai/transformers/xl/index.html) where past memories
+ are compressed to give a longer attention range.
+ That is, the furthest $n_{cm} c$ memories are compressed into
+ $n_{cm}$ memories, where $c$ is the compression rate.
+
+ ## Compression operation
+
+ The compression operation is defined as
+ $f_c: \mathbb{R}^{nc \times d} \rightarrow \mathbb{R}^{n \times d}$.
+ The paper introduces multiple choices for $f_c$; we have implemented only
+ 1D convolution, which seems to give the best results.
+ Each layer has a separate compression operation $f_c^{(i)}$, where
+ $i$ is the layer number.
+
+ ## Training the compression operation
+
+ Since training compression with BPTT requires maintaining
+ a very large computational graph (many time steps), the paper proposes
+ an *auto-encoding loss* and an *attention reconstruction loss*.
+ The auto-encoding loss decodes the original memories from the compressed memories
+ and calculates the loss.
+ The attention reconstruction loss computes the multi-headed attention results
+ on the compressed memory and on the uncompressed memory, and takes the mean squared error
+ between them.
+ We have implemented the latter here since it gives better results.
+
+ This implementation uses pre-layer normalization,
+ while the paper uses post-layer normalization.
+ Pre-layer norm does the layer norm before the [FFN](../feedforward.html) and
+ self-attention, and the pass-through in the residual connection is not normalized.
+ This is supposed to be more stable in standard transformer setups.
+
+ Here are [the training code](https://nn.labml.ai/transformers/compressive/experiment.html) and a notebook for training a compressive transformer
+ model on the Tiny Shakespeare dataset.
+
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/compressive/experiment.ipynb)
+ [![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=0d9b5338726c11ebb7c80242ac1c0002)
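
To make the compression operation and the attention reconstruction loss described in the readme above concrete, here is a minimal PyTorch sketch. It is not the implementation this commit links to: the names `ConvCompression` and `attention_reconstruction_loss`, the use of `torch.nn.MultiheadAttention` as a stand-in attention module, and the `[seq_len, batch, d_model]` shape convention are assumptions made for illustration.

```python
# Hedged sketch (not the labml_nn code): a per-layer 1D-convolution compression f_c
# and the attention reconstruction loss used to train it.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvCompression(nn.Module):
    """f_c: R^{(n*c) x d} -> R^{n x d} via a strided 1D convolution."""

    def __init__(self, d_model: int, compression_rate: int):
        super().__init__()
        # Kernel size and stride both equal the compression rate c, so every
        # c consecutive memory vectors are squashed into a single vector.
        self.conv = nn.Conv1d(d_model, d_model,
                              kernel_size=compression_rate,
                              stride=compression_rate)

    def forward(self, mem: torch.Tensor) -> torch.Tensor:
        # mem: [mem_len, batch, d_model] -> Conv1d expects [batch, d_model, mem_len]
        x = mem.permute(1, 2, 0)
        x = self.conv(x)
        # back to [mem_len // c, batch, d_model]
        return x.permute(2, 0, 1)


def attention_reconstruction_loss(attn: nn.MultiheadAttention,
                                  h: torch.Tensor,
                                  mem: torch.Tensor,
                                  c_mem: torch.Tensor) -> torch.Tensor:
    """MSE between attention over the raw memories and attention over the
    compressed memories. Per the paper this loss should update only the
    compression function; this sketch only detaches the target, so freeze or
    detach the attention parameters separately in a real setup."""
    with torch.no_grad():
        target, _ = attn(h, mem, mem, need_weights=False)   # attention on raw memories
    pred, _ = attn(h, c_mem, c_mem, need_weights=False)      # attention on compressed memories
    return F.mse_loss(pred, target)


if __name__ == "__main__":
    # Tiny usage example with assumed sizes.
    d_model, heads, c = 64, 4, 4
    seq, mem_len, batch = 8, 16, 2
    f_c = ConvCompression(d_model, c)
    attn = nn.MultiheadAttention(d_model, heads)   # stand-in attention module
    h = torch.randn(seq, batch, d_model)           # current hidden states
    mem = torch.randn(mem_len, batch, d_model)     # oldest memories to be compressed
    c_mem = f_c(mem)                               # [mem_len // c, batch, d_model] = [4, 2, 64]
    loss = attention_reconstruction_loss(attn, h, mem, c_mem)
    loss.backward()                                # gradients reach f_c (and attn, unless frozen)
    print(c_mem.shape, float(loss))
```

With a compression rate of c = 4, the 16 oldest memory vectors collapse into 4, which is how the compressed memory buys a longer attention range.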

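The readme's pre-layer-norm versus post-layer-norm remark can be sketched just as briefly. `PreNormBlock` and `PostNormBlock` below are illustrative names, not modules from this repository, and `sublayer` stands for either self-attention or the FFN.

```python
# Hedged sketch of pre-LN vs. post-LN residual blocks.
import torch
import torch.nn as nn


class PreNormBlock(nn.Module):
    """Pre-LN: normalize the sublayer input; the residual pass-through stays un-normalized."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))


class PostNormBlock(nn.Module):
    """Post-LN (as in the paper): normalize after adding the residual."""

    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))


if __name__ == "__main__":
    x = torch.randn(8, 2, 64)
    ffn = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
    print(PreNormBlock(64, ffn)(x).shape, PostNormBlock(64, ffn)(x).shape)
```
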
labml_nn/transformers/xl/readme.md

Lines changed: 2 additions & 2 deletions
@@ -16,9 +16,9 @@ the same positions as the current context.
  They introduce relative positional encoding, where the positional encodings
  are introduced at the attention calculation.

- Annotated implementation of relative multi-headed attention is in [`relative_mha.py`](relative_mha.html).
+ Annotated implementation of relative multi-headed attention is in [`relative_mha.py`](https://nn.labml.ai/transformers/xl/relative_mha.html).

- Here's [the training code](experiment.html) and a notebook for training a transformer XL model on Tiny Shakespeare dataset.
+ Here's [the training code](https://nn.labml.ai/transformers/xl/experiment.html) and a notebook for training a transformer XL model on Tiny Shakespeare dataset.

  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/lab-ml/nn/blob/master/labml_nn/transformers/xl/experiment.ipynb)
  [![View Run](https://img.shields.io/badge/labml-experiment-brightgreen)](https://web.lab-ml.com/run?uuid=d3b6760c692e11ebb6a70242ac1c0002)

readme.md

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ implementations almost weekly.
  * [Transformer building blocks](https://nn.labml.ai/transformers/models.html)
  * [Transformer XL](https://nn.labml.ai/transformers/xl/index.html)
  * [Relative multi-headed attention](https://nn.labml.ai/transformers/xl/relative_mha.html)
+ * [Compressive Transformer](https://nn.labml.ai/transformers/compressive/index.html)
  * [GPT Architecture](https://nn.labml.ai/transformers/gpt/index.html)
  * [GLU Variants](https://nn.labml.ai/transformers/glu_variants/simple.html)
  * [kNN-LM: Generalization through Memorization](https://nn.labml.ai/transformers/knn)
