
Commit 99e3b34

Author: Raul Puri (committed)
experiment scripts+model uploads+small classifier and license fix
1 parent 20a8f34 · commit 99e3b34

18 files changed: +11075 −66 lines

README.md

Lines changed: 32 additions & 24 deletions
@@ -39,8 +39,9 @@ The model's performance as a whole will increase as it processes more data.
  * [Transformer Training Setup](./analysis/reproduction.md#transformer-training-set-up)
  * [FP16 Training](./analysis/reproduction.md#fp16-training)
  * [Large Model Training](./analysis/reproduction.md#going-bigger-with-large-models)
- * [Transfer](./analysis/reproduction.md#transfer)
+ * [Sentiment Transfer](./analysis/reproduction.md#transfer)
  * [Finetuning Classifiers](./analysis/reproduction.md#finetuning-classifiers)
+ * [ELMo Comparison](./analysis/reproduction.md#elmo-comparison)
  * [Data Parallel Scalability](./analysis/scale.md)
  * [PyTorch + GIL](./analysis/scale.md#pytorch-gil)
  * [Open Questions](./analysis/questions.md)
@@ -65,18 +66,24 @@ At this time we only support python3.
  
  ### Pretrained models
  We've included our sentencepiece tokenizer model and vocab as a zip file:
- * [sentencepiece tokenizer](https://drive.google.com/file/d/1fxdSrcnpB4OtzmE3PT7E6CtY6seoSSth/view?usp=sharing) [1MB]
- 
- These models are temporarily deprecated and do not work with current versions of the repo
- We've included our trained 4096-d mlstm models in both fp16 and fp32:
- * [Binary SST]() [329MB]
- * [Binary SST (FP16)]() [163MB]
- * [IMDB]() [329MB]
- * [IMDB (FP16)]() [163MB]
- 
- We've also included our trained 8192-d mlstm models in fp16:
- * [Binary SST (FP16)]() [649 MB]
- * [IMDB (FP16)]() [649 MB]
+ * [sentencepiece tokenizer](https://drive.google.com/open?id=1aw_gKmowfLaGGxSrhRh0jTuC8gWIOtWP) [1MB]
+ 
+ We've included a transformer language model base as well as a 4096-d mLSTM language model base. For examples of how to use these models, please see our [finetuning](#classifier-finetuning) and [transfer](#sentiment-transfer) sections. Even though these models were trained with FP16, they can be used for FP32 training/inference.
+ * [FP16 Transformer LM](https://drive.google.com/file/d/1rQfJkHsVJEI2WgvoHzx5Ooxm0CWSjdYt/view?usp=sharing) [311MB]
+ * [FP16 mLSTM LM](https://drive.google.com/file/d/1EEZCZ_AZX_MlAsV-2GlFqxTT-KaNT3rG/view?usp=sharing) [169MB]
+ 
+ We've also included trained classification models for SST and IMDB binary sentiment classification:
+ * [SST Transformer](https://drive.google.com/file/d/1-lxjFuJm_fQ_DvnxU74-35T_M8WjvrQH/view?usp=sharing) [621MB]
+ * [SST mLSTM](https://drive.google.com/file/d/142dVGcHePvOMSojVYiRxutbYSeLu_9ym/view?usp=sharing) [325MB]
+ * [IMDB mLSTM](https://drive.google.com/open?id=1efsCIWQPsXwmqORZ-qs-JdtxiPOssAss) [325MB]
+ 
+ and classifiers trained on a subset of SemEval emotions corresponding to the 8 Plutchik emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust):
+ * [Transformer](https://drive.google.com/file/d/1rC6LWGNkHaZkuojCEWDqSKcDGwFMBTYZ/view?usp=sharing) [673MB]
+ * [mLSTM](https://drive.google.com/file/d/1ieiWFrYBqzBgGPc3R36x9oL7vlj3lt2F/view?usp=sharing) [433MB]
+ 
+ To use classification models that reproduce results from our original large-batch language modeling paper, please use the following [commit hash and set of models](https://github.com/NVIDIA/sentiment-discovery/tree/7f5ab28918a6fc29318a30f557b9454f0f5cc26a#pretrained-models).
+ 
+ We did not include pretrained models leveraging ELMo. To reproduce our papers' results with ELMo, please see our [available resources](./analysis/reproduction.md#elmo-comparison).
  
  Each file has a dictionary containing a PyTorch `state_dict` consisting of a language model (lm_encoder keys) trained on Amazon reviews and a classifier (classifier key) as well as accompanying `args` necessary to run a model with that `state_dict`.
  
@@ -118,9 +125,9 @@ python3 -m multiproc pretrain.py
  python3 pretrain.py --data ./data/amazon/reviews.json --lazy --loose-json \ #train a model on amazon data
  --text-key reviewText --label-key overall --optim Adam --split 1000,1,1
  python3 pretrain.py --tokenizer-type SentencePieceTokenizer --vocab-size 32000 \ #train a model with our sentencepiece tokenization
- --tokenizer-type bpe --tokenizer-path tokenizer.model
+ --tokenizer-type bpe --tokenizer-path ama_32k_tokenizer.model
  python3 pretrain.py --tokenizer-type SentencePieceTokenizer --vocab-size 32000 \ #train a transformer model with our sentencepiece tokenization
- --tokenizer-type bpe --tokenizer-path tokenizer.model --model transformer \
+ --tokenizer-type bpe --tokenizer-path ama_32k_tokenizer.model --model transformer \
  --decoder-layers 12 --decoder-embed-dim 768 --decoder-ffn-embed-dim 3072 \
  --decoder-learned-pos --decoder-attention-heads 8
  bash ./experiments/train_mlstm_singlenode.sh #run our mLSTM training script on 1 DGX-1V
@@ -140,9 +147,9 @@ Lastly it performs feature selection to try and fit a regression model to the to
  By default only one neuron is used for this second regression.
  
  ```
- python3 transfer.py --load <model>.pt #performs transfer to SST, saves results to `<model>_transfer/` directory
- python3 transfer.py --load <model>.pt --neurons 5 #use 5 neurons for the second regression
- python3 transfer.py --load <model>.pt --fp16 #run model in fp16 for featurization step
+ python3 transfer.py --load mlstm.pt #performs transfer to SST, saves results to `<model>_transfer/` directory
+ python3 transfer.py --load mlstm.pt --neurons 5 #use 5 neurons for the second regression
+ python3 transfer.py --load mlstm.pt --fp16 #run model in fp16 for featurization step
  ```
  
  Expected test accuracy for transferring fully trained mLSTM models to sentiment classification for a given mLSTM hidden size:
@@ -159,15 +166,15 @@ This script supports building arbitrary multilable, multilayer, and multihead pe
  Lastly this script supports automatically selecting classification thresholds from validation performance. To measure validation performance this script includes more complex metrics including: f1-score, Matthews correlation coefficient, Jaccard index, recall, precision, and accuracy.
  
  ```
- python3 finetune_classifier.py --load <model>.pt --lr 2e-5 --aux-lm-loss --aux-lm-loss-weight .02 #finetune mLSTM model on sst (default dataset) with auxiliary loss
- python3 finetune_classifier.py --load <model>.pt --automatic-thresholding --threshold-metric f1 #finetune mLSTM model on sst and automatically select classification thresholds based on the validation f1 score
+ python3 finetune_classifier.py --load mlstm.pt --lr 2e-5 --aux-lm-loss --aux-lm-loss-weight .02 #finetune mLSTM model on sst (default dataset) with auxiliary loss
+ python3 finetune_classifier.py --load mlstm.pt --automatic-thresholding --threshold-metric f1 #finetune mLSTM model on sst and automatically select classification thresholds based on the validation f1 score
  python3 finetune_classifier.py --tokenizer-type SentencePieceTokenizer --vocab-size 32000 \ #finetune transformer with sentencepiece on SST
- --tokenizer-type bpe --tokenizer-path tokenizer.model --model transformer --lr 2e-5 \
+ --tokenizer-type bpe --tokenizer-path ama_32k_tokenizer.model --model transformer --lr 2e-5 \
  --decoder-layers 12 --decoder-embed-dim 768 --decoder-ffn-embed-dim 3072 \
- --decoder-learned-pos --decoder-attention-heads 8 --load <model>.pt --use-final-embed
+ --decoder-learned-pos --decoder-attention-heads 8 --load transformer.pt --use-final-embed
  python3 finetune_classifier.py --automatic-thresholding --non-binary-cols l1 l2 l3 --lr 2e-5 \ #finetune multilayer classifier with 3 classes and 4 heads per class on some custom dataset and automatically select classification thresholds
  --classifier-hidden-layers 2048 1024 3 --heads-per-class 4 --aux-head-variance-loss-weight 1. #`aux-head-variance-loss-weight` is an auxiliary loss to increase the variance between each of the 4 heads' weights
- --data <custom_train>.csv --val <custom_val>.csv --test <custom_test>.csv --load <model>.pt
+ --data <custom_train>.csv --val <custom_val>.csv --test <custom_test>.csv --load mlstm.pt
  bash ./experiments/se_transformer_multihead.sh #finetune a multihead transformer on 8 semeval categories
  ```
  
@@ -186,8 +193,9 @@ Additional documentation of the command line arguments available for `finetune_c
  * [Transformer Training Setup](./analysis/reproduction.md#transformer-training-set-up)
  * [FP16 Training](./analysis/reproduction.md#fp16-training)
  * [Large Model Training](./analysis/reproduction.md#going-bigger-with-large-models)
- * [Transfer](./analysis/reproduction.md#transfer)
+ * [Sentiment Transfer](./analysis/reproduction.md#transfer)
  * [Finetuning Classifiers](./analysis/reproduction.md#finetuning-classifiers)
+ * [ELMo Comparison](./analysis/reproduction.md#elmo-comparison)
  * [Data Parallel Scalability](./analysis/scale.md)
  * [PyTorch + GIL](./analysis/scale.md#pytorch-gil)
  * [Open Questions](./analysis/questions.md)

analysis/reproduction.md

Lines changed: 16 additions & 3 deletions
@@ -11,7 +11,7 @@ Contrary to results in the OpenAI work the validation reconstruction loss is low
  ### mLSTM Training Set Up
  It took several cycles of trial and error to come up with a result comparable to the original. Some things were not entirely apparent from the paper; key model details were often hidden in one line and took several tries to get right. Other minutiae were found out independently. We've included what we found to work well.
  * **Model**: 4096-d mLSTM, 64-d embedding, 256-d output. (we also trained a similarly parameterized lstm)
- * **Weight Norm**: applied only to lstm parameters (hidden->hidden/gate weights), not embedding or output.
+ * **Weight Norm**: Applied only to lstm parameters (hidden->hidden/gate weights), not embedding or output.
  * **Optimizer**: Adam
  * **Learning Rate**: 5e-4 per batch of 128. Linear learning rate decay to 0 over course of epoch.
  * **Gradient Clipping**: We occasionally ran into problems with destabilizing gradient explosions. Therefore, we clipped our gradients to a maximum of `1.`.
@@ -23,11 +23,12 @@ It took several cycles of trial and error to come up with a result comparable to
  * **Hardware**: 8 Volta-class GPUs
  * **Learning Rate Scaling**: We took cues from recent work in training ImageNet at scale and leveraged [FAIR's (Goyal et al. 2017)](https://arxiv.org/pdf/1706.02677.pdf) linear scaling rule. However, after sufficient experimentation we found that learning rate scaling did not work well at all batch sizes, so we capped our max learning rate at 3e-3. We also found that using a linear decay rate over 100k steps for global batch sizes greater than 2048 worked well in our case.
  * **Training Time**: With FP16 training it takes approximately 17 hours to train.
- * **Training command**: To run this training experiment run `./experiments/run_mlstm_singlenode.sh`.
+ * **Training command**: To run this training experiment run `./experiments/train_mlstm_singlenode.sh`.
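A minimal PyTorch sketch of the optimizer, clipping, and decay recipe listed above (hyperparameters only; `model` is a stand-in, not the repo's 4096-d mLSTM implementation):

```
import torch
from torch import nn

model = nn.LSTM(64, 4096)  # stand-in recurrent module with the listed dimensions

# Weight norm on the LSTM's gate weights only, not on embedding or output parameters.
nn.utils.weight_norm(model, name="weight_hh_l0")
nn.utils.weight_norm(model, name="weight_ih_l0")

optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # 5e-4 per batch of 128

# Linear decay of the learning rate to 0 over one epoch (step count assumed here).
steps_per_epoch = 100000
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / steps_per_epoch))

def training_step(loss):
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients to a max norm of 1. to guard against the explosions noted above.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
```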

  ### Transformer Training Set Up
  The transformer model has demonstrated its capabilities in recent work as a state-of-the-art language model for natural language understanding. We similarly leveraged the transformer in our work on [Practical Text Classification With Large Pre-Trained Language Models](https://arxiv.org/abs/1812.01207). The transformer we used was pre-trained as follows.
  * **Model**: Transformer with 12 layers, 8 attention heads, hidden size of 768, and an embedding size of 3072. Positional embeddings up to length 256 were used.
+ * **Weight Norm**: Applied only to transformer and output head parameters, not embedding parameters.
  * **Optimizer**: Adam
  * **Learning Rate**: 1e-4 with cosine annealing schedule
  * **Data set**: Aggressively Deduplicated Amazon Review dataset with 1000/1/1 train/test/validation shards. Each of the three sets is internally shuffled.
@@ -37,7 +38,7 @@ The transformer model has demonstrated its capabilities in recent work as a stat
  * **Hardware**: 1 DGX-1V with 8 V100 GPUs
  * **Learning Rate Scaling**: In our experience we found that learning rate scaling as a function of available compute did not help train our transformer, and that a learning rate of 1e-4 across all global batch sizes was simple and performed well.
  * **Training time**: With FP16 training it takes approximately 3 days to train.
- * **Training command**: To run this training experiment run `./experiments/run_transformer_singlenode.sh`.
+ * **Training command**: To run this training experiment run `./experiments/train_transformer_singlenode.sh`.
  
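For intuition, the backbone described above can be approximated with PyTorch's built-in transformer layers (an illustration only, not the repo's implementation): 12 layers, 8 heads, 768-d model dimension, 3072-d feedforward, and a causal mask so it behaves as a left-to-right language model.

```
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=768, nhead=8, dim_feedforward=3072)
backbone = nn.TransformerEncoder(layer, num_layers=12)

seq_len = 256                                   # matches the positional embedding limit
tokens = torch.randn(seq_len, 1, 768)           # (seq_len, batch, d_model) after embedding
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

hidden_states = backbone(tokens, mask=causal_mask)
print(hidden_states.shape)                      # torch.Size([256, 1, 768])
```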

  ## FP16 Training
@@ -89,6 +90,18 @@ Results should line up approximately with below.
  
  ![SemEval Classifier Results](../figures/semeval_results.png)
  
+ ## ELMo Comparison
+ To analyze how our pretraining, transfer, and finetuning methods stack up to other state-of-the-art models and techniques, we utilize the publicly available ELMo language model as a baseline. In order to reproduce our results with ELMo, please switch to the [ELMo branch](https://github.com/NVIDIA/sentiment-discovery/tree/elmo).
+ 
+ To train a text classifier with ELMo, we utilize ELMo as a language model to encode text, and the resulting features are passed to a classifier. The classifier can either be a simple linear layer or a more complex multilayer perceptron. Training can either be performed end to end on the classifier and language model jointly, or in a transfer learning setting where only the classifier is trained via logistic regression or SGD.
+ 
+ The following training scripts are capable of reproducing our results with ELMo on SST and the SemEval benchmark challenge. In order to run these scripts you must follow the installation instructions in AllenNLP's [ELMo repository](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md). Note that for finetuning we did not use an auxiliary language modeling loss, as ELMo is bidirectional and cannot normally perform left-to-right language modeling.
+ 
+ ```
+ bash ./run_elmo_sk_sst.sh #trains a logistic regression classifier on SST with ELMo
+ bash ./run_elmo_se_multihead.sh #end to end finetuning of ELMo and a multihead MLP on 8 SemEval categories
+ ```
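The transfer-learning variant described above can be sketched with AllenNLP's `ElmoEmbedder` (an assumption about the setup; the actual experiments live in the scripts above on the ELMo branch): encode each review with ELMo, pool the token vectors into a fixed-size feature, and fit a logistic regression classifier on those features.

```
import numpy as np
from allennlp.commands.elmo import ElmoEmbedder
from sklearn.linear_model import LogisticRegression

elmo = ElmoEmbedder()  # downloads the default pretrained ELMo weights

def featurize(tokenized_reviews):
    # embed_sentence returns a (3, num_tokens, 1024) array: one 1024-d vector per
    # token for each of ELMo's three layers. Average over layers and tokens to get
    # a single feature vector per review.
    return np.stack([elmo.embed_sentence(tokens).mean(axis=(0, 1))
                     for tokens in tokenized_reviews])

# Hypothetical pre-tokenized reviews and binary sentiment labels.
train_tokens = [["a", "wonderful", "film"], ["utterly", "boring", "and", "slow"]]
train_labels = [1, 0]

clf = LogisticRegression().fit(featurize(train_tokens), train_labels)
```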
+ 
  ------
  
  [<- Why Unsupervised Language Modeling?](./unsupervised.md) | [Data Parallel Scalability ->](./scale.md)
