# Symbolic Discovery of Optimization Algorithms

This repository contains PyTorch and JAX implementations of the Lion optimizer discovered by symbolic program search. Lion is also available in [Praxis](https://github.com/google/praxis/blob/main/praxis/optimizers.py).

## Table of Contents

- [Simple, memory efficient, faster runtime](#simple-memory-efficient-faster-runtime)
- Superior performance on various architectures, tasks, and domains
  - [Image classification](#image-classification)
  - [Vision-language contrastive training](#vision-language-contrastive-training)
  - [Diffusion model](#diffusion-model)
  - [Language modeling](#language-modeling)
- [Hyperparameter and batch size choices](#hyperparameter-and-batch-size-choices)

## Simple, memory efficient, faster runtime

Compared to AdamW and various adaptive optimizers that need to save both first and second moments, Lion only needs the momentum, halving the additional memory footprint. This is beneficial when training large models and/or with a large batch size. As an example, AdamW needs at least 16 TPU v4 chips to train a ViT-B/16 with image size 224 and batch size 4,096, while Lion only needs eight.
Another practical benefit is that Lion has a faster runtime (steps/sec) in our experiments thanks to its simplicity, usually a 2-15% speedup over AdamW and Adafactor depending on the task, codebase, and hardware.

<img src="./fig/alg.png" width="80%">
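
The update rule shown in the figure above is compact enough to sketch directly. Below is a minimal, single-tensor PyTorch sketch for illustration (the helper name `lion_step` and the single-tensor framing are ours, not the implementation shipped in this repository); `exp_avg` is the only optimizer state, which is where the memory saving over AdamW comes from.

```python
import torch

def lion_step(param, grad, exp_avg, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update for a single parameter tensor (illustrative sketch)."""
    # Interpolate between the momentum and the current gradient, then take the sign.
    update = (beta1 * exp_avg + (1 - beta1) * grad).sign()
    # Apply decoupled weight decay, then the sign update scaled by the learning rate.
    param.mul_(1 - lr * weight_decay).add_(update, alpha=-lr)
    # Update the momentum (the only state Lion keeps) with the second interpolation factor.
    exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
```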

## Superior performance on various architectures, tasks, and domains

### **Image classification**

- Lion outperforms AdamW on various architectures trained from scratch on ImageNet or pre-trained on ImageNet-21K.

<img src="./fig/i1k.png" width="60%">

- Lion saves up to 5x the pre-training cost on JFT-300M.

<img src="./fig/jft-ft.png" width="100%">

- Results after fine-tuning with higher resolution and Polyak averaging.
The resulting ViT-L/16 matches the previous ViT-H/14 results trained with AdamW while being 2x smaller.
Our ViT-G/14 further achieves 90.71% accuracy on ImageNet.

<img src="./fig/jft.png" width="80%">

### **Vision-language contrastive training**

- On LiT, Lion beats AdamW on zero-shot image classification and image-text retrieval.

<img src="./fig/lit.png" width="45%">
<img src="./fig/retrieval.png" width="100%">

- On BASIC-L, Lion achieves 88.3% zero-shot and 91.1% fine-tuning ImageNet accuracy, surpassing the previous best results by 2% and 0.1%, respectively.

<img src="./fig/basic.png" width="65%">

### **Diffusion model**

- On diffusion models, Lion exceeds AdamW in terms of the FID score and saves up to 2.3x the training compute. Left to right: 64x64, 128x128, and 256x256 image generation trained on ImageNet.

<img src="./fig/diffusion.png" width="100%">

### **Language modeling**

- Lion reaches the same validation perplexity with up to 2x less compute on language modeling (Left: Wiki-40B, Right: PG-19). Lion achieves larger gains on larger Transformers.

<img src="./fig/lm.png" width="70%">

- Lion achieves better average in-context learning ability than Adafactor when training LLMs.

<img src="./fig/llm.png" width="70%">

- Lion is also better when fine-tuning T5 on GLUE.

<img src="./fig/ft.png" width="90%">

## Hyperparameter and batch size choices

- Lion is simpler and has fewer hyperparameters than AdamW and Adafactor, since it requires neither $\epsilon$ nor the factorization-related ones.
To ensure a fair comparison, we tune the peak learning rate $lr$ and the decoupled weight decay $\lambda$ for both AdamW (Adafactor) and Lion on a logarithmic scale.
The default values of $\beta_1$ and $\beta_2$ in AdamW are 0.9 and 0.999, respectively, with an $\epsilon$ of $1e-8$, while in Lion the defaults, discovered through the program search, are 0.9 and 0.99.
We only tune $\beta_1$ and $\beta_2$ on language tasks, where we use $\beta_1=0.9$, $\beta_2=0.99$ for AdamW and $\beta_1=0.95$, $\beta_2=0.98$ for Lion. Additionally, the $\epsilon$ in AdamW is set to $1e-6$ instead of the default $1e-8$, as this improves stability in our experiments, similar to the observations in RoBERTa.
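
  As a rough illustration, these defaults map onto a PyTorch-style constructor as sketched below. The `lion_pytorch` import path and the exact `Lion` signature are assumptions made for this example, so adapt them to the implementation in this repository.

  ```python
  import torch
  from lion_pytorch import Lion  # assumed import path; see the PyTorch implementation in this repo

  model = torch.nn.Linear(768, 10)

  # Defaults discovered by the program search: beta1 = 0.9, beta2 = 0.99; no epsilon is needed.
  optimizer = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=1.0)

  # For language tasks, use beta1 = 0.95 and beta2 = 0.98 instead.
  ```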

- The update generated by Lion is element-wise binary, i.e., $\pm 1$, as a result of the sign operation, so it has a larger norm than updates produced by other optimizers.
Based on our experience, a suitable learning rate for Lion is typically 10x smaller than that for AdamW, although a learning rate that is only 3x smaller sometimes performs slightly better.
Since the effective weight decay is $lr * \lambda$, the value of $\lambda$ used for Lion is 10x larger than that for AdamW in order to maintain a similar strength (a short sanity check on this scaling follows the list below).
For instance,
  - $lr=1e-4$, $\lambda=10.0$ in Lion and $lr=1e-3$, $\lambda=1.0$ in AdamW when training ViT-B/16 on ImageNet with strong augmentations.
  - $lr=3e-5$, $\lambda=0.1$ in Lion and $lr=3e-4$, $\lambda=0.01$ in AdamW for diffusion models.
  - $lr=1e-4$, $\lambda=0.01$ in Lion and $lr=1e-3$, $\lambda=0.001$ in Adafactor for the 7.5B language model.

  Please see our paper for all the hyperparameters.
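
  As a quick sanity check on the scaling above (a sketch, not part of the released code): starting from an AdamW recipe, divide $lr$ by 10 and multiply $\lambda$ by 10, and the effective decay $lr * \lambda$ stays the same.

  ```python
  # AdamW recipe for ViT-B/16 on ImageNet with strong augmentations (from the list above).
  adamw_lr, adamw_wd = 1e-3, 1.0

  # Rule of thumb for Lion: 10x smaller learning rate, 10x larger decoupled weight decay.
  lion_lr, lion_wd = adamw_lr / 10, adamw_wd * 10  # -> 1e-4 and 10.0

  # The effective weight decay lr * lambda is preserved.
  assert abs(adamw_lr * adamw_wd - lion_lr * lion_wd) < 1e-12
  ```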

- Apart from the peak performance, the sensitivity to hyperparameters and the difficulty of tuning them are also critical for the adoption of an optimizer in practice. In the figure below, we alter both $lr$ and $\lambda$ when training ViT-B/16 from scratch on ImageNet. As suggested by the heatmaps, Lion is more robust to different hyperparameter choices than AdamW.

- Some may question whether Lion requires a large batch size to accurately determine the direction due to the noise added by the sign operation. To address this concern, we train a ViT-B/16 model on ImageNet with various batch sizes while keeping the total training length at 300 epochs, with RandAug and Mixup.
As shown in the figure below, the optimal batch size for AdamW is 256, while that for Lion is 4,096.
This indicates that Lion indeed prefers a larger batch size, but its performance remains robust even with a batch size as small as 64.
Furthermore, when the batch size is enlarged to 32K, leaving only 11K training steps,
Lion achieves a significant 2.5% accuracy gain over AdamW (77.9% vs. 75.4%), demonstrating its effectiveness in the large-batch training setting.

<img src="./fig/ablation.png" width="100%">

**Left**: Ablation for the effect of batch size. Lion prefers a larger batch than AdamW.
ImageNet accuracy of ViT-B/16 trained from scratch when we vary $lr$ and $\lambda$ for AdamW (**Middle**) and Lion (**Right**). Lion is more robust to different hyperparameter choices.