# MAE

[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

## Introduction

<!-- [BACKBONE] -->

<a href="https://github.com/facebookresearch/mae">Official Repo</a>

<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#46">Code Snippet</a>

## Abstract

<!-- [ABSTRACT] -->

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

<!-- [IMAGE] -->
<div align=center>
<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>
</div>

## Citation

```bibtex
@article{he2021masked,
  title={Masked autoencoders are scalable vision learners},
  author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv preprint arXiv:2111.06377},
  year={2021}
}
```

## Usage

To use pre-trained models from other repositories, you need to convert the checkpoint keys first.

We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the keys of an MAE model from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.

```shell
python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```

E.g.

```shell
python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth
```

This script converts the model from `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.

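For reference, such a conversion essentially loads the official checkpoint, renames the backbone keys, and saves the result. Below is a minimal sketch of that idea; the function name and the key renames are illustrative assumptions, and the exact mapping used by `beit2mmseg.py` may differ.

```python
# Illustrative sketch of a checkpoint key conversion (not the exact mapping
# used by beit2mmseg.py; the key names below are assumptions).
from collections import OrderedDict

import torch


def convert_keys(state_dict):
    """Rename official MAE keys to MMSegmentation-style names."""
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith('blocks.'):
            # e.g. 'blocks.0.norm1.weight' -> 'layers.0.ln1.weight' (illustrative)
            k = k.replace('blocks.', 'layers.').replace('norm1', 'ln1').replace('norm2', 'ln2')
        new_state_dict[k] = v
    return new_state_dict


ckpt = torch.load('mae_pretrain_vit_base.pth', map_location='cpu')
state_dict = ckpt.get('model', ckpt)  # official MAE checkpoints store weights under 'model'
torch.save(convert_keys(state_dict), 'pretrain/mae_pretrain_vit_base_mmcls.pth')
```
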
In our default setting, the pretrained models and their corresponding original models are listed below:

| pretrained models               | original models                                                                                 |
| ------------------------------- | ----------------------------------------------------------------------------------------------- |
| mae_pretrain_vit_base_mmcls.pth | [mae_pretrain_vit_base](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) |

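The converted checkpoint is then referenced from the config through the `pretrained` field. The snippet below is only an illustration of that mechanism, assuming the path produced by the conversion command above; the shipped MAE config is expected to set this already.

```python
# Hypothetical config override: point the backbone at the converted checkpoint.
# Shown only to illustrate how the converted weights are picked up.
_base_ = './upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py'

model = dict(pretrained='pretrain/mae_pretrain_vit_base_mmcls.pth')
```
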
Verify the single-scale results of the model:

```shell
sh tools/dist_test.sh \
configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

Since the relative position embedding requires the input height and width to be equal, a sliding window is adopted for multi-scale inference, and `min_size=512` is set so that the shortest edge is 512. Multi-scale inference is therefore run with a separate config rather than with `--aug-test`. For multi-scale inference:

```shell
sh tools/dist_test.sh \
configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py \
upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```
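
The multi-scale config differs from the single-scale one mainly in the test pipeline (multiple image ratios with `min_size=512`) and the sliding-window test mode. The sketch below shows how such a config is typically assembled in MMSegmentation 0.x; the concrete scales, crop size, and stride here are assumptions, so consult the shipped `upernet_mae-base_fp16_512x512_160k_ade20k_ms.py` for the real values.

```python
# Rough sketch of a multi-scale, sliding-window test config (values assumed).
_base_ = './upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py'

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=True,
        transforms=[
            # min_size keeps the shortest edge at 512 so the sliding window fits.
            dict(type='Resize', keep_ratio=True, min_size=512),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ]),
]
# Sliding-window inference over 512x512 crops instead of whole-image inference.
model = dict(test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))
data = dict(val=dict(pipeline=test_pipeline), test=dict(pipeline=test_pipeline))
```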

## Results and models

### ADE20K

| Method  | Backbone | Crop Size | pretrain    | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU  | mIoU(ms+flip) | config | download |
| ------- | -------- | --------- | ----------- | ----------------- | ---------- | ------- | -------- | -------------- | ----- | ------------- | ------ | -------- |
| UperNet | ViT-B    | 512x512   | ImageNet-1K | 224x224           | 16         | 160000  | 9.96     | 7.14           | 48.13 | 48.70         | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) \| [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) |