
Commit 6563cb5

[Feature]: Add MAE (open-mmlab#1307)
* [Fix]: Fix lint
* [WIP]: Add mae seg config
* [Feature]: Add MAE seg
* [Fix]: Fix mae dataset img scale bug
* [Fix]: Fix lint
* [Feature]: Change mae config to mae_segmentation's config
* [Feature]: Add interpolate pe when loading
* [Fix]: Fix pos_embed not used bug
* [Fix]: Fix lint
* [Fix]: Init rel pos embed with zeros
* [Fix]: Fix lint
* [Fix]: Change the type name of backbone to MAE
* [Fix]: Delete ade20k_512x512.py
* [Fix]: Use mmseg provided ade20k.py
* [Fix]: Change 1 sample per gpu to 2 samples per gpu
* [Fix]: Fix conflict
* [Refactor]: Use the TransformerEncoderLayer of BEiT
* [Feature]: Add UT
* [Fix]: Change the default value of qv bias to False
* [Fix]: Initialize relative pos table with zeros
* [Fix]: Delete redundant code in mae
* [Fix]: Fix lint
* [Fix]: Rename qkv_bias to qv_bias
* [Fix]: Add docstring to weight_init of MAEAttention
* [Refactor]: Delete qv_bias param
* [Fix]: Add reference to fix_init_weight
* [Fix]: Fix lint
* [Fix]: Delete extra crop size
* [Refactor]: Rename mae
* [Fix]: Set bias to True
* [Fix]: Delete redundant params
* [Fix]: Fix lint
* [Fix]: Fix UT
* [Fix]: Add resize abs pos embed
* [Fix]: Fix UT
* [Refactor]: Use build layer
* [Fix]: Add licsense and fix docstring
* [Fix]: Fix docstring
* [Feature]: Add README metafile
* [Fix]: Change 640 to 512
* [Fix]: Fix README
* fix readme of MAE

Co-authored-by: MengzhangLI <[email protected]>
1 parent 69b28e0 commit 6563cb5

File tree

9 files changed: +672 -1 lines changed
configs/_base_/models/upernet_mae.py

Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained=None,
    backbone=dict(
        type='MAE',
        img_size=(640, 640),
        patch_size=16,
        in_channels=3,
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        mlp_ratio=4,
        out_indices=(3, 5, 7, 11),
        attn_drop_rate=0.0,
        drop_path_rate=0.1,
        norm_cfg=dict(type='LN', eps=1e-6),
        act_cfg=dict(type='GELU'),
        norm_eval=False,
        init_values=0.1),
    neck=dict(type='Feature2Pyramid', embed_dim=768, rescales=[4, 2, 1, 0.5]),
    decode_head=dict(
        type='UPerHead',
        in_channels=[384, 384, 384, 384],
        in_index=[0, 1, 2, 3],
        pool_scales=(1, 2, 3, 6),
        channels=512,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=384,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    # model training and testing settings
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))

configs/mae/README.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# MAE

[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

## Introduction

<!-- [BACKBONE] -->

<a href="https://github.com/facebookresearch/mae">Official Repo</a>

<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#46">Code Snippet</a>

## Abstract

<!-- [ABSTRACT] -->

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>
</div>

## Citation

```bibtex
@article{he2021masked,
  title={Masked autoencoders are scalable vision learners},
  author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv preprint arXiv:2111.06377},
  year={2021}
}
```

## Usage

To use pre-trained models from other repositories, it is necessary to convert their checkpoint keys first.

We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the keys of a MAE model from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.

```shell
python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```

E.g.

```shell
python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth
```

This script converts the model from `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.

In our default setting, the pretrained models and their corresponding original checkpoints are:

| pretrained models | original models |
| ----------------- | --------------- |
| mae_pretrain_vit_base_mmcls.pth | ['mae_pretrain_vit_base'](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) |

Verify the single-scale results of the model:

```shell
sh tools/dist_test.sh \
    configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py \
    upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

Since the relative position embedding requires the input height and width to be equal, sliding-window inference is adopted for multi-scale testing. We set `min_size=512`, i.e. the shortest edge is 512, and run multi-scale inference with a separate config instead of `--aug-test`. For multi-scale inference:

```shell
sh tools/dist_test.sh \
    configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py \
    upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

## Results and models

### ADE20K

| Method | Backbone | Crop Size | pretrain | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download |
| ------ | -------- | --------- | -------- | ----------------- | ---------- | ------- | -------- | -------------- | ----- | ------------: | ------ | -------- |
| UperNet | ViT-B | 512x512 | ImageNet-1K | 224x224 | 16 | 160000 | 9.96 | 7.14 | 48.13 | 48.70 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) |

configs/mae/mae.yml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
Models:
- Name: upernet_mae-base_fp16_8x2_512x512_160k_ade20k
  In Collection: UperNet
  Metadata:
    backbone: ViT-B
    crop size: (512,512)
    lr schd: 160000
    inference time (ms/im):
    - value: 140.06
      hardware: V100
      backend: PyTorch
      batch size: 1
      mode: FP16
      resolution: (512,512)
    Training Memory (GB): 9.96
  Results:
  - Task: Semantic Segmentation
    Dataset: ADE20K
    Metrics:
      mIoU: 48.13
      mIoU(ms+flip): 48.7
  Config: configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py
  Weights: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth
configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
_base_ = './upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py'

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=True,
        transforms=[
            dict(type='Resize', keep_ratio=True, min_size=512),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline),
    samples_per_gpu=2)
configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py

Lines changed: 48 additions & 0 deletions

@@ -0,0 +1,48 @@
_base_ = [
    '../_base_/models/upernet_mae.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]

model = dict(
    pretrained='./pretrain/mae_pretrain_vit_base_mmcls.pth',
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        patch_size=16,
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        mlp_ratio=4,
        init_values=1.0,
        drop_path_rate=0.1,
        out_indices=[3, 5, 7, 11]),
    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),
    decode_head=dict(
        in_channels=[768, 768, 768, 768], num_classes=150, channels=768),
    auxiliary_head=dict(in_channels=768, num_classes=150),
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))

optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.65))

lr_config = dict(
    _delete_=True,
    policy='poly',
    warmup='linear',
    warmup_iters=1500,
    warmup_ratio=1e-6,
    power=1.0,
    min_lr=0.0,
    by_epoch=False)

# mixed precision
fp16 = dict(loss_scale='dynamic')

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2)
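The `constructor='LayerDecayOptimizerConstructor'` entry above applies layer-wise learning-rate decay: parameter groups closer to the input keep a smaller fraction of the base learning rate. Below is a minimal sketch of the usual BEiT-style decay schedule with `num_layers=12` and `layer_decay_rate=0.65`; the exact parameter grouping and indexing used by the constructor are an assumption here, not taken from this commit.

```python
# Illustration only: approximate per-layer lr scaling of a BEiT-style
# layer-decay scheme. The real LayerDecayOptimizerConstructor may group
# and index parameters slightly differently.
base_lr = 1e-4          # optimizer lr from the config above
num_layers = 12         # transformer blocks in ViT-B
layer_decay_rate = 0.65

# Assumed mapping: layer_id 0 ~ patch/position embeddings,
# 1..12 ~ transformer blocks, 13 ~ decode/auxiliary heads.
for layer_id in range(num_layers + 2):
    scale = layer_decay_rate ** (num_layers + 1 - layer_id)
    print(f'layer {layer_id:2d}: lr scale {scale:.4f} -> lr {base_lr * scale:.2e}')
```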

mmseg/models/backbones/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -7,6 +7,7 @@
 from .fast_scnn import FastSCNN
 from .hrnet import HRNet
 from .icnet import ICNet
+from .mae import MAE
 from .mit import MixVisionTransformer
 from .mobilenet_v2 import MobileNetV2
 from .mobilenet_v3 import MobileNetV3
@@ -25,5 +26,5 @@
     'ResNeSt', 'MobileNetV2', 'UNet', 'CGNet', 'MobileNetV3',
     'VisionTransformer', 'SwinTransformer', 'MixVisionTransformer',
     'BiSeNetV1', 'BiSeNetV2', 'ICNet', 'TIMMBackbone', 'ERFNet', 'PCPVT',
-    'SVT', 'STDCNet', 'STDCContextPathNet', 'BEiT'
+    'SVT', 'STDCNet', 'STDCContextPathNet', 'BEiT', 'MAE'
 ]
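With `MAE` exported from `mmseg.models.backbones`, the backbone can be instantiated directly for a quick smoke test. A minimal sketch, not part of the commit: the argument values mirror the upernet_mae base config above, and the output shapes are assumed to follow the usual one-feature-map-per-`out_indices`-entry convention of ViT-style backbones in MMSegmentation.

```python
import torch
from mmseg.models.backbones import MAE

# Values mirror the upernet_mae base config; illustrative smoke test only.
backbone = MAE(
    img_size=(512, 512),
    patch_size=16,
    embed_dims=768,
    num_layers=12,
    num_heads=12,
    mlp_ratio=4,
    out_indices=(3, 5, 7, 11))
backbone.init_weights()

x = torch.randn(1, 3, 512, 512)
feats = backbone(x)  # assumed: one feature map per entry in out_indices
print([f.shape for f in feats])
```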
