
Commit 6563cb5

[Feature]: Add MAE (open-mmlab#1307)
* [Fix]: Fix lint
* [WIP]: Add mae seg config
* [Feature]: Add MAE seg
* [Fix]: Fix mae dataset img scale bug
* [Fix]: Fix lint
* [Feature]: Change mae config to mae_segmentation's config
* [Feature]: Add interpolate pe when loading
* [Fix]: Fix pos_embed not used bug
* [Fix]: Fix lint
* [Fix]: Init rel pos embed with zeros
* [Fix]: Fix lint
* [Fix]: Change the type name of backbone to MAE
* [Fix]: Delete ade20k_512x512.py
* [Fix]: Use mmseg provided ade20k.py
* [Fix]: Change 1 sample per gpu to 2 samples per gpu
* [Fix]: Fix conflict
* [Refactor]: Use the TransformerEncoderLayer of BEiT
* [Feature]: Add UT
* [Fix]: Change the default value of qv bias to False
* [Fix]: Initialize relative pos table with zeros
* [Fix]: Delete redundant code in mae
* [Fix]: Fix lint
* [Fix]: Rename qkv_bias to qv_bias
* [Fix]: Add docstring to weight_init of MAEAttention
* [Refactor]: Delete qv_bias param
* [Fix]: Add reference to fix_init_weight
* [Fix]: Fix lint
* [Fix]: Delete extra crop size
* [Refactor]: Rename mae
* [Fix]: Set bias to True
* [Fix]: Delete redundant params
* [Fix]: Fix lint
* [Fix]: Fix UT
* [Fix]: Add resize abs pos embed
* [Fix]: Fix UT
* [Refactor]: Use build layer
* [Fix]: Add licsense and fix docstring
* [Fix]: Fix docstring
* [Feature]: Add README metafile
* [Fix]: Change 640 to 512
* [Fix]: Fix README
* fix readme of MAE

Co-authored-by: MengzhangLI <[email protected]>
1 parent 69b28e0 commit 6563cb5

File tree

9 files changed: +672 -1 lines changed
configs/_base_/models/upernet_mae.py

Lines changed: 49 additions & 0 deletions

@@ -0,0 +1,49 @@
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    type='EncoderDecoder',
    pretrained=None,
    backbone=dict(
        type='MAE',
        img_size=(640, 640),
        patch_size=16,
        in_channels=3,
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        mlp_ratio=4,
        out_indices=(3, 5, 7, 11),
        attn_drop_rate=0.0,
        drop_path_rate=0.1,
        norm_cfg=dict(type='LN', eps=1e-6),
        act_cfg=dict(type='GELU'),
        norm_eval=False,
        init_values=0.1),
    neck=dict(type='Feature2Pyramid', embed_dim=768, rescales=[4, 2, 1, 0.5]),
    decode_head=dict(
        type='UPerHead',
        in_channels=[384, 384, 384, 384],
        in_index=[0, 1, 2, 3],
        pool_scales=(1, 2, 3, 6),
        channels=512,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
    auxiliary_head=dict(
        type='FCNHead',
        in_channels=384,
        in_index=2,
        channels=256,
        num_convs=1,
        concat_input=False,
        dropout_ratio=0.1,
        num_classes=19,
        norm_cfg=norm_cfg,
        align_corners=False,
        loss_decode=dict(
            type='CrossEntropyLoss', use_sigmoid=False, loss_weight=0.4)),
    # model training and testing settings
    train_cfg=dict(),
    test_cfg=dict(mode='whole'))

configs/mae/README.md

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
# MAE

[Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

## Introduction

<!-- [BACKBONE] -->

<a href="https://github.com/facebookresearch/mae">Official Repo</a>

<a href="https://github.com/open-mmlab/mmsegmentation/blob/v0.24.0/mmseg/models/backbones/mae.py#46">Code Snippet</a>

## Abstract

<!-- [ABSTRACT] -->

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/24582831/165456416-1cba54bf-b1b5-4bdf-ad86-d6390de7f342.png" width="70%"/>
</div>

## Citation

```bibtex
@article{he2021masked,
  title={Masked autoencoders are scalable vision learners},
  author={He, Kaiming and Chen, Xinlei and Xie, Saining and Li, Yanghao and Doll{\'a}r, Piotr and Girshick, Ross},
  journal={arXiv preprint arXiv:2111.06377},
  year={2021}
}
```

## Usage

To use pre-trained models from other repositories, it is necessary to convert their checkpoint keys first.

We provide a script [`beit2mmseg.py`](../../tools/model_converters/beit2mmseg.py) in the tools directory to convert the keys of a MAE model from [the official repo](https://github.com/facebookresearch/mae) to MMSegmentation style.

```shell
python tools/model_converters/beit2mmseg.py ${PRETRAIN_PATH} ${STORE_PATH}
```

E.g.

```shell
python tools/model_converters/beit2mmseg.py https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth pretrain/mae_pretrain_vit_base_mmcls.pth
```

This script converts the model from `PRETRAIN_PATH` and stores the converted model in `STORE_PATH`.

In our default setting, the pretrained models and their corresponding original checkpoints are:

| pretrained models | original models |
| ----------------- | --------------- |
| mae_pretrain_vit_base_mmcls.pth | ['mae_pretrain_vit_base'](https://dl.fbaipublicfiles.com/mae/pretrain/mae_pretrain_vit_base.pth) |

Verify the single-scale results of the model:

```shell
sh tools/dist_test.sh \
    configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py \
    upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

Since the relative position embedding requires the input height and width to be equal, sliding-window inference is adopted for multi-scale testing. We set `min_size=512`, i.e. the shortest edge is 512, and run multi-scale inference with a separate config instead of `--aug-test`. For multi-scale inference:

```shell
sh tools/dist_test.sh \
    configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py \
    upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth $GPUS --eval mIoU
```

## Results and models

### ADE20K

| Method | Backbone | Crop Size | pretrain | pretrain img size | Batch Size | Lr schd | Mem (GB) | Inf time (fps) | mIoU | mIoU(ms+flip) | config | download |
| ------ | -------- | --------- | -------- | ----------------- | ---------- | ------- | -------- | -------------- | ----- | ------------: | ------ | -------- |
| UperNet | ViT-B | 512x512 | ImageNet-1K | 224x224 | 16 | 160000 | 9.96 | 7.14 | 48.13 | 48.70 | [config](https://github.com/open-mmlab/mmsegmentation/blob/master/configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py) | [model](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth) &#124; [log](https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752.log.json) |

configs/mae/mae.yml

Lines changed: 23 additions & 0 deletions
@@ -0,0 +1,23 @@
Models:
- Name: upernet_mae-base_fp16_8x2_512x512_160k_ade20k
  In Collection: UperNet
  Metadata:
    backbone: ViT-B
    crop size: (512,512)
    lr schd: 160000
    inference time (ms/im):
    - value: 140.06
      hardware: V100
      backend: PyTorch
      batch size: 1
      mode: FP16
      resolution: (512,512)
    Training Memory (GB): 9.96
  Results:
  - Task: Semantic Segmentation
    Dataset: ADE20K
    Metrics:
      mIoU: 48.13
      mIoU(ms+flip): 48.7
  Config: configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py
  Weights: https://download.openmmlab.com/mmsegmentation/v0.5/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k/upernet_mae-base_fp16_8x2_512x512_160k_ade20k_20220426_174752-f92a2975.pth
configs/mae/upernet_mae-base_fp16_512x512_160k_ade20k_ms.py

Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
_base_ = './upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py'

img_norm_cfg = dict(
    mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(2048, 512),
        img_ratios=[0.5, 0.75, 1.0, 1.25, 1.5, 1.75],
        flip=True,
        transforms=[
            dict(type='Resize', keep_ratio=True, min_size=512),
            dict(type='RandomFlip'),
            dict(type='Normalize', **img_norm_cfg),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]
data = dict(
    val=dict(pipeline=test_pipeline),
    test=dict(pipeline=test_pipeline),
    samples_per_gpu=2)
configs/mae/upernet_mae-base_fp16_8x2_512x512_160k_ade20k.py

Lines changed: 48 additions & 0 deletions

@@ -0,0 +1,48 @@
_base_ = [
    '../_base_/models/upernet_mae.py', '../_base_/datasets/ade20k.py',
    '../_base_/default_runtime.py', '../_base_/schedules/schedule_160k.py'
]

model = dict(
    pretrained='./pretrain/mae_pretrain_vit_base_mmcls.pth',
    backbone=dict(
        type='MAE',
        img_size=(512, 512),
        patch_size=16,
        embed_dims=768,
        num_layers=12,
        num_heads=12,
        mlp_ratio=4,
        init_values=1.0,
        drop_path_rate=0.1,
        out_indices=[3, 5, 7, 11]),
    neck=dict(embed_dim=768, rescales=[4, 2, 1, 0.5]),
    decode_head=dict(
        in_channels=[768, 768, 768, 768], num_classes=150, channels=768),
    auxiliary_head=dict(in_channels=768, num_classes=150),
    test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))

optimizer = dict(
    _delete_=True,
    type='AdamW',
    lr=1e-4,
    betas=(0.9, 0.999),
    weight_decay=0.05,
    constructor='LayerDecayOptimizerConstructor',
    paramwise_cfg=dict(num_layers=12, layer_decay_rate=0.65))

lr_config = dict(
    _delete_=True,
    policy='poly',
    warmup='linear',
    warmup_iters=1500,
    warmup_ratio=1e-6,
    power=1.0,
    min_lr=0.0,
    by_epoch=False)

# mixed precision
fp16 = dict(loss_scale='dynamic')

# By default, models are trained on 8 GPUs with 2 images per GPU
data = dict(samples_per_gpu=2)
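The `constructor='LayerDecayOptimizerConstructor'` entry above applies layer-wise learning-rate decay: parameter groups closer to the input keep a smaller fraction of the base learning rate. Below is a minimal sketch of the usual BEiT-style decay schedule with `num_layers=12` and `layer_decay_rate=0.65`; the exact parameter grouping and indexing used by the constructor are an assumption here, not taken from this commit.

```python
# Illustration only: approximate per-layer lr scaling of a BEiT-style
# layer-decay scheme. The real LayerDecayOptimizerConstructor may group
# and index parameters slightly differently.
base_lr = 1e-4          # optimizer lr from the config above
num_layers = 12         # transformer blocks in ViT-B
layer_decay_rate = 0.65

# Assumed mapping: layer_id 0 ~ patch/position embeddings,
# 1..12 ~ transformer blocks, 13 ~ decode/auxiliary heads.
for layer_id in range(num_layers + 2):
    scale = layer_decay_rate ** (num_layers + 1 - layer_id)
    print(f'layer {layer_id:2d}: lr scale {scale:.4f} -> lr {base_lr * scale:.2e}')
```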

mmseg/models/backbones/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -7,6 +7,7 @@
 from .fast_scnn import FastSCNN
 from .hrnet import HRNet
 from .icnet import ICNet
+from .mae import MAE
 from .mit import MixVisionTransformer
 from .mobilenet_v2 import MobileNetV2
 from .mobilenet_v3 import MobileNetV3
@@ -25,5 +26,5 @@
     'ResNeSt', 'MobileNetV2', 'UNet', 'CGNet', 'MobileNetV3',
     'VisionTransformer', 'SwinTransformer', 'MixVisionTransformer',
     'BiSeNetV1', 'BiSeNetV2', 'ICNet', 'TIMMBackbone', 'ERFNet', 'PCPVT',
-    'SVT', 'STDCNet', 'STDCContextPathNet', 'BEiT'
+    'SVT', 'STDCNet', 'STDCContextPathNet', 'BEiT', 'MAE'
 ]
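With `MAE` exported from `mmseg.models.backbones`, the backbone can be instantiated directly for a quick smoke test. A minimal sketch, not part of the commit: the argument values mirror the upernet_mae base config above, and the output shapes are assumed to follow the usual one-feature-map-per-`out_indices`-entry convention of ViT-style backbones in MMSegmentation.

```python
import torch
from mmseg.models.backbones import MAE

# Values mirror the upernet_mae base config; illustrative smoke test only.
backbone = MAE(
    img_size=(512, 512),
    patch_size=16,
    embed_dims=768,
    num_layers=12,
    num_heads=12,
    mlp_ratio=4,
    out_indices=(3, 5, 7, 11))
backbone.init_weights()

x = torch.randn(1, 3, 512, 512)
feats = backbone(x)  # assumed: one feature map per entry in out_indices
print([f.shape for f in feats])
```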
