
Commit 13f9516

Merge pull request CSAILVision#20 from hangzhaomit/xtt
Major updates with synchronized BN and dynamic input scales.
2 parents 51fe6fd + 158ed59 commit 13f9516

31 files changed (+24,610 −885 lines)

README.md

Lines changed: 98 additions & 71 deletions
@@ -8,138 +8,156 @@ https://github.com/CSAILVision/sceneparsing
Pretrained models can be found at:
http://sceneparsing.csail.mit.edu/model/

-<img src="./teaser/validation_ADE_val_00000278.png" width="900"/>
-<img src="./teaser/validation_ADE_val_00001519.png" width="900"/>
-From left to right: Test Image, Ground Truth, Predicted Result
+<img src="./teaser/ADE_val_00000278.png" width="900"/>
+<img src="./teaser/ADE_val_00001519.png" width="900"/>
+[From left to right: Test Image, Ground Truth, Predicted Result]

-## Supported models:
+## Highlights [NEW!]
+
+### Synchronized Batch Normalization on PyTorch
+This module differs from the built-in PyTorch BatchNorm in that the mean and standard deviation are reduced across all devices during training. The importance of synchronized batch normalization in object detection was recently demonstrated by the extensive analysis in the paper [MegDet: A Large Mini-Batch Object Detector](https://arxiv.org/abs/1711.07240), and we empirically find that it is also important for segmentation.
+
+This implementation is preferable for the following reasons:
+- It is pure Python; no extra C++ extension libraries are needed.
+- It is easy to use.
+- It is completely compatible with PyTorch's implementation. Specifically, it uses unbiased variance to update the moving average, and uses sqrt(max(var, eps)) instead of sqrt(var + eps).
+
+***To the best of our knowledge, it is the first pure-Python implementation of synchronized BN on PyTorch, and also the first one completely compatible with PyTorch. It is also efficient, only 20% to 30% slower than unsynchronized BN.*** We especially thank [Jiayuan Mao](http://vccy.xyz/) for his kind contributions. For more details about the implementation and usage, refer to [Synchronized-BatchNorm-PyTorch](https://github.com/vacancy/Synchronized-BatchNorm-PyTorch).
+
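A minimal usage sketch, assuming the `sync_batchnorm` package from the Synchronized-BatchNorm-PyTorch repository linked above is importable; the module names follow that repository's README and are not part of this repo's code:

```python
# Minimal sketch: drop-in replacement of nn.BatchNorm2d with the synchronized
# version from https://github.com/vacancy/Synchronized-BatchNorm-PyTorch.
import torch
import torch.nn as nn
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    SynchronizedBatchNorm2d(16),      # statistics reduced across all GPUs
    nn.ReLU(inplace=True),
)

# DataParallelWithCallback replicates the module and wires up the cross-GPU
# synchronization before each forward pass (plain nn.DataParallel would not).
model = DataParallelWithCallback(model.cuda(), device_ids=[0, 1])
out = model(torch.randn(8, 3, 64, 64).cuda())
```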
+### Dynamic scales of input for training with multiple GPUs
+Unlike the image classification task, where input images are resized to a fixed scale such as 224x224, it is better to keep the original aspect ratios of input images for semantic segmentation and object detection networks.
+
+So we re-implement the `DataParallel` module and make it support distributing data to multiple GPUs as a Python dict. The dataloader also operates differently: *the batch size of the dataloader now always equals the number of GPUs*, and each element of the batch is sent to one GPU. It is also compatible with multi-processing. Note that the file index for the multi-processing dataloader is stored on the master process, which contradicts our goal that each worker maintain its own file list. So we use a trick: although the master process still passes an index to the `__getitem__` function, we simply ignore it and return a random batch dict instead. Also, *the multiple workers forked by the dataloader all share the same seed*, so if we used this trick directly, all workers would yield exactly the same data. Therefore, we add one line of code that sets the default seed for `numpy.random` before spawning the dataloader workers.
+
+
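As an illustration only (not the repository's exact code), the index-ignoring dataset and the one-line `numpy.random` seeding described above might look roughly like this; the class name, dict keys, and shapes are hypothetical:

```python
# Illustrative sketch of the "ignore the index, return a random batch dict"
# trick described above; class name, dict keys, and shapes are hypothetical.
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

def scattered_collate(batch):
    # keep the per-GPU dicts as a list instead of stacking them into one batch
    return batch

class RandomBatchDataset(Dataset):
    def __init__(self, list_sample, batch_per_gpu=2):
        self.list_sample = list_sample        # every worker keeps the full file list
        self.batch_per_gpu = batch_per_gpu
        self.seeded = False

    def __len__(self):
        return 10000                          # nominal length; real bookkeeping omitted

    def __getitem__(self, index):
        # one-time seeding inside the worker: the index handed over by the master
        # differs per call, so this de-correlates the forked numpy RNG states
        if not self.seeded:
            np.random.seed(index)
            self.seeded = True
        # the master's index is otherwise ignored; draw a random sub-batch instead
        picks = np.random.choice(len(self.list_sample), self.batch_per_gpu)
        imgs = torch.randn(self.batch_per_gpu, 3, 512, 512)        # placeholder "images"
        segs = torch.zeros(self.batch_per_gpu, 512, 512).long()    # placeholder labels
        return {'img_data': imgs, 'seg_label': segs, 'files': picks.tolist()}

# the dataloader batch size equals the number of GPUs; each dict goes to one GPU
loader = DataLoader(RandomBatchDataset(['a.jpg', 'b.jpg', 'c.jpg']), batch_size=2,
                    shuffle=False, num_workers=2, collate_fn=scattered_collate)
```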
+## Supported models
We split our models into encoder and decoder, where encoders are usually modified directly from classification networks, and decoders consist of final convolutions and upsampling.

-Encoder:
-- vgg16_dilated8
-- vgg19_dilated8
-- resnet34_dilated16
-- resnet34_dilated8
-- resnet50_dilated16
-- resnet50_dilated8
+Encoder (resnetXX_dilatedYY: customized resnetXX with dilated convolutions; the output feature map is 1/YY of the input size):
+- resnet34_dilated16, resnet34_dilated8
+- resnet50_dilated16, resnet50_dilated8

-(resnetXX_dilatedYY: customized resnetXX with dilated convolutions, output feature map is 1/YY of input size.)
+***Coming soon***:
+- resnet101_dilated16, resnet101_dilated8

Decoder:
- c1_bilinear (1 conv + bilinear upsample)
+- c1_bilinear_deepsup (c1_bilinear + deep supervision trick; see the sketch after this list)
- psp_bilinear (pyramid pooling + bilinear upsample, see PSPNet paper for details)
+- psp_bilinear_deepsup (psp_bilinear + deep supervision trick)
+
+***Coming soon***:
+- UPerNet based on Feature Pyramid Network (FPN) and Pyramid Pooling Module (PPM), with down-sampling rates of 4, 8, and 16. It does not need dilated convolution, an operator that is time- and memory-consuming. *Without bells and whistles*, it is comparable to or even better than PSPNet, while requiring much shorter training time and less GPU memory.
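To make the encoder/decoder split and the deep supervision trick concrete, here is a rough, hypothetical sketch of a c1_bilinear-style decoder with an auxiliary head; it is not the repository's actual module, and the layer shapes and the `deep_sup_scale` weighting in the comment are assumptions:

```python
# Hypothetical sketch of a "1 conv + bilinear upsample" decoder with a deep
# supervision head; NOT the repository's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class C1BilinearDeepSup(nn.Module):
    def __init__(self, fc_dim=2048, num_class=150):
        super().__init__()
        # main head: a single 1x1 conv on the final encoder feature map
        self.conv_last = nn.Conv2d(fc_dim, num_class, kernel_size=1)
        # auxiliary head attached to a shallower feature map (deep supervision)
        self.conv_aux = nn.Conv2d(fc_dim // 2, num_class, kernel_size=1)

    def forward(self, feats, seg_size):
        feat_aux, feat = feats            # e.g. an intermediate and the final feature map
        out = F.interpolate(self.conv_last(feat), size=seg_size,
                            mode='bilinear', align_corners=False)
        out_aux = F.interpolate(self.conv_aux(feat_aux), size=seg_size,
                                mode='bilinear', align_corners=False)
        return out, out_aux

# During training the auxiliary loss would be down-weighted, roughly:
#   loss = crit(out, label) + deep_sup_scale * crit(out_aux, label)
```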


## Performance:
-IMPORTANT: One obstacle to a good dilated ResNet model is that batch normalization layers are usually not well trained with a small batch size (<16). Ideally, batch size >64 will get you the best results. In this repo, we trained customized ResNet on Places365 (will be automatically downloaded when needed) as the initialization for scene parsing model, which partly solved the problem. You can simply set ```--fix_bn 1``` to freeze BN parameters during training.
+IMPORTANT: We use our self-trained base model on ImageNet. The model takes input in BGR form (consistent with OpenCV), instead of the RGB form used by PyTorch's default implementation. The base model will be automatically downloaded when needed.
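Since the base model expects BGR input, an image loaded in RGB order (e.g. with PIL) needs its channels flipped before the forward pass; a minimal illustration, not the repository's preprocessing code:

```python
# Minimal illustration of converting an RGB image to the BGR layout the base
# model expects; not the repository's preprocessing pipeline.
import numpy as np
from PIL import Image
import torch

img = np.array(Image.open('example.jpg').convert('RGB'))   # H x W x 3, RGB order
img_bgr = img[:, :, ::-1].copy()                            # flip channels to BGR
tensor = torch.from_numpy(img_bgr.transpose(2, 0, 1)).float().unsqueeze(0)
# note: cv2.imread already returns BGR, so no flip is needed with OpenCV loading
```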

<table><tbody>
-<th valign="bottom">Encoder</th>
-<th valign="bottom">Decoder</th>
+<th valign="bottom">Architecture</th>
+<th valign="bottom">MS Test</th>
<th valign="bottom">Mean IoU</th>
<th valign="bottom">Pixel Accuracy</th>
+<th valign="bottom">Overall Score</th>
+<th valign="bottom">Training Time</th>
+<tr>
+<td>ResNet-50_dilated8 + c1_bilinear_deepsup</td>
+<td>No</td><td>34.88</td><td>76.54</td><td>55.71</td>
+<td>27.5 hours</td>
+</tr>
+<tr>
+<td rowspan="2">ResNet-50_dilated8 + psp_bilinear_deepsup</td>
+<td>No</td><td>40.60</td><td>79.66</td><td>60.13</td>
+<td rowspan="2">33.4 hours</td>
+</tr>
<tr>
-<td>resnet34_dilated8</td>
-<td>c1_bilinear</td>
-<td>0.3277</td>
-<td>76.47%</td>
+<td>Yes</td><td>41.31</td><td>80.14</td><td>60.73</td>
</tr>
<tr>
-<td>resnet34_dilated8</td>
-<td>psp_bilinear</td>
-<td>0.3634</td>
-<td>77.98%</td>
+<td>ResNet-101_dilated8 + c1_bilinear_deepsup</td>
+<td>-</td><td>-</td><td>-</td><td>-</td>
+<td>- hours</td>
</tr>
<tr>
-<td>resnet50_dilated8</td>
-<td>c1_bilinear</td>
-<td>0.3385</td>
-<td>76.40%</td>
+<td>ResNet-101_dilated8 + psp_bilinear_deepsup</td>
+<td>-</td><td>-</td><td>-</td><td>-</td>
+<td>- hours</td>
</tr>
<tr>
-<td>resnet50_dilated8</td>
-<td>psp_bilinear</td>
-<td>0.3800</td>
-<td>78.21%</td>
+<td>UPerNet-50 (coming soon!)</td>
+<td>-</td><td>-</td><td>-</td><td>-</td>
+<td>- hours</td>
+</tr>
+<tr>
+<td>UPerNet-101 (coming soon!)</td>
+<td>-</td><td>-</td><td>-</td><td>-</td>
+<td>- hours</td>
</tr>
</tbody></table>
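The table does not define "Overall Score"; judging from the reported numbers it appears to be the arithmetic mean of Mean IoU and Pixel Accuracy (an observation, not an official definition):

```python
# Checking the apparent meaning of "Overall Score" against the table values.
mean_iou, pixel_acc = 40.60, 79.66
overall = (mean_iou + pixel_acc) / 2    # 60.13, matching the reported value
```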

## Environment
The code is developed under the following configurations.
- Hardware: 2-8 Pascal Titan X GPUs (change ```[--num_gpus NUM_GPUS]``` accordingly)
-- Software: Ubuntu 14.04, CUDA8.0, Python2.7, PyTorch 0.2.0
+- Software: Ubuntu 16.04.3 LTS, CUDA 8.0, ***Python 3.5***, ***PyTorch 0.4.0***
+
+*Warning:* We no longer support the outdated Python 2. PyTorch 0.4.0 or higher is required to run the code.

## Training
1. Download the ADE20K scene parsing dataset:
```bash
chmod +x download_ADE20K.sh
./download_ADE20K.sh
```
-2. Train a network (default: resnet34_dilated8). During training, checkpoints will be saved in folder ```ckpt```, visual results will be saved in folder ```vis```.
+2. Train a network (default: ResNet-50_dilated8 + psp_bilinear_deepsup). During training, checkpoints will be saved in folder ```ckpt```.
```bash
-python train.py
+python3 train.py --num_gpus NUM_GPUS
```

-3. Input arguments: (see full input arguments via ```python train.py -h ```)
+3. Input arguments: (see full input arguments via ```python3 train.py -h```)
```bash
usage: train.py [-h] [--id ID] [--arch_encoder ARCH_ENCODER]
                [--arch_decoder ARCH_DECODER]
                [--weights_encoder WEIGHTS_ENCODER]
                [--weights_decoder WEIGHTS_DECODER] [--fc_dim FC_DIM]
                [--list_train LIST_TRAIN] [--list_val LIST_VAL]
-                [--root_img ROOT_IMG] [--root_seg ROOT_SEG]
-                [--num_gpus NUM_GPUS]
+                [--root_dataset ROOT_DATASET] [--num_gpus NUM_GPUS]
                [--batch_size_per_gpu BATCH_SIZE_PER_GPU]
-                [--num_epoch NUM_EPOCH] [--optim OPTIM]
-                [--lr_encoder LR_ENCODER] [--lr_decoder LR_DECODER]
-                [--beta1 BETA1]
-                [--weight_decay WEIGHT_DECAY] [--fix_bn FIX_BN]
-                [--num_val NUM_VAL] [--workers WORKERS] [--imgSize IMGSIZE]
-                [--segSize SEGSIZE] [--num_class NUM_CLASS]
-                [--seed SEED] [--ckpt CKPT] [--vis VIS]
-                [--disp_iter DISP_ITER] [--eval_epoch EVAL_EPOCH]
-                [--ckpt_epoch CKPT_EPOCH]
+                [--num_epoch NUM_EPOCH] [--epoch_iters EPOCH_ITERS]
+                [--optim OPTIM] [--lr_encoder LR_ENCODER]
+                [--lr_decoder LR_DECODER] [--lr_pow LR_POW] [--beta1 BETA1]
+                [--weight_decay WEIGHT_DECAY]
+                [--deep_sup_scale DEEP_SUP_SCALE] [--fix_bn FIX_BN]
+                [--num_class NUM_CLASS] [--workers WORKERS]
+                [--imgSize IMGSIZE] [--imgMaxSize IMGMAXSIZE]
+                [--padding_constant PADDING_CONSTANT]
+                [--segm_downsampling_rate SEGM_DOWNSAMPLING_RATE]
+                [--random_flip RANDOM_FLIP] [--seed SEED] [--ckpt CKPT]
+                [--disp_iter DISP_ITER]
```


## Evaluation
-1. Evaluate a trained network on the validation set:
+1. Evaluate a trained network on the validation set. Add the ```--visualize``` option to output the visualizations shown in the teaser.
```bash
-python eval.py --id MODEL_ID
+python3 eval.py --id MODEL_ID --suffix SUFFIX
```

-2. Input arguments: (see full input arguments via ```python eval.py -h ```)
+2. Input arguments: (see full input arguments via ```python3 eval.py -h```)
```bash
usage: eval.py [-h] --id ID [--suffix SUFFIX] [--arch_encoder ARCH_ENCODER]
               [--arch_decoder ARCH_DECODER] [--fc_dim FC_DIM]
-               [--list_val LIST_VAL] [--root_img ROOT_IMG]
-               [--root_seg ROOT_SEG] [--num_val NUM_VAL]
+               [--list_val LIST_VAL] [--root_dataset ROOT_DATASET]
+               [--num_val NUM_VAL] [--num_class NUM_CLASS]
               [--batch_size BATCH_SIZE] [--imgSize IMGSIZE]
-               [--segSize SEGSIZE] [--num_class NUM_CLASS] [--ckpt CKPT]
-               [--visualize VISUALIZE] [--result RESULT]
-```
-
-
-## Test/Inference
-1. Here is a simple demo to do inference on a single image:
-```bash
-chmod +x demo_test.sh
-./demo_test.sh
+               [--imgMaxSize IMGMAXSIZE] [--padding_constant PADDING_CONSTANT]
+               [--segm_downsampling_rate SEGM_DOWNSAMPLING_RATE] [--ckpt CKPT]
+               [--visualize] [--result RESULT] [--gpu_id GPU_ID]
```
-This script downloads pretrained models and a test image, runs the test script, and saves predicted segmentation (.png) to the working directory.

-2. Input arguments: (see full input arguments via ```python test.py -h ```)
-```bash
-usage: test.py [-h] --test_img TEST_IMG --model_path MODEL_PATH
-               [--suffix SUFFIX] [--result RESULT]
-               [--arch_encoder ARCH_ENCODER] [--arch_decoder ARCH_DECODER]
-               [--fc_dim FC_DIM] [--num_class NUM_CLASS] [--imgSize IMGSIZE]
-               [--segSize SEGSIZE]
-```

## Reference

-If you find the code or pre-trained models useful, please cite the following paper:
+If you find the code or pre-trained models useful, please cite the following papers:

Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. Computer Vision and Pattern Recognition (CVPR), 2017. (http://people.csail.mit.edu/bzhou/publication/scene-parse-camera-ready.pdf)

@@ -150,6 +168,15 @@ Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. B
year={2017}
}

+Unified Perceptual Parsing for Scene Understanding. T. Xiao, Y. Liu, B. Zhou, Y. Jiang and J. Sun. arXiv preprint, 2018.
+
+@article{xiao2018unified,
+title={Unified Perceptual Parsing for Scene Understanding},
+author={Xiao, Tete and Liu, Yingcheng and Zhou, Bolei and Jiang, Yuning and Sun, Jian},
+journal={arXiv preprint},
+year={2018}
+}
+
Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442. (https://arxiv.org/pdf/1608.05442.pdf)

@article{zhou2016semantic,
