From left to right: Test Image, Ground Truth, Predicted Result
## Highlights
### Synchronized Batch Normalization on PyTorch
This module differs from the built-in PyTorch BatchNorm in that the mean and standard deviation are reduced across all devices during training. For example, when `nn.DataParallel` is used to wrap the network during training, PyTorch's implementation normalizes the tensor on each device using only the statistics of that device, which accelerates the computation and is easy to implement, but the statistics can be inaccurate. In this synchronized version, the statistics are instead computed over all training samples distributed across the multiple devices.
The importance of synchronized batch normalization in object detection has recently been demonstrated by the extensive analysis in the paper [MegDet: A Large Mini-Batch Object Detector](https://arxiv.org/abs/1711.07240). We empirically find that it is also important for segmentation.
This implementation is appealing for the following reasons:
- This implementation is in pure Python; no extra C++ extension libraries are needed.
- Easy to use.
- It is completely compatible with PyTorch's implementation. Specifically, it uses unbiased variance to update the moving average, and uses `sqrt(max(var, eps))` instead of `sqrt(var + eps)`.
***To the best of our knowledge, this is the first pure-Python implementation of sync BN on PyTorch, and also the first one that is completely compatible with PyTorch. It is also efficient, only 20% to 30% slower than unsynchronized BN.*** We especially thank [Jiayuan Mao](http://vccy.xyz/) for his kind contributions. For more details about the implementation and usage, refer to [Synchronized-BatchNorm-PyTorch](https://github.com/vacancy/Synchronized-BatchNorm-PyTorch).
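For illustration, below is a minimal usage sketch based on the standalone Synchronized-BatchNorm-PyTorch package; the `sync_batchnorm` import path and the two-GPU setup are assumptions of this sketch, not necessarily how this repo wires things internally.

```python
import torch.nn as nn
from sync_batchnorm import SynchronizedBatchNorm2d, DataParallelWithCallback

# Use SynchronizedBatchNorm2d wherever nn.BatchNorm2d would normally go.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    SynchronizedBatchNorm2d(64),  # mean/var reduced across all GPUs, not per device
    nn.ReLU(inplace=True),
)

# DataParallelWithCallback replaces nn.DataParallel; the callback performs the
# cross-GPU synchronization of batch statistics during the forward pass.
model = DataParallelWithCallback(model, device_ids=[0, 1]).cuda()
```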
### Dynamic scales of input for training with multiple GPUs
Different from the image classification task, where input images are resized to a fixed scale such as 224x224 or 299x299, semantic segmentation and object detection networks work better when input images keep their original aspect ratios. This is not trivial in PyTorch, because the dataloader first loads a pile of images and then the `nn.DataParallel` module automatically splits them across multiple GPUs. That is, the images are concatenated first and distributed afterwards, and the concatenation of course requires the images to be of the same size.
To address this, we re-implement the `DataParallel` module and make it support distributing data to multiple GPUs as Python dicts, so you are free to put all kinds of data in a dict. At the same time, the dataloader also operates differently: *the batch size of the dataloader always equals the number of GPUs*, because each element of the batch will later be sent to one GPU. If, for example, you would like to put 2 images on each GPU, it is the dataloader's job to pack them into a single element.

We also need to make this compatible with multi-processing. Note that the file index for the multi-processing dataloader is kept on the master process, which contradicts our goal that each worker maintains its own file list. So we use a trick: although the master process still passes an index to the `__getitem__` function, we simply ignore it and return a random batch dict. Also, *the multiple workers forked by the dataloader all share the same seed*; if we used the above trick directly, they would all yield exactly the same data. Therefore, we add one line of code that sets the default seed for `numpy.random` before activating the multiple workers in the dataloader. A minimal sketch of this scheme is given below.
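The dataset class, its field names, and the two-GPU configuration in this sketch are illustrative assumptions, not the repo's actual classes.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

NUM_GPUS = 2          # assumed number of GPUs for this sketch
IMGS_PER_GPU = 2      # how many images each GPU should receive per iteration

class RandomBatchDataset(Dataset):
    """Each __getitem__ call ignores the given index and returns one GPU's batch as a dict."""
    def __init__(self, file_list):
        self.file_list = file_list

    def __len__(self):
        return 10 ** 8  # effectively unlimited; the training loop decides when to stop

    def __getitem__(self, index):
        # Ignore `index` from the master process; each worker samples its own files.
        picks = np.random.randint(len(self.file_list), size=IMGS_PER_GPU)
        images = torch.randn(IMGS_PER_GPU, 3, 512, 512)  # placeholder for real image loading
        return {'img_data': images, 'paths': [self.file_list[i] for i in picks]}

def keep_as_list(batch):
    # Do not collate: each element of the list is one GPU's dict.
    return batch

def worker_init_fn(worker_id):
    # Give every forked worker a different numpy seed; otherwise all workers
    # would produce exactly the same "random" batches.
    np.random.seed((torch.initial_seed() + worker_id) % 2 ** 32)

# The dataloader batch size equals the number of GPUs: the customized DataParallel
# later sends one dict from the list to each GPU.
loader = DataLoader(RandomBatchDataset(['a.jpg', 'b.jpg', 'c.jpg']),
                    batch_size=NUM_GPUS, collate_fn=keep_as_list,
                    num_workers=2, worker_init_fn=worker_init_fn)
```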
## Supported models
We split our models into encoder and decoder, where encoders are usually modified directly from classification networks, and decoders consist of final convolutions and upsampling.
Encoder:
- resnet50_dilated16
- resnet50_dilated8
***Coming soon***:
- resnet101_dilated16
- resnet101_dilated8
(resnetXX_dilatedYY: a customized resnetXX with dilated convolutions; the output feature map is 1/YY of the input size. A sketch of the dilation trick is given below.)
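The following is a rough sketch of the general dilation trick applied to a stock torchvision ResNet-50, not this repo's customized encoder: the strides of the later stages are removed and their 3x3 convolutions are dilated instead, so the output stride drops from 32 to 8 while the receptive field is preserved.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def dilate_stage(stage, dilation):
    """Remove spatial downsampling in a ResNet stage and dilate its 3x3 convolutions."""
    for m in stage.modules():
        if isinstance(m, nn.Conv2d):
            if m.stride == (2, 2):
                m.stride = (1, 1)                  # drop the downsampling
            if m.kernel_size == (3, 3):
                m.dilation = (dilation, dilation)  # keep the receptive field
                m.padding = (dilation, dilation)

resnet = models.resnet50()
dilate_stage(resnet.layer3, dilation=2)
dilate_stage(resnet.layer4, dilation=4)

# Keep only the convolutional trunk (drop average pooling and the classifier head).
encoder = nn.Sequential(*list(resnet.children())[:-2])
feat = encoder(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 2048, 28, 28]) -> 28 = 224 / 8
```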
Decoder:
- c1_bilinear (1 conv + bilinear upsample; see the sketch after this list)
- c1_bilinear_deepsup (c1_bilinear + deep supervision trick)
- psp_bilinear (pyramid pooling + bilinear upsample, see PSPNet paper for details)
- psp_bilinear_deepsup (psp_bilinear + deep supervision trick)
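For concreteness, here is a rough sketch of what the simplest decoder, c1_bilinear, boils down to. The module is illustrative rather than the repo's exact implementation, and `fc_dim=2048` / `num_class=150` are example values (150 is the number of ADE20K classes).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class C1Bilinear(nn.Module):
    """One conv block on the encoder features, a 1x1 classifier, then bilinear upsampling."""
    def __init__(self, fc_dim=2048, num_class=150):
        super().__init__()
        self.cbr = nn.Sequential(
            nn.Conv2d(fc_dim, fc_dim // 4, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(fc_dim // 4),  # a synchronized BatchNorm would go here for multi-GPU training
            nn.ReLU(inplace=True),
        )
        self.conv_last = nn.Conv2d(fc_dim // 4, num_class, kernel_size=1)

    def forward(self, feat, seg_size):
        x = self.conv_last(self.cbr(feat))
        # Upsample the class logits back to the original image resolution.
        return F.interpolate(x, size=seg_size, mode='bilinear', align_corners=False)

decoder = C1Bilinear()
logits = decoder(torch.randn(1, 2048, 28, 28), seg_size=(224, 224))
print(logits.shape)  # torch.Size([1, 150, 224, 224])
```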
***Coming soon***:
- UPerNet, based on the Feature Pyramid Network (FPN) and the Pyramid Pooling Module (PPM), with down-sampling rates of 4, 8 and 16. It does not need dilated convolution, an operator that is time- and memory-consuming. It is comparable to or even better than PSPNet *with bells and whistles*, while requiring much shorter training time and less GPU memory.
## Performance
IMPORTANT: We use our self-trained base model on ImageNet. The model takes input in BGR form (consistent with OpenCV), instead of the RGB form used by PyTorch's default implementations. The base model will be automatically downloaded when needed.
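For example, an RGB tensor (as produced by torchvision/PIL pipelines) can be converted to the expected BGR order by flipping the channel dimension before the forward pass; images read with `cv2.imread` are already BGR. This is a generic sketch, not the repo's own preprocessing code.

```python
import torch

rgb = torch.rand(1, 3, 384, 384)   # N x C x H x W, channels in RGB order
bgr = rgb[:, [2, 1, 0], :, :]      # reorder channels to BGR before feeding the model
```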
### Main Results
|| MS Test | Mean IoU | Accuracy | Overall | Training Time |
- Software: Ubuntu 16.04.3 LTS, CUDA 8.0, ***Python3.5***, ***PyTorch 0.4.0***
*Warning:* We no longer support the outdated Python 2. PyTorch 0.4.0 or higher is required to run the code.
## Training
1. Download the ADE20K scene parsing dataset:
```bash
chmod +x download_ADE20K.sh
./download_ADE20K.sh
```
2. Train a network (default: resnet50_dilated8_deepsup). During training, checkpoints will be saved in the folder ```ckpt```.
```bash
python3 train.py
```
3. Input arguments: (see the full list of input arguments via ```python3 train.py -h```)
If you find the code or pre-trained models useful, please cite the following papers:
Scene Parsing through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. Computer Vision and Pattern Recognition (CVPR), 2017.

@inproceedings{zhou2017scene,
title={Scene Parsing through ADE20K Dataset},
author={Zhou, Bolei and Zhao, Hang and Puig, Xavier and Fidler, Sanja and Barriuso, Adela and Torralba, Antonio},
booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
year={2017}
}
Unified Perceptual Parsing for Scene Understanding. T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. arXiv preprint
@article{xiao2018unified,
title={Unified Perceptual Parsing for Scene Understanding},
author={Xiao, Tete and Liu, Yingcheng and Zhou, Bolei and Jiang, Yuning and Sun, Jian},
journal={arXiv preprint},
year={2018}
}
Semantic Understanding of Scenes through ADE20K Dataset. B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442. (https://arxiv.org/pdf/1608.05442.pdf)