
Commit ecda117

[Enhance] New-style CPU training and inference. (open-mmlab#1251)
* [Enhance] New-style CPU training and inference.
* assert mmcv version
* SyncBN to BN in training and testing
* upload untracked files to this branch
* delete gpu_ids
* fix bugs
* assert args.gpu_id in train.py
* use cfg.gpu_ids = [args.gpu_id]
* fix typos
1 parent 02d2790 commit ecda117

File tree

- docs/en/inference.md
- docs/en/train.md
- docs/zh_cn/inference.md
- docs/zh_cn/train.md
- mmseg/apis/train.py
- tools/test.py

6 files changed: +63 -5 lines changed

docs/en/inference.md

Lines changed: 5 additions & 0 deletions
````diff
@@ -6,6 +6,7 @@ and also some high-level apis for easier integration to other projects.
 ### Test a dataset
 
 - single GPU
+- CPU
 - single node multiple GPU
 - multiple node
 
@@ -15,6 +16,10 @@ You can use the following commands to test a dataset.
 # single-gpu testing
 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] [--show]
 
+# CPU: disable GPUs and run single-gpu testing script
+export CUDA_VISIBLE_DEVICES=-1
+python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}] [--show]
+
 # multi-gpu testing
 ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [--out ${RESULT_FILE}] [--eval ${EVAL_METRICS}]
 ```
````
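The effect of `CUDA_VISIBLE_DEVICES=-1` can also be checked from Python. A minimal sketch (not part of the diff): hiding the devices before `torch` first touches CUDA makes `torch.cuda.is_available()` return `False`, which is exactly the condition the new CPU branches test for.

```python
# Minimal sketch: hide all GPUs before torch initializes CUDA.
# Equivalent to `export CUDA_VISIBLE_DEVICES=-1` in the shell.
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

import torch  # imported after setting the env var on purpose

print(torch.cuda.is_available())  # False -> tools/test.py falls back to CPU
```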

docs/en/train.md

Lines changed: 14 additions & 0 deletions
````diff
@@ -33,6 +33,20 @@ python tools/train.py ${CONFIG_FILE} [optional arguments]
 
 If you want to specify the working directory in the command, you can add an argument `--work-dir ${YOUR_WORK_DIR}`.
 
+### Train with CPU
+
+The process of training on the CPU is consistent with single GPU training. We just need to disable GPUs before the training process.
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+```
+
+And then run the script [above](#train-with-a-single-gpu).
+
+```{warning}
+We do not recommend users to use CPU for training because it is too slow. We support this feature to allow users to debug on machines without GPU for convenience.
+```
+
 ### Train with multiple GPUs
 
 ```shell
````
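A quick way to confirm that optimization really runs on the CPU after the export is a toy gradient step (illustrative only, not part of mmseg):

```python
# Toy sanity check: one optimizer step with GPUs hidden, everything on CPU.
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # same effect as the export above

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()

print(next(model.parameters()).device)  # cpu
```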

docs/zh_cn/inference.md

Lines changed: 5 additions & 0 deletions
````diff
@@ -5,6 +5,7 @@
 ### Test a dataset
 
 - single GPU
+- CPU
 - single node multiple GPUs
 - multiple nodes
 
@@ -14,6 +15,10 @@
 # single-GPU testing
 python tools/test.py ${配置文件} ${检查点文件} [--out ${结果文件}] [--eval ${评估指标}] [--show]
 
+# CPU: disable GPUs and run the single-GPU testing script
+export CUDA_VISIBLE_DEVICES=-1
+python tools/test.py ${配置文件} ${检查点文件} [--out ${结果文件}] [--eval ${评估指标}] [--show]
+
 # multi-GPU testing
 ./tools/dist_test.sh ${配置文件} ${检查点文件} ${GPU数目} [--out ${结果文件}] [--eval ${评估指标}]
 ```
````

docs/zh_cn/train.md

Lines changed: 14 additions & 0 deletions
````diff
@@ -23,6 +23,20 @@ python tools/train.py ${配置文件} [可选参数]
 
 If you want to specify the working directory in the command, you can add the argument `--work-dir ${YOUR_WORK_DIR}`.
 
+### Train with CPU
+
+The workflow of CPU training is identical to single-GPU training; we only need to disable GPUs before training starts.
+
+```shell
+export CUDA_VISIBLE_DEVICES=-1
+```
+
+Then simply run the single-GPU training script.
+
+```{warning}
+We do not recommend users to use CPU for training because it is too slow. We support this feature to allow users to debug on machines without GPU for convenience.
+```
+
 ### Train with multiple GPUs
 
 ```shell
````

mmseg/apis/train.py

Lines changed: 6 additions & 3 deletions
````diff
@@ -2,13 +2,15 @@
 import random
 import warnings
 
+import mmcv
 import numpy as np
 import torch
 import torch.distributed as dist
 from mmcv.parallel import MMDataParallel, MMDistributedDataParallel
 from mmcv.runner import HOOKS, build_optimizer, build_runner, get_dist_info
 from mmcv.utils import build_from_cfg
 
+from mmseg import digit_version
 from mmseg.core import DistEvalHook, EvalHook
 from mmseg.datasets import build_dataloader, build_dataset
 from mmseg.utils import find_latest_checkpoint, get_root_logger
@@ -99,9 +101,10 @@ def train_segmentor(model,
             broadcast_buffers=False,
             find_unused_parameters=find_unused_parameters)
     else:
-        model = MMDataParallel(
-            model.cuda(cfg.gpu_ids[0]), device_ids=cfg.gpu_ids)
-
+        if not torch.cuda.is_available():
+            assert digit_version(mmcv.__version__) >= digit_version('1.4.4'), \
+                'Please use MMCV >= 1.4.4 for CPU training!'
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
     # build runner
     optimizer = build_optimizer(model, cfg.optimizer)
 
````
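The version gate added above can be exercised on its own. A minimal sketch, assuming `mmcv` and `mmseg` are importable, of the same check:

```python
# Minimal sketch of the CPU guard above: CPU-only runs require an MMCV
# release whose MMDataParallel can run without CUDA (>= 1.4.4 per the diff).
import mmcv
import torch

from mmseg import digit_version

if not torch.cuda.is_available():
    assert digit_version(mmcv.__version__) >= digit_version('1.4.4'), \
        'Please use MMCV >= 1.4.4 for CPU training!'
    print(f'CPU run permitted with MMCV {mmcv.__version__}')
```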
tools/test.py

Lines changed: 19 additions & 2 deletions
````diff
@@ -8,11 +8,13 @@
 
 import mmcv
 import torch
+from mmcv.cnn.utils import revert_sync_batchnorm
 from mmcv.parallel import MMDataParallel, MMDistributedDataParallel
 from mmcv.runner import (get_dist_info, init_dist, load_checkpoint,
                          wrap_fp16_model)
 from mmcv.utils import DictAction
 
+from mmseg import digit_version
 from mmseg.apis import multi_gpu_test, single_gpu_test
 from mmseg.datasets import build_dataloader, build_dataset
 from mmseg.models import build_segmentor
@@ -147,11 +149,18 @@ def main():
     cfg.model.pretrained = None
     cfg.data.test.test_mode = True
 
-    cfg.gpu_ids = [args.gpu_id]
+    if args.gpu_id is not None:
+        cfg.gpu_ids = [args.gpu_id]
 
     # init distributed env first, since logger depends on the dist info.
     if args.launcher == 'none':
+        cfg.gpu_ids = [args.gpu_id]
         distributed = False
+        if len(cfg.gpu_ids) > 1:
+            warnings.warn(f'The gpu-ids is reset from {cfg.gpu_ids} to '
+                          f'{cfg.gpu_ids[0:1]} to avoid potential errors in '
+                          'non-distributed testing.')
+            cfg.gpu_ids = cfg.gpu_ids[0:1]
     else:
         distributed = True
         init_dist(args.launcher, **cfg.dist_params)
@@ -236,7 +245,15 @@ def main():
         tmpdir = None
 
     if not distributed:
-        model = MMDataParallel(model, device_ids=[0])
+        warnings.warn(
+            'SyncBN is only supported with DDP. To be compatible with DP, '
+            'we convert SyncBN to BN. Please use dist_test.sh which can '
+            'avoid this error.')
+        if not torch.cuda.is_available():
+            assert digit_version(mmcv.__version__) >= digit_version('1.4.4'), \
+                'Please use MMCV >= 1.4.4 for CPU inference!'
+        model = revert_sync_batchnorm(model)
+        model = MMDataParallel(model, device_ids=cfg.gpu_ids)
     results = single_gpu_test(
         model,
         data_loader,
````
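To see in isolation what `revert_sync_batchnorm` does before the model is wrapped in `MMDataParallel`, a toy sketch (not from the diff, using a throwaway module):

```python
# Toy demonstration of mmcv's revert_sync_batchnorm: every SyncBatchNorm in
# the module tree is replaced by a DP/CPU-compatible batch-norm layer.
import torch.nn as nn
from mmcv.cnn.utils import revert_sync_batchnorm

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.SyncBatchNorm(8))
model = revert_sync_batchnorm(model)

assert not isinstance(model[1], nn.SyncBatchNorm)  # SyncBN has been swapped out
print(type(model[1]).__name__)  # a plain (non-sync) batch-norm variant
```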
