# Tutorial 2: Customize Datasets

## Data configuration

In the config file, `data` is the variable for data configuration; it defines the arguments used to build the datasets and dataloaders.

Here is an example of data configuration:

```python
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/training',
        ann_dir='annotations/training',
        pipeline=train_pipeline),
    val=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline),
    test=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline))
```

- `train`, `val` and `test`: The [`config`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/config.md)s used to build dataset instances for model training, validation and testing via the [`build and registry`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/registry.md) mechanism (see the sketch after this list).

- `samples_per_gpu`: How many samples per gpu to load in each batch during model training. The training `batch_size` equals `samples_per_gpu` times the number of gpus, e.g. when using 8 gpus for distributed data parallel training with `samples_per_gpu=4`, the `batch_size` is `8*4=32`.
  If you would like to define the `batch_size` for testing and validation, please use `test_dataloader` and `val_dataloader` with mmseg >=0.24.1.

- `workers_per_gpu`: How many subprocesses per gpu to use for data loading. `0` means that the data will be loaded in the main process.
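
As a rough sketch, a `data` config like the one above is consumed via the registry helpers. The snippet below assumes mmseg 0.x's `build_dataset` and `build_dataloader` and a hypothetical config path; exact keyword arguments may vary between versions:

```python
from mmcv import Config
from mmseg.datasets import build_dataloader, build_dataset

cfg = Config.fromfile('configs/my_config.py')  # hypothetical config path

# Build the training dataset instance from its config dict via the registry.
train_dataset = build_dataset(cfg.data.train)

# Wrap it in a dataloader; the per-gpu batch size comes from samples_per_gpu.
train_loader = build_dataloader(
    train_dataset,
    samples_per_gpu=cfg.data.samples_per_gpu,
    workers_per_gpu=cfg.data.workers_per_gpu,
    dist=False,  # single-gpu, non-distributed run
    shuffle=True)
```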

**Note:** `samples_per_gpu` only works during model training; for model testing and validation, the default `samples_per_gpu` in mmseg is 1 (batch inference is not supported yet).

**Note:** Before v0.24.1, except for `train`, `val`, `test`, `samples_per_gpu` and `workers_per_gpu`, the other keys in `data` had to be valid keyword arguments for the pytorch dataloader, and the dataloaders used for model training, validation and testing shared the same arguments.
Since v0.24.1, mmseg supports `train_dataloader`, `test_dataloader` and `val_dataloader` to specify different keyword arguments; the overall arguments definition is still supported, but a specific dataloader setting has higher priority.

Here is an example of a specific dataloader setting:

```python
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    shuffle=True,
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    # Use different batch size during validation and testing.
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```

Assume only one gpu is used for model training and testing. Since the overall arguments definition has lower priority, the `batch_size` for training is `4` and the training dataset is shuffled, while the `batch_size` for validation and testing is `1` and those datasets are not shuffled.
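
Conceptually, each specific dataloader config is merged on top of the overall arguments, roughly like this (a simplified sketch of the merge, not mmseg's exact code):

```python
# Overall dataloader arguments taken from `data`.
loader_cfg = dict(samples_per_gpu=4, workers_per_gpu=4, shuffle=True)

# Keys in val_dataloader override the overall ones.
val_loader_cfg = {**loader_cfg,
                  **dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False)}
# -> {'samples_per_gpu': 1, 'workers_per_gpu': 4, 'shuffle': False}
```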

To make the data configuration clearer, we recommend using the specific dataloader settings instead of the overall dataloader setting from v0.24.1 on, like this:

```python
data = dict(
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    # Use specific dataloader settings.
    train_dataloader=dict(samples_per_gpu=4, workers_per_gpu=4, shuffle=True),
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```

**Note:** In model training, mmseg's default dataloader values are `shuffle=True` and `drop_last=True`; in model validation and testing, the defaults are `shuffle=False` and `drop_last=False`.
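
These defaults can be overridden per dataloader. For example, a hypothetical tweak that keeps the last incomplete training batch, assuming `drop_last` is forwarded to the underlying pytorch dataloader:

```python
data = dict(
    train=dict(type='xxx'),  # ... dataset settings as in the examples above
    val=dict(type='xxx'),
    test=dict(type='xxx'),
    # Keep the last incomplete batch during training as well.
    train_dataloader=dict(
        samples_per_gpu=4, workers_per_gpu=4, shuffle=True, drop_last=False))
```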

## Customize datasets by reorganizing data

The simplest way is to convert your dataset so that your data is organized into folders.
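
For example, a layout matching the `img_dir`/`ann_dir` settings in the config above might look like this (directory names are illustrative):

```text
data/my_dataset
├── images
│   ├── training
│   └── validation
└── annotations
    ├── training
    └── validation
```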