# Tutorial 2: Customize Datasets

## Data configuration

The `data` variable in the config file defines the arguments used to build the datasets and dataloaders.

Here is an example of data configuration:

```python
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    train=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/training',
        ann_dir='annotations/training',
        pipeline=train_pipeline),
    val=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline),
    test=dict(
        type='ADE20KDataset',
        data_root='data/ade/ADEChallengeData2016',
        img_dir='images/validation',
        ann_dir='annotations/validation',
        pipeline=test_pipeline))
```

- `train`, `val` and `test`: The [`config`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/config.md)s used to build dataset instances for model training, validation and testing through the [`build and registry`](https://github.com/open-mmlab/mmcv/blob/master/docs/en/understand_mmcv/registry.md) mechanism (see the sketch after this list).

- `samples_per_gpu`: How many samples per batch and per GPU to load during model training. The training `batch_size` is `samples_per_gpu` times the number of GPUs, e.g. when using 8 GPUs for distributed data parallel training with `samples_per_gpu=4`, the `batch_size` is `8*4=16`.
  If you would like to define the `batch_size` for testing and validation, please use `test_dataloader` and `val_dataloader` with mmseg >= 0.24.1.

- `workers_per_gpu`: How many subprocesses per GPU to use for data loading. `0` means that the data will be loaded in the main process.
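
As a hedged sketch of the build mechanism mentioned above (assuming the mmseg 0.x Python API; the config path is only an example), the `type` key selects the dataset class from the registry and the remaining keys become its constructor arguments:

```python
from mmcv import Config
from mmseg.datasets import build_dataset, build_dataloader

cfg = Config.fromfile('configs/_base_/datasets/ade20k.py')  # example config path

# `type` is looked up in the DATASETS registry; the remaining keys
# in cfg.data.train become keyword arguments of the dataset class.
dataset = build_dataset(cfg.data.train)

# The dataloader arguments come from the `data` dict shown above.
loader = build_dataloader(
    dataset,
    samples_per_gpu=cfg.data.samples_per_gpu,
    workers_per_gpu=cfg.data.workers_per_gpu,
    dist=False,
    shuffle=True)
```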

**Note:** `samples_per_gpu` only takes effect during model training; for model testing and validation, mmseg uses `samples_per_gpu=1` by default (batch inference is not supported yet).

**Note:** before v0.24.1, every key in `data` except `train`, `val`, `test`, `samples_per_gpu` and `workers_per_gpu` had to be an input keyword argument for the PyTorch `DataLoader`, and the dataloaders used for model training, validation and testing shared the same input arguments.
Since v0.24.1, mmseg supports using `train_dataloader`, `test_dataloader` and `val_dataloader` to specify different keyword arguments; the overall arguments definition is still supported, but the specific dataloader settings take higher priority.

Here is an example of specific dataloader settings:

```python
data = dict(
    samples_per_gpu=4,
    workers_per_gpu=4,
    shuffle=True,
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    # Use a different batch size during validation and testing.
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```

Assuming a single GPU is used for model training and testing: since the overall arguments definition has lower priority, the training `batch_size` is `4` and the training dataset is shuffled, while the testing and validation `batch_size` is `1` and those datasets are not shuffled.
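
The priority rule amounts to a shallow merge in which the specific dataloader keys win. A minimal illustration in plain Python (not mmseg's actual merging code):

```python
# Overall dataloader arguments defined at the top level of `data`.
overall = dict(samples_per_gpu=4, workers_per_gpu=4, shuffle=True)
# Specific arguments for the validation dataloader.
val_dataloader = dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False)

# Specific settings override the overall ones.
val_loader_args = {**overall, **val_dataloader}
print(val_loader_args)
# {'samples_per_gpu': 1, 'workers_per_gpu': 4, 'shuffle': False}
```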

To make the data configuration clearer, we recommend using specific dataloader settings instead of the overall dataloader settings from v0.24.1 onward, like this:

```python
data = dict(
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    # Use specific dataloader settings
    train_dataloader=dict(samples_per_gpu=4, workers_per_gpu=4, shuffle=True),
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```

**Note:** for model training, mmseg's dataloader defaults are `shuffle=True` and `drop_last=True`; for model validation and testing, the defaults are `shuffle=False` and `drop_last=False`.
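
These defaults can be overridden per split through the specific dataloader settings (a sketch, assuming mmseg >= 0.24.1, with placeholder dataset types):

```python
data = dict(
    train=dict(type='xxx', ...),
    val=dict(type='xxx', ...),
    test=dict(type='xxx', ...),
    # drop_last=False keeps the final incomplete batch during training.
    train_dataloader=dict(samples_per_gpu=4, workers_per_gpu=4, shuffle=True, drop_last=False),
    val_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False),
    test_dataloader=dict(samples_per_gpu=1, workers_per_gpu=4, shuffle=False))
```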
 | 81 | + | 
## Customize datasets by reorganizing data

The simplest way is to convert your dataset by organizing your data into folders.
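
For instance, a layout along these lines would work (a hedged illustration; the folder names follow the `img_dir`/`ann_dir` keys used above and are only an example):

```none
data
├── my_dataset
│   ├── img_dir
│   │   ├── train
│   │   └── val
│   └── ann_dir
│       ├── train
│       └── val
```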
 | 