@xiexinch
Collaborator

Motivation

As title.

Modification

Update the structure:

  • Tutorial 4: Train and test with existing models
    • Training and testing on a single machine with a single GPU
      • Training on a single GPU
      • Testing on a single GPU
    • Training and testing on multiple GPUs and multiple machines
      • Training on multiple GPUs
      • Testing on multiple GPUs
      • Launch multiple jobs on a single machine
      • Train with multiple machines
    • Manage jobs with Slurm
      • Training on a cluster with Slurm
      • Testing on a cluster with Slurm

```python
# use the pre-trained model for the whole PSPNet
load_from = 'https://download.openmmlab.com/mmsegmentation/v0.5/pspnet/pspnet_r50-d8_512x1024_40k_cityscapes/pspnet_r50-d8_512x1024_40k_cityscapes_20200605_003338-2966598c.pth'  # model path can be found in model zoo
```
## Training and testing on a single machine with a single GPU
Collaborator

@MeowZheng Sep 15, 2022

Suggested change:
- ## Training and testing on a single machine with a single GPU
+ ## Training and testing on a single GPU

Comment on lines 6 to 7
MMSegmentation also provides out-of-the-box tools for training models.
This section will show how to train and test models on standard datasets.
Collaborator

@MeowZheng Sep 15, 2022

These lines read a little oddly and repeat the last sentence.

Difference between `--resume` and `load-from`:
`--resume` loads both the model weights and optimizer status, and the iteration is also inherited from the specified checkpoint.
**Note:** Difference between the argument `--resume` and the field `load-from` in the config file:
`--resume` loads both the model weights and optimizer status and the iteration is also inherited from the specified checkpoint.
Collaborator

`--resume` doesn't support loading weights.

`--resume` loads both the model weights and optimizer status and the iteration is also inherited from the specified checkpoint.
It is usually used for resuming the training process that is interrupted accidentally.

`load-from` only loads the model weights and the training iteration starts from 0. It is usually used for fine-tuning.
Collaborator

The note might not be required, as `--resume` doesn't support loading a specific checkpoint, and there might be confusion between `--resume` and `load_from`.


- `--work-dir`: If specified, results will be saved in this directory. If not specified, the results will be automatically saved to `work_dirs/{CONFIG_NAME}`.
- `--show`: Show prediction results at runtime, available when `--show-dir` is not specified.
- `--show-dir`: If specified, the visualized segmentation mask will be saved in the specified directory.
Collaborator

    parser.add_argument(
        '--show-dir',
        help='directory where painted images will be saved. '
        'If specified, it will be automatically saved '
        'to the work_dir/timestamp/show_dir')
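
For readers following this thread, here is a minimal, self-contained sketch of how the three flags discussed above could be declared and parsed with `argparse`. It is not the actual `tools/test.py`: the positional arguments, the help strings for `--work-dir` and `--show`, and the sample file names are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a toy parser that mirrors the flags discussed in this thread."""
    parser = argparse.ArgumentParser(description='Sketch of a test-script CLI')
    # Positional arguments are assumed here for illustration.
    parser.add_argument('config', help='path to the config file')
    parser.add_argument('checkpoint', help='path to the checkpoint file')
    parser.add_argument(
        '--work-dir',
        help='directory for results; a default is chosen if omitted')
    parser.add_argument(
        '--show', action='store_true', help='show prediction results')
    parser.add_argument(
        '--show-dir',
        help='directory where painted images will be saved')
    return parser


# Hypothetical file names, parsed from a list instead of sys.argv.
args = build_parser().parse_args(
    ['my_config.py', 'my_checkpoint.pth', '--show-dir', 'vis'])
print(args.show_dir, args.show, args.work_dir)
```

Because `--show` is a `store_true` flag, it stays `False` unless passed explicitly, which matches the documented rule that `--show` only applies when `--show-dir` is not given.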

### Launch multiple jobs on a single machine

If you launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs, you need to specify different ports (29500 by default) for each job to avoid communication conflict. Otherwise, there will be an error message saying `RuntimeError: Address already in use`.
If you use `dist_train.sh` to launch training jobs, you can set the port in commands with the environment variable \`PORT\`\`.
Collaborator

Suggested change:
- If you use `dist_train.sh` to launch training jobs, you can set the port in commands with the environment variable \`PORT\`\`.
+ If you use `dist_train.sh` to launch training jobs, you can set the port in commands with the environment variable `PORT`.
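
As an illustration of the port rule in this section, the sketch below prepares two 4-GPU jobs for one 8-GPU machine and checks that their ports differ. It only builds the commands and environments and does not launch anything. The use of `CUDA_VISIBLE_DEVICES` to split the GPUs, the second port `29501`, the config path, and the argument order passed to `dist_train.sh` (config file, then GPU count) are assumptions, not taken from this PR.

```python
import os

# Hypothetical config path; replace with a real one before launching.
CONFIG_FILE = 'configs/my_config.py'


def make_job(gpu_ids, port):
    """Return the command and environment for one dist_train.sh job."""
    env = dict(os.environ)
    # Assumption: GPUs are split between jobs via CUDA_VISIBLE_DEVICES.
    env['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in gpu_ids)
    env['PORT'] = str(port)  # the port variable named in the docs (default 29500)
    # Assumption: dist_train.sh takes the config file, then the GPU count.
    cmd = ['sh', 'tools/dist_train.sh', CONFIG_FILE, str(len(gpu_ids))]
    return cmd, env


jobs = [make_job([0, 1, 2, 3], 29500), make_job([4, 5, 6, 7], 29501)]

# Two jobs sharing a port would fail with "Address already in use".
ports = [env['PORT'] for _, env in jobs]
assert len(set(ports)) == len(ports), 'each job needs its own port'

for cmd, env in jobs:
    print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} "
          f"PORT={env['PORT']} {' '.join(cmd)}")
```

Each `(cmd, env)` pair could then be handed to `subprocess.Popen(cmd, env=env)` from the repository root.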

Comment on lines 29 to 81
**Note:** Difference between the argument `--resume` and the field `load-from` in the config file:

`load-from` only loads the model weights and the training iteration starts from 0. It is usually used for fine-tuning.
`--resume` only determines whether to resume from the latest checkpoint in the work_dir. It is usually used for resuming the training process that is interrupted accidentally.

### Training on CPU
`load-from` will specify the checkpoint to be loaded and the training iteration starts from 0. It is usually used for fine-tuning.

The process of training on the CPU is consistent with single GPU training if the machine does not have a GPU. If it has GPUs but you do not want to use them, you just need to disable the GPUs before the training process.
Collaborator

Note:
If you would like to resume training from a specific checkpoint, you can use `--resume` with `--cfg-options load_from=$CHECKPOINT`.
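
To summarize the behaviour settled on in this thread, the following toy function models how the two settings combine. It is a sketch of the semantics described in the comments above, not the library's implementation; the function name, the return values, and the checkpoint path are hypothetical.

```python
def describe_start(resume=False, load_from=None):
    """Model how --resume and load_from combine.

    Returns (checkpoint_source, continues_training), where
    continues_training is True if training picks up at the checkpoint's
    iteration and False if the iteration starts from 0.
    """
    if resume and load_from is not None:
        # --resume together with load_from: resume from that checkpoint.
        return load_from, True
    if resume:
        # --resume alone: resume from the latest checkpoint in the work_dir.
        return 'latest checkpoint in work_dir', True
    if load_from is not None:
        # load_from alone: load the checkpoint, iteration starts from 0.
        return load_from, False
    # Neither is set: these two settings load no checkpoint.
    return None, False


ckpt = 'work_dirs/my_exp/iter_20000.pth'  # hypothetical path
for kwargs in ({}, {'resume': True}, {'load_from': ckpt},
               {'resume': True, 'load_from': ckpt}):
    print(kwargs, '->', describe_start(**kwargs))
```

The third case is the fine-tuning path, and the fourth is the "resume from a specific checkpoint" case from the note.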

@MeowZheng MeowZheng merged commit 52ce34c into open-mmlab:1.x Sep 16, 2022
MeowZheng pushed a commit to MeowZheng/mmsegmentation that referenced this pull request Nov 1, 2022
* draft

* refine structure

* fix typo

* rename single gpu title and redefine --resume

* update introduction

* add notes to load_from
@MeowZheng MeowZheng added the Doc label Nov 2, 2022
wjkim81 pushed a commit to wjkim81/mmsegmentation that referenced this pull request Dec 3, 2023
nahidnazifi87 pushed a commit to nahidnazifi87/mmsegmentation_playground that referenced this pull request Apr 5, 2024

2 participants