imagenet: add rank indicator to progress summary when distributed #551
The imagenet example supports distributed training. When run in
distributed mode, multiple ranks perform training and testing, and each
one produces its own progress meter output. The output from the ranks is
interleaved, so it is ambiguous which rank produced any given line of
progress output.
This change adds a rank indicator to the progress prefix when running in
distributed mode:
Non-distributed:
$ python main.py --epochs 1 ...
=> creating model 'resnet18'
Epoch: [0][ 0/5005] Time 23.173 (23.173) ...
...
Test: [ 0/196] Time 10.899 (10.899) ...
...
Distributed (note the additional [0] and [1] in the prefixes):
$ python main.py --epochs 1 --dist-url 'tcp://127.0.0.1:2200' \
    --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 ...
Use GPU: 1 for training
Use GPU: 0 for training
=> creating model 'resnet18'
=> creating model 'resnet18'
Epoch: [0][0][ 0/5005] Time 20.770 (20.770) ...
Epoch: [0][1][ 0/5005] Time 20.771 (20.771) ...
...
Test[0]: [ 0/391] Time 7.295 ( 7.295) ...
Test[1]: [ 0/391] Time 7.188 ( 7.188) ...
...
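
For illustration, the prefix formats shown above could be produced by something like the following. This is a minimal sketch, not the code from this change: the helper names `epoch_prefix` and `test_prefix` are hypothetical, and it assumes the caller passes `rank=None` when not distributed and the process rank otherwise (for example, a value obtained from `torch.distributed.get_rank()`).

```python
from typing import Optional


def _rank_tag(rank: Optional[int]) -> str:
    # Hypothetical helper: empty when not distributed, "[<rank>]" otherwise.
    return "" if rank is None else "[{}]".format(rank)


def epoch_prefix(epoch: int, rank: Optional[int] = None) -> str:
    # Training prefix: "Epoch: [<epoch>]", followed by the rank tag if any.
    return "Epoch: [{}]{}".format(epoch, _rank_tag(rank))


def test_prefix(rank: Optional[int] = None) -> str:
    # Validation prefix: "Test: ", or "Test[<rank>]: " when distributed.
    return "Test{}: ".format(_rank_tag(rank))


if __name__ == "__main__":
    print(epoch_prefix(0))           # non-distributed training prefix
    print(epoch_prefix(0, rank=1))   # distributed training prefix, rank 1
    print(test_prefix())             # non-distributed test prefix
    print(test_prefix(rank=1))       # distributed test prefix, rank 1
```

Keeping the rank tag in one helper means the training and test prefixes stay consistent, and the non-distributed output is unchanged when no rank is supplied.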