imagenet: add rank indicator to progress summary when distributed #551

Open · wants to merge 1 commit into main

Conversation


@hartb hartb commented Apr 26, 2019

The imagenet example supports distributed execution. When run in distributed mode, multiple ranks perform training and testing, and all of them produce progress-meter output. Output from the various ranks is interleaved, so it is ambiguous which rank produced any given line of progress output.

This change adds a rank indicator to the progress prefix when distribution is in force:

Non-distributed:

$ python main.py --epochs 1 ...
=> creating model 'resnet18'
Epoch: [0][   0/5005]   Time 23.173 (23.173) ...
...
Test: [  0/196] Time 10.899 (10.899) ...
...

Distributed (note the additional [0] and [1] in the prefixes):

$ python main.py --epochs 1 --dist-url 'tcp://127.0.0.1:2200' \
  --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 ...
Use GPU: 1 for training
Use GPU: 0 for training
=> creating model 'resnet18'
=> creating model 'resnet18'
Epoch: [0][0][   0/5005]        Time 20.770 (20.770) ...
Epoch: [0][1][   0/5005]        Time 20.771 (20.771) ...
...
Test[0]: [  0/391]      Time  7.295 ( 7.295) ...
Test[1]: [  0/391]      Time  7.188 ( 7.188) ...
...
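Concretely, the change amounts to building the prefix string handed to the example's progress meter differently when distribution is active. A minimal sketch of the idea (the helper name `make_prefix` is hypothetical — the real change builds the string inline where the meters are created):

```python
def make_prefix(epoch, rank=None):
    """Build the training progress prefix; tag it with the rank when distributed."""
    rank_tag = "[{}]".format(rank) if rank is not None else ""
    return "Epoch: [{}]{}".format(epoch, rank_tag)

# Non-distributed run keeps the old prefix; distributed runs gain a rank tag.
print(make_prefix(0))           # Epoch: [0]
print(make_prefix(0, rank=1))   # Epoch: [0][1]
```

The per-batch counter (`[   0/5005]`) is appended by the meter itself, producing the prefixes shown in the output above.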

@hartb (Author) commented Apr 26, 2019

The rank indicator could be updated to include a more explicit hint of what it's for ("[R:..]").

This change omits the indicator when not distributed; it could instead always be included, perhaps with a non-numeric value ("[-]") in the non-distributed case.

And it could be made a proper part of ProgressMeter with a bit more plumbing.
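One shape the last suggestion could take — the rank becomes a constructor argument on the meter, with the "[-]" placeholder floated above for non-distributed runs. This is a hypothetical sketch of that follow-up, simplified from the example's ProgressMeter, not what this PR implements:

```python
class ProgressMeter:
    """Simplified progress meter with the rank folded in as a constructor
    argument (hypothetical follow-up; not what this PR implements)."""

    def __init__(self, num_batches, meters, prefix="", rank=None):
        self.num_batches = num_batches
        self.meters = meters
        self.prefix = prefix
        # Always show a rank slot; "[-]" marks a non-distributed run.
        self.rank_tag = "[{}]".format("-" if rank is None else rank)

    def display(self, batch):
        entries = [self.prefix + self.rank_tag
                   + "[{:>5d}/{}]".format(batch, self.num_batches)]
        entries += [str(m) for m in self.meters]
        print("\t".join(entries))

# Rank 0 of a distributed test pass:
ProgressMeter(391, [], prefix="Test: ", rank=0).display(0)
# Non-distributed run shows the placeholder slot:
ProgressMeter(196, [], prefix="Test: ").display(0)
```

This keeps every call site uniform (callers never concatenate rank tags by hand) at the cost of threading the rank through to each ProgressMeter construction.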
