
ResNet training on ImageNet is failing #1071

Closed
@pruthvistony

Description


Environment

PyTorch: upstream code base, > 1.12
OS: Ubuntu 20.04
GPUs: 4

Steps to Reproduce

python imagenet/main.py -a resnet50 --dist-url tcp://127.0.0.1:8080 --dist-backend nccl --multiprocessing-distributed --world-size 1 --rank 0 <imagenet data dir> --epochs 3 --batch-size 256 -j64

Failure signature

731:731 [2] NCCL INFO comm 0x7f0078000ef0 rank 2 nranks 4 cudaDev 2 busId 88000 - Abort COMPLETE
730:730 [1] NCCL INFO comm 0x7f1750000ef0 rank 1 nranks 4 cudaDev 1 busId 3d000 - Abort COMPLETE
732:732 [3] NCCL INFO comm 0x7fe214000ef0 rank 3 nranks 4 cudaDev 3 busId b1000 - Abort COMPLETE
729:729 [0] NCCL INFO comm 0x7f2edc000ef0 rank 0 nranks 4 cudaDev 0 busId 1a000 - Abort COMPLETE
Traceback (most recent call last):
  File "main.py", line 516, in <module>
    main()
  File "main.py", line 117, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/var/lib/jenkins/examples/imagenet/main.py", line 278, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/var/lib/jenkins/examples/imagenet/main.py", line 331, in train
    loss = criterion(output, target)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2970, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument target in method wrapper_nll_loss_forward)
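The RuntimeError is raised while checking the target argument of cross_entropy_loss: the model output lives on cuda:1 while the target batch is still on the CPU when train() calls criterion(output, target). The snippet below is a minimal, standalone sketch of that mismatch and of one way to avoid it (moving the target onto the output's device before calling the criterion); it is not the code from imagenet/main.py and assumes a single visible CUDA/ROCm device.

import torch
import torch.nn as nn

# Standalone sketch of the device mismatch seen in the traceback above.
device = torch.device("cuda:0")
criterion = nn.CrossEntropyLoss().to(device)

output = torch.randn(4, 1000, device=device)   # model output on the GPU
target = torch.randint(0, 1000, (4,))          # labels left on the CPU, as in the failing run

# criterion(output, target)  # -> RuntimeError: Expected all tensors to be on the same device ...

# Moving the target to the output's device before computing the loss avoids the error:
target = target.to(device, non_blocking=True)
loss = criterion(output, target)
print(loss.item())

If this is the cause, the missing transfer would be somewhere on the training path in main.py rather than in PyTorch itself.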

Possible Regression

Resetting to commit 5a06e9c works fine, so this looks like a regression introduced after that commit.
