Description
Running the upstream examples/imagenet training script in multiprocessing-distributed mode (4 GPUs, NCCL backend) crashes during training: cross_entropy receives the target tensor on the CPU while the model output is on the GPU, the loss call raises a device-mismatch RuntimeError, and all ranks abort their NCCL communicators.
Environment
PyTorch - upstream code base, > 1.12
OS - Ubuntu 20.04
GPUs - 4
Steps to Reproduce
python imagenet/main.py -a resnet50 --dist-url tcp://127.0.0.1:8080 --dist-backend nccl --multiprocessing-distributed --world-size 1 --rank 0 <imagenet data dir> --epochs 3 --batch-size 256 -j64
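For context, the same RuntimeError can be reproduced outside the script with a minimal standalone sketch (shapes and names below are illustrative, not taken from main.py): a CUDA model output paired with a target that was never moved off the CPU.

# Hypothetical standalone repro, assuming at least one visible CUDA device.
import torch
import torch.nn.functional as F

output = torch.randn(8, 1000, device="cuda:0")   # logits on the GPU, as the model would produce them
target = torch.randint(0, 1000, (8,))            # labels accidentally left on the CPU

# Raises the same "Expected all tensors to be on the same device" RuntimeError.
loss = F.cross_entropy(output, target)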
Failure signature
731:731 [2] NCCL INFO comm 0x7f0078000ef0 rank 2 nranks 4 cudaDev 2 busId 88000 - Abort COMPLETE
730:730 [1] NCCL INFO comm 0x7f1750000ef0 rank 1 nranks 4 cudaDev 1 busId 3d000 - Abort COMPLETE
732:732 [3] NCCL INFO comm 0x7fe214000ef0 rank 3 nranks 4 cudaDev 3 busId b1000 - Abort COMPLETE
729:729 [0] NCCL INFO comm 0x7f2edc000ef0 rank 0 nranks 4 cudaDev 0 busId 1a000 - Abort COMPLETE
Traceback (most recent call last):
  File "main.py", line 516, in <module>
    main()
  File "main.py", line 117, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/var/lib/jenkins/examples/imagenet/main.py", line 278, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/var/lib/jenkins/examples/imagenet/main.py", line 331, in train
    loss = criterion(output, target)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2970, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument target in method wrapper_nll_loss_forward)
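The traceback points at the target batch never being moved to the worker's GPU before the loss call in train(). A minimal sketch of the usual device handling in the training loop, following the examples/imagenet conventions (args.gpu, train_loader, model, criterion are assumptions about the local setup, not verified against line 331 of the installed main.py):

# Sketch of the expected pattern in train(); names follow the examples/imagenet
# script but are assumptions, not the exact upstream source.
for i, (images, target) in enumerate(train_loader):
    if args.gpu is not None:
        images = images.cuda(args.gpu, non_blocking=True)   # inputs to this worker's GPU
    if torch.cuda.is_available():
        # The target must end up on the same device as the model output,
        # otherwise cross_entropy fails exactly as in the traceback above.
        target = target.cuda(args.gpu, non_blocking=True)

    output = model(images)
    loss = criterion(output, target)

If the installed main.py skips or conditionally bypasses the target move, that would explain the cuda:1 vs cpu mismatch reported by rank 1.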