
ResNet training on ImageNet is failing #1071

Closed
@pruthvistony

Description


Environment

PyTorch: upstream code base, > 1.12
OS: Ubuntu 20.04
GPUs: 4

Steps to Reproduce

python imagenet/main.py -a resnet50 --dist-url tcp://127.0.0.1:8080 --dist-backend nccl --multiprocessing-distributed --world-size 1 --rank 0 <imagenet data dir> --epochs 3 --batch-size 256 -j64

Failure signature

731:731 [2] NCCL INFO comm 0x7f0078000ef0 rank 2 nranks 4 cudaDev 2 busId 88000 - Abort COMPLETE
730:730 [1] NCCL INFO comm 0x7f1750000ef0 rank 1 nranks 4 cudaDev 1 busId 3d000 - Abort COMPLETE
732:732 [3] NCCL INFO comm 0x7fe214000ef0 rank 3 nranks 4 cudaDev 3 busId b1000 - Abort COMPLETE
729:729 [0] NCCL INFO comm 0x7f2edc000ef0 rank 0 nranks 4 cudaDev 0 busId 1a000 - Abort COMPLETE
Traceback (most recent call last):
  File "main.py", line 516, in <module>
    main()
  File "main.py", line 117, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/var/lib/jenkins/examples/imagenet/main.py", line 278, in main_worker
    train(train_loader, model, criterion, optimizer, epoch, args)
  File "/var/lib/jenkins/examples/imagenet/main.py", line 331, in train
    loss = criterion(output, target)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 1166, in forward
    label_smoothing=self.label_smoothing)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2970, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cpu! (when checking argument for argument target in method wrapper_nll_loss_forward)
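The RuntimeError is raised while checking the target argument of cross_entropy_loss: the model output lives on cuda:1 while the target batch is still on the CPU when train() calls criterion(output, target). The snippet below is a minimal, standalone sketch of that mismatch and of one way to avoid it (moving the target onto the output's device before calling the criterion); it is not the code from imagenet/main.py and assumes a single visible CUDA/ROCm device.

import torch
import torch.nn as nn

# Standalone sketch of the device mismatch seen in the traceback above.
device = torch.device("cuda:0")
criterion = nn.CrossEntropyLoss().to(device)

output = torch.randn(4, 1000, device=device)   # model output on the GPU
target = torch.randint(0, 1000, (4,))          # labels left on the CPU, as in the failing run

# criterion(output, target)  # -> RuntimeError: Expected all tensors to be on the same device ...

# Moving the target to the output's device before computing the loss avoids the error:
target = target.to(device, non_blocking=True)
loss = criterion(output, target)
print(loss.item())

If this is the cause, the missing transfer would be somewhere on the training path in main.py rather than in PyTorch itself.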

Possible Regression

Resetting to commit 5a06e9c works fine, so this looks like a regression introduced after that commit.
