Description
Hey, I found a bug when using Distributed Data Parallel (DDP) training across multiple nodes.
I use 4 GPUs in total (two GPUs per node, on two nodes at the same time). However, I cannot run the code successfully in this setup, even though it runs fine with two GPUs on a single node.
Here is the log:
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:210] address family mismatch
Traceback (most recent call last):
  File "script/downstream.py", line 47, in <module>
    working_dir = util.create_working_directory(cfg)
  File "/home/mrzz/util.py", line 38, in create_working_directory
    comm.init_process_group("nccl", init_method="env://")
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torchdrug/utils/comm.py", line 67, in init_process_group
    cpu_group = dist.new_group(backend="gloo")
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2503, in new_group
    pg = _new_process_group_helper(group_world_size,
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:210] address family mismatch
Killing subprocess 1345179
Killing subprocess 1345180
Then I commented out some code in comm.py, and luckily the code ran successfully:
def init_process_group(backend, init_method=None, **kwargs):
    """
    Initialize CPU and/or GPU process groups.

    Parameters:
        backend (str): Communication backend. Use ``nccl`` for GPUs and ``gloo`` for CPUs.
        init_method (str, optional): URL specifying how to initialize the process group
    """
    global cpu_group
    global gpu_group

    dist.init_process_group(backend, init_method, **kwargs)
    gpu_group = dist.group.WORLD
    # if backend == "nccl":
    #     cpu_group = dist.new_group(backend="gloo")
    # else:
    cpu_group = gpu_group
It seems that, when running on multiple nodes, init_process_group fails while creating the CPU process group via dist.new_group(backend="gloo").
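If that guess is right, one thing worth trying before patching the library is to pin every node to the same network interface, since Gloo's "address family mismatch" error usually means that the ranks exchanged addresses of different families (IPv4 vs. IPv6) or from different interfaces. Below is a minimal sketch of this idea; the interface name "eth0" and the reliance on the GLOO_SOCKET_IFNAME / NCCL_SOCKET_IFNAME environment variables are assumptions about the cluster, not a verified fix:

import os
import torch.distributed as dist

# Minimal sketch (not a verified fix): force Gloo and NCCL to use the same
# network interface on every node, so all ranks advertise addresses of the
# same family. "eth0" is a placeholder; replace it with the real interface.
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

# Same call pattern as torchdrug.utils.comm.init_process_group:
# an NCCL group for GPU tensors plus a Gloo group for CPU tensors.
dist.init_process_group("nccl", init_method="env://")
gpu_group = dist.group.WORLD
cpu_group = dist.new_group(backend="gloo")  # the call that previously failed

If setting these variables alone makes the original code work, that would support the idea that the Gloo CPU group resolves a different address family than the NCCL group on at least one node.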
I am not sure whether this analysis is right; maybe you can look into this bug more thoroughly. Thank you for your work.