
[Bug Report] Distributed Training RuntimeError  #156

Open
@mrzzmrzz

Description


Hey, I found a bug when using Distributed Data Parallel (DDP) training across multiple nodes.

I use 4 GPUs across two nodes (two GPUs per node). With this setup the code fails to run, although the same code runs fine with two GPUs on a single node.

Here is the log:

RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:210] address family mismatch
Traceback (most recent call last):
  File "script/downstream.py", line 47, in <module>
    working_dir = util.create_working_directory(cfg)
  File "/home/mrzz/util.py", line 38, in create_working_directory
    comm.init_process_group("nccl", init_method="env://")
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torchdrug/utils/comm.py", line 67, in init_process_group
    cpu_group = dist.new_group(backend="gloo")
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2503, in new_group
    pg = _new_process_group_helper(group_world_size,
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:210] address family mismatch
Killing subprocess 1345179
Killing subprocess 1345180

Then I commented out some code in comm.py, and luckily the code ran successfully:

# torchdrug/utils/comm.py
import torch.distributed as dist

cpu_group = None
gpu_group = None


def init_process_group(backend, init_method=None, **kwargs):
    """
    Initialize CPU and/or GPU process groups.

    Parameters:
        backend (str): Communication backend. Use ``nccl`` for GPUs and ``gloo`` for CPUs.
        init_method (str, optional): URL specifying how to initialize the process group.
    """
    global cpu_group
    global gpu_group

    dist.init_process_group(backend, init_method, **kwargs)
    gpu_group = dist.group.WORLD
    # Creating the separate gloo CPU group is what raises the address family
    # mismatch, so it is commented out and the NCCL world group is reused.
    # if backend == "nccl":
    #     cpu_group = dist.new_group(backend="gloo")
    # else:
    cpu_group = gpu_group

It seems that when running on multiple nodes, init_process_group fails while creating the CPU process group via dist.new_group(backend="gloo").
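
If the root cause is indeed Gloo resolving different address families (IPv4 vs. IPv6) on the two nodes when it opens its TCP pairs, one possible workaround (just a sketch, not tested here; eth0 is a placeholder for the actual interface name on your nodes) might be to pin both backends to the same network interface before initializing the process group:

import os
import torch.distributed as dist

# Force Gloo and NCCL onto the same interface on every node so all ranks
# resolve to the same address family (replace eth0 with your interface).
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

dist.init_process_group("nccl", init_method="env://")
# With the interface pinned, the extra gloo CPU group may initialize
# without the address family mismatch error.
cpu_group = dist.new_group(backend="gloo")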

I am not sure whether this analysis is right; perhaps you can look into this bug more thoroughly. Thank you for your work.
