
[Bug Report] Distributed Training RuntimeError  #156

Open
@mrzzmrzz

Description


Hey, I found a bug when using Distributed Data Parallel (DDP) training across multiple nodes.

I use 4 GPUs across two nodes (two GPUs per node). With this setup the code fails to run, although the same code runs fine with two GPUs on a single node.

Here is the log:

RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:210] address family mismatch
Traceback (most recent call last):
  File "script/downstream.py", line 47, in <module>
    working_dir = util.create_working_directory(cfg)
  File "/home/mrzz/util.py", line 38, in create_working_directory
    comm.init_process_group("nccl", init_method="env://")
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torchdrug/utils/comm.py", line 67, in init_process_group
    cpu_group = dist.new_group(backend="gloo")
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2503, in new_group
    pg = _new_process_group_helper(group_world_size,
  File "/home/miniconda3/envs/torchdrug/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:210] address family mismatch
Killing subprocess 1345179
Killing subprocess 1345180

Then I commented out some code in comm.py, and luckily the code ran successfully:

# torchdrug/utils/comm.py
import torch.distributed as dist

cpu_group = None
gpu_group = None


def init_process_group(backend, init_method=None, **kwargs):
    """
    Initialize CPU and/or GPU process groups.

    Parameters:
        backend (str): Communication backend. Use ``nccl`` for GPUs and ``gloo`` for CPUs.
        init_method (str, optional): URL specifying how to initialize the process group.
    """
    global cpu_group
    global gpu_group

    dist.init_process_group(backend, init_method, **kwargs)
    gpu_group = dist.group.WORLD
    # Creating the separate gloo CPU group is what raises the address family
    # mismatch, so it is commented out and the NCCL world group is reused.
    # if backend == "nccl":
    #     cpu_group = dist.new_group(backend="gloo")
    # else:
    cpu_group = gpu_group

It seems that when running on multiple nodes, init_process_group fails while creating the CPU process group via dist.new_group(backend="gloo").
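
If the root cause is indeed Gloo resolving different address families (IPv4 vs. IPv6) on the two nodes when it opens its TCP pairs, one possible workaround (just a sketch, not tested here; eth0 is a placeholder for the actual interface name on your nodes) might be to pin both backends to the same network interface before initializing the process group:

import os
import torch.distributed as dist

# Force Gloo and NCCL onto the same interface on every node so all ranks
# resolve to the same address family (replace eth0 with your interface).
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")

dist.init_process_group("nccl", init_method="env://")
# With the interface pinned, the extra gloo CPU group may initialize
# without the address family mismatch error.
cpu_group = dist.new_group(backend="gloo")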

I am not sure whether this analysis is right; perhaps you can look into this bug more thoroughly. Thank you for your work.
