Description
Hi, authors. Thanks for your great work! I would like to know what changes I should make to run two training processes on one machine with 8 GPUs in total. I used the first four GPUs to train one model, and when I tried to use the remaining four for another model, I got the error message below:
```
Traceback (most recent call last):
  File "/mnt/sda/TEL_syn/main.py", line 185, in <module>
    handle_distributed(args_parser, os.path.expanduser(os.path.abspath(__file__)))
  File "/mnt/sda/TEL_syn/lib/utils/distributed.py", line 31, in handle_distributed
    _setup_process_group(args)
  File "/mnt/sda/TEL_syn/lib/utils/distributed.py", line 74, in _setup_process_group
    torch.distributed.init_process_group(
  File "/home/jiw010/anaconda3/envs/tel/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/jiw010/anaconda3/envs/tel/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
```
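If I read the traceback right, both jobs are rendezvousing on the same `MASTER_PORT` (PyTorch's `env://` rendezvous defaults to 29500), so the second job's `TCPStore` cannot bind. Here is a minimal sketch of the workaround I have in mind, assuming the repo goes through the standard `MASTER_ADDR`/`MASTER_PORT` environment variables; the `setup_process_group` helper name and port 29501 are my own placeholders, not code from this repo:

```python
import os
import torch.distributed as dist

def setup_process_group(rank: int, world_size: int, master_port: str = "29501") -> None:
    """Hypothetical helper: initialize one job's process group on its own port.

    PyTorch's env:// rendezvous defaults to port 29500, so a second job on
    the same machine must pick a different free port to avoid the TCPStore
    "Address already in use" collision seen above.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", master_port)  # must be unique per job
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```

In the same spirit, I would launch the second job restricted to its own GPUs and port, something like `CUDA_VISIBLE_DEVICES=4,5,6,7 MASTER_PORT=29501 python main.py ...`. Is that the intended way to do this with your launcher, or is there a flag I'm missing?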