This repository was archived by the owner on Nov 1, 2024. It is now read-only.

ERROR: Unexpected segmentation fault encountered in worker. #146

@zbsean

During training, the following error is raised:
ERROR: Unexpected segmentation fault encountered in worker.

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 986, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/lib/python3.6/queue.py", line 173, in get
    self.not_empty.wait(remaining)
  File "/usr/lib/python3.6/threading.py", line 299, in wait
    gotit = waiter.acquire(True, timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 624) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/pycls/train1.0.1/train_net.py", line 24, in <module>
    main()
  File "/workspace/pycls/train1.0.1/train_net.py", line 20, in main
    dist.multi_proc_run(num_proc=cfg.NUM_GPUS, fun=trainer.train_model)
  File "/workspace/pycls/train1.0.1/pycls/core/distributed.py", line 146, in multi_proc_run
    fun(*fun_args, **fun_kwargs)
  File "/workspace/pycls/train1.0.1/pycls/core/trainer.py", line 185, in train_model
    train_epoch(*params, cur_epoch)
  File "/workspace/pycls/train1.0.1/pycls/core/trainer.py", line 88, in train_epoch
    for cur_iter, (inputs, labels) in enumerate(loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1182, in _next_data
    idx, data = self._get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1138, in _get_data
    success, data = self._try_get_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 999, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 624) exited unexpectedly
