Some problems about loading Janus-Pro - traceback : Signal 11 (SIGSEGV) received by PID xxx #4134

SummerWXK · 2025-05-08T09:50:04Z

Environment: 2 × H100 GPUs, 36 CPU cores, 200 GB memory.

Trainer.tokenizer is now deprecated. You should use Trainer.processing_class instead.
Train: 0%| | 0/1090 [00:00<?, ?it/s]From v4.47 onwards, when a model cache is to be returned, generate will return a Cache instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set return_legacy_cache=True.
From v4.47 onwards, when a model cache is to be returned, generate will return a Cache instance instead by default (as opposed to the legacy tuple of tuples format). If you want to keep returning the legacy format, please set return_legacy_cache=True.
/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
/opt/conda/lib/python3.11/site-packages/torch/utils/checkpoint.py:87: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
W0508 14:14:06.465000 430694 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 430763 closing signal SIGTERM
E0508 14:14:07.130000 430694 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -11) local_rank: 0 (pid: 430762) of binary: /opt/conda/bin/python3.11
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 923, in <module>
    main()
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
Root Cause (first observed failure):
[0]:
  time      : 2025-05-08_14:14:06
  host      : janus-pro--b43e495ad452-kh4gatae3k
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 430762)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 430762
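
The launcher only reports that rank 0 died with exit code -11, which does not show where the crash happened. A negative exit code is the negated number of the signal that killed the process, so -11 corresponds to signal 11 (SIGSEGV). Below is a minimal sketch, not part of the original run, of how that mapping can be checked and how Python's standard `faulthandler` module could be enabled near the top of the training entry point so that a Python-level stack is written to stderr if the segfault occurs again. The helper name `describe_exitcode` is hypothetical and only for illustration.

```python
import faulthandler
import signal


def describe_exitcode(exitcode: int) -> str:
    """Translate a child-process exit code into a readable description.

    Hypothetical helper for illustration only.
    """
    if exitcode < 0:
        # A negative exit code means the process was killed by signal -exitcode.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"


# Ask the interpreter to dump the Python stack of all threads to stderr
# when it receives a fatal signal such as SIGSEGV.
faulthandler.enable()

print(describe_exitcode(-11))
```

The same handler can also be switched on without touching the code, by setting the environment variable `PYTHONFAULTHANDLER=1` or by starting the interpreter with `python -X faulthandler`. It shows only the Python frames that were active at the time of the crash, so it narrows down the failing call but may not identify the native library at fault.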


SummerWXK changed the title ~~Some problems about loading Janus-Pro~~ Some problems about loading Janus-Pro - traceback : Signal 11 (SIGSEGV) received by PID xxx May 8, 2025
