Description
The error is as follows:

```
Train: 0%| | 20/6336 [08:22<44:22:11, 25.29s/it]
Train: 0%| | 21/6336 [08:49<44:58:38, 25.64s/it][rank3]:[E708 01:47:23.901930947 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
[rank3]:[E708 01:47:23.902098000 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 3] failure detected by watchdog at work sequence id: 2901 PG status: last enqueued work: 2901, last completed work: 2900
[rank3]:[E708 01:47:23.902111454 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E708 01:47:23.097674411 ProcessGroupNCCL.cpp:681] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E708 01:47:23.097688863 ProcessGroupNCCL.cpp:695] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E708 01:47:23.098996101 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2ece773c74 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2ece7757d0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2ece7766ed in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2ece773c74 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2ece7757d0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2ece7766ed in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5c6fc (0x7f2ece3d16fc in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)
```
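For reference, the watchdog message above points out that FlightRecorder can be enabled via `TORCH_NCCL_TRACE_BUFFER_SIZE`, and that the collective timed out at the default 600000 ms. Below is a minimal sketch of how one might enable the trace buffer and raise the process-group timeout before rerunning; the 30-minute timeout, the buffer size of 2000, and the `init_distributed` helper are illustrative assumptions, not part of the original report.

```python
# Minimal sketch, assuming a torchrun launch with the NCCL backend.
#
# 1) Enable FlightRecorder so the stack trace of a failed collective is
#    recorded, as suggested by the watchdog log above (any non-zero value
#    works; 2000 is only an example):
#       export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
#
# 2) Raise the collective timeout from the default 10 minutes (600000 ms).

import datetime
import os

import torch
import torch.distributed as dist


def init_distributed() -> None:
    # torchrun sets LOCAL_RANK; bind each process to its own GPU before
    # creating the NCCL communicator.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # `timeout` applies to every collective issued through this process
    # group; 30 minutes is an illustrative value, not a recommendation.
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=30),
    )


if __name__ == "__main__":
    init_distributed()
```

Raising the timeout only hides the symptom if one rank is genuinely stuck; enabling the trace buffer is what makes the hung ALLREDUCE visible in the dump.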