
Error when training a Gemma3 model #4917

Open

@HouTong-s

Description

The error is as follows:
Train: 0%| | 20/6336 [08:22<44:22:11, 25.29s/it]
Train: 0%| | 21/6336 [08:49<44:58:38, 25.64s/it][rank3]:[E708 01:47:23.901930947 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
[rank3]:[E708 01:47:23.902098000 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 3] failure detected by watchdog at work sequence id: 2901 PG status: last enqueued work: 2901, last completed work: 2900
[rank3]:[E708 01:47:23.902111454 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E708 01:47:23.097674411 ProcessGroupNCCL.cpp:681] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E708 01:47:23.097688863 ProcessGroupNCCL.cpp:695] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E708 01:47:23.098996101 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2ece773c74 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2ece7757d0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2ece7766ed in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2ece773c74 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2ece7757d0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2ece7766ed in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)

Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5c6fc (0x7f2ece3d16fc in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)

W0708 01:47:24.695000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7757 closing signal SIGTERM
W0708 01:47:24.697000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7758 closing signal SIGTERM
W0708 01:47:24.698000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7759 closing signal SIGTERM
W0708 01:47:24.699000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7761 closing signal SIGTERM
W0708 01:47:24.700000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7762 closing signal SIGTERM
W0708 01:47:24.702000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7763 closing signal SIGTERM
W0708 01:47:24.703000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 7764 closing signal SIGTERM
E0708 01:47:29.163000 7691 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 3 (pid: 7760) of binary: /usr/local/conda/envs/llm/bin/python3.11
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 922, in
main()
File "/usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/workdir/swift/cli/sft.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-07-08_01:47:24
host : psx148dxr7siz4a1-worker-0.psx148dxr7siz4a1.hadoop-poistar.svc.cluster.local.
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 7760)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 7760
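
Following the hint in the watchdog message above, what I plan to try next before re-running is enabling FlightRecorder, turning on NCCL logging, and raising the default 600000 ms collective timeout. A minimal sketch of the settings I intend to use (the buffer size and the two-hour timeout are my own guesses, not values from swift):

```python
import datetime
import os

import torch.distributed as dist

# Hint from the watchdog log: enable FlightRecorder so the stack trace of a
# failing collective is recorded (the buffer size here is a guess).
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"
# Standard NCCL logging, to see on which rank/step the ALLREDUCE stalls.
os.environ["NCCL_DEBUG"] = "INFO"

# Raise the default 600000 ms collective timeout. This only applies where the
# process group is created; under torchrun the usual env:// rendezvous
# variables (RANK, WORLD_SIZE, MASTER_ADDR, ...) must already be set.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(hours=2))
```

With the swift CLI I would simply export the two environment variables in the shell before launching torchrun; the Python form above is only to keep the sketch self-contained.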
