Description
The error is as follows:

```
Train: 0%| | 20/6336 [08:22<44:22:11, 25.29s/it]
Train: 0%| | 21/6336 [08:49<44:58:38, 25.64s/it][rank3]:[E708 01:47:23.901930947 ProcessGroupNCCL.cpp:629] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
[rank3]:[E708 01:47:23.902098000 ProcessGroupNCCL.cpp:2168] [PG ID 1 PG GUID 1 Rank 3] failure detected by watchdog at work sequence id: 2901 PG status: last enqueued work: 2901, last completed work: 2900
[rank3]:[E708 01:47:23.902111454 ProcessGroupNCCL.cpp:667] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank3]:[E708 01:47:23.097674411 ProcessGroupNCCL.cpp:681] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E708 01:47:23.097688863 ProcessGroupNCCL.cpp:695] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E708 01:47:23.098996101 ProcessGroupNCCL.cpp:1895] [PG ID 1 PG GUID 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2ece773c74 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2ece7757d0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2ece7766ed in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 1 PG GUID 1 Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2901, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600070 milliseconds before timing out.
Exception raised from checkTimeout at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:632 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x2b4 (0x7f2ece773c74 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x890 (0x7f2ece7757d0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f2ece7766ed in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #6: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)
Exception raised from ncclCommWatchdog at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1901 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2ecd42a1b6 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe5c6fc (0x7f2ece3d16fc in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7f2f177f95c0 in /usr/local/conda/envs/llm/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x7ea5 (0x7f2f21de1ea5 in /usr/lib64/libpthread.so.0)
frame #4: clone + 0x6d (0x7f2f211f9b0d in /usr/lib64/libc.so.6)
```
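For reference, the watchdog message above points out that FlightRecorder can be enabled via `TORCH_NCCL_TRACE_BUFFER_SIZE`, and that the collective timed out at the default 600000 ms. Below is a minimal sketch of how one might enable the trace buffer and raise the process-group timeout before rerunning; the 30-minute timeout, the buffer size of 2000, and the `init_distributed` helper are illustrative assumptions, not part of the original report.

```python
# Minimal sketch, assuming a torchrun launch with the NCCL backend.
#
# 1) Enable FlightRecorder so the stack trace of a failed collective is
#    recorded, as suggested by the watchdog log above (any non-zero value
#    works; 2000 is only an example):
#       export TORCH_NCCL_TRACE_BUFFER_SIZE=2000
#
# 2) Raise the collective timeout from the default 10 minutes (600000 ms).

import datetime
import os

import torch
import torch.distributed as dist


def init_distributed() -> None:
    # torchrun sets LOCAL_RANK; bind each process to its own GPU before
    # creating the NCCL communicator.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # `timeout` applies to every collective issued through this process
    # group; 30 minutes is an illustrative value, not a recommendation.
    dist.init_process_group(
        backend="nccl",
        timeout=datetime.timedelta(minutes=30),
    )


if __name__ == "__main__":
    init_distributed()
```

Raising the timeout only hides the symptom if one rank is genuinely stuck; enabling the trace buffer is what makes the hung ALLREDUCE visible in the dump.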