Description
While training the Qwen2.5-7B-base model, I hit the error below at around step 1020. What could be causing it? The log is as follows:
1021/100000 [45:39<51:20:53, 1.87s/it]
[rank2]:[E723 01:58:45.510486595 ProcessGroupNCCL.cpp:632] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=45390, OpType=ALLREDUCE, NumelIn=544997376, NumelOut=544997376, Timeout(ms)=600000) ran for 600061 milliseconds before timing out.
[rank2]:[E723 01:58:45.511299168 ProcessGroupNCCL.cpp:2268] [PG ID 1 PG GUID 1 Rank 2] failure detected by watchdog at work sequence id: 45390 PG status: last enqueued work: 45403, last completed work: 45389
[rank2]:[E723 01:58:45.511334660 ProcessGroupNCCL.cpp:670] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E723 01:58:45.511452851 ProcessGroupNCCL.cpp:2103] [PG ID 1 PG GUID 1 Rank 2] First PG on this rank to signal dumping.
[rank2]:[E723 01:58:45.541902539 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 2] Received a dump signal due to a collective timeout from this local rank and we will try our best to dump the debug info. Last enqueued NCCL work: 136, last completed NCCL work: 136.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank4]:[E723 01:58:45.542083210 ProcessGroupNCCL.cpp:1682] [PG ID 0 PG GUID 0(default_pg) Rank 4] Observed flight recorder dump signal from another rank via TCPStore.
[rank6]:[E723 01:58:45.542219196 ProcessGroupNCCL.cpp:1682] [PG ID 0 PG GUID 0(default_pg) Rank 6] Observed flight recorder dump signal from another rank via TCPStore.
[rank4]:[E723 01:58:45.542337765 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 4] Received a dump signal due to a collective timeout from rank 2 and we will try our best to dump the debug info. Last enqueued NCCL work: 136, last completed NCCL work: 136.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank2]:[E723 01:58:45.542448983 ProcessGroupNCCL.cpp:1533] [PG ID 0 PG GUID 0(default_pg) Rank 2] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank4]:[E723 01:58:45.542917575 ProcessGroupNCCL.cpp:1533] [PG ID 0 PG GUID 0(default_pg) Rank 4] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank6]:[E723 01:58:45.543140054 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 6] Received a dump signal due to a collective timeout from rank 2 and we will try our best to dump the debug info. Last enqueued NCCL work: 136, last completed NCCL work: 136.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank6]:[E723 01:58:45.543779382 ProcessGroupNCCL.cpp:1533] [PG ID 0 PG GUID 0(default_pg) Rank 6] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank1]:[E723 01:58:45.580819186 ProcessGroupNCCL.cpp:1682] [PG ID 0 PG GUID 0(default_pg) Rank 1] Observed flight recorder dump signal from another rank via TCPStore.
[rank1]:[E723 01:58:45.581267377 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 1] Received a dump signal due to a collective timeout from rank 2 and we will try our best to dump the debug info. Last enqueued NCCL work: 136, last completed NCCL work: 136.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank1]:[E723 01:58:45.581707931 ProcessGroupNCCL.cpp:1533] [PG ID 0 PG GUID 0(default_pg) Rank 1] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank0]:[E723 01:58:45.666361730 ProcessGroupNCCL.cpp:1682] [PG ID 0 PG GUID 0(default_pg) Rank 0] Observed flight recorder dump signal from another rank via TCPStore.
[rank5]:[E723 01:58:45.666422115 ProcessGroupNCCL.cpp:1682] [PG ID 0 PG GUID 0(default_pg) Rank 5] Observed flight recorder dump signal from another rank via TCPStore.
[rank0]:[E723 01:58:45.667272816 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 0] Received a dump signal due to a collective timeout from rank 2 and we will try our best to dump the debug info. Last enqueued NCCL work: 136, last completed NCCL work: 136.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank5]:[E723 01:58:45.667329643 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 5] Received a dump signal due to a collective timeout from rank 2 and we will try our best to dump the debug info. Last enqueued NCCL work: 136, last completed NCCL work: 136.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank0]:[E723 01:58:45.667749122 ProcessGroupNCCL.cpp:1533] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank5]:[E723 01:58:45.667984060 ProcessGroupNCCL.cpp:1533] [PG ID 0 PG GUID 0(default_pg) Rank 5] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank3]:[E723 01:58:46.288470984 ProcessGroupNCCL.cpp:1682] [PG ID 0 PG GUID 0(default_pg) Rank 3] Observed flight recorder dump signal from another rank via TCPStore.
[rank3]:[E723 01:58:46.288679749 ProcessGroupNCCL.cpp:1743] [PG ID 0 PG GUID 0(default_pg) Rank 3] Received a dump signal due to a collective timeout from rank 2 and we will try our best to dump the debug info. Last enqueued NCCL work: 136, last completed NCCL work: 136.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn't run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank3]:[E723 01:58:46.289001278 ProcessGroupNCCL.cpp:1533] [PG ID 0 PG GUID 0(default_pg) Rank 3] ProcessGroupNCCL preparing to dump debug info. Include stack trace: 1
[rank2]:[W723 01:59:45.511676624 ProcessGroupNCCL.cpp:2111] [PG ID 1 PG GUID 1 Rank 2] timed out after waiting for 60000ms flight recorder dumps to finish.
[rank2]:[E723 01:59:45.511813287 ProcessGroupNCCL.cpp:684] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E723 01:59:45.511844169 ProcessGroupNCCL.cpp:698] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
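
The log itself notes that no stack trace was captured for the failed collective because FlightRecorder is disabled, and suggests setting `TORCH_NCCL_TRACE_BUFFER_SIZE` to a non-zero value. A minimal sketch of doing that before the next run, assuming the variables are set in the training script before the process group is initialized. The helper name `enable_nccl_diagnostics` is hypothetical, and the buffer size of 2000 is an arbitrary choice, not a value taken from this issue. `NCCL_DEBUG=INFO` is added as well to get more verbose NCCL logging.

```python
import os


def enable_nccl_diagnostics(trace_buffer_size: int = 2000) -> dict:
    """Set environment variables that make an NCCL timeout easier to diagnose.

    Hypothetical helper: call it at the very top of the training script,
    before torch.distributed is initialized, so every rank picks it up.
    """
    if trace_buffer_size <= 0:
        raise ValueError("trace_buffer_size must be positive to enable FlightRecorder")

    settings = {
        # Non-zero value enables the FlightRecorder trace buffer named in the log.
        # 2000 is an assumed size, not a recommendation from the source.
        "TORCH_NCCL_TRACE_BUFFER_SIZE": str(trace_buffer_size),
        # Verbose NCCL logging, useful for spotting network or topology problems.
        "NCCL_DEBUG": "INFO",
    }
    os.environ.update(settings)
    return settings


applied = enable_nccl_diagnostics()
for name, value in applied.items():
    print(f"{name}={value}")
```

The same variables can instead be exported in the launch shell. With the recorder enabled, the next timeout should include a stack trace for the stuck `ALLREDUCE`, which helps tell a rank that never reached the collective apart from a communication failure.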