Qwen2.5-7B-Base: error after some steps when training on very long texts #4105

leileilin · 2025-05-07T03:10:46Z

Describe the bug
What the bug is, and how to reproduce, better with screenshots
Training Qwen2.5-7B-Base on very long texts fails after a number of steps with the error "NCCL watchdog thread terminated with exception".
The training command is:
deepspeed --hostfile=/etc/mpi/hostfile \
    swift/cli/sft.py \
    --model $PRETRAIN_MODEL \
    --torch_dtype bfloat16 \
    --train_type full \
    --use_chat_template \
    --dataset $data_path \
    --packing true \
    --num_train_epochs 3 \
    --per_device_train_batch_size $per_node_bsz \
    --data_seed 42 \
    --weight_decay 0.1 \
    --learning_rate 1e-5 \
    --attn_impl flash_attn \
    --deepspeed zero3 \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --warmup_ratio 0.01 \
    --dataset_num_proc 8 \
    --system "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    --save_total_limit 5 \
    --save_strategy epoch \
    --eval_strategy no \
    --max_length 131072 \
    --truncation_strategy delete \
    --split_dataset_ratio 0 \
    --output_dir $output_dir \
    --use_liger_kernel true \
    --lazy_tokenize true \
    --use_hf

Screenshot of the error:

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
GPU: H800
CUDA version: 12.4
torch version: 2.6.0
OS:
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Additional context
Add any other context about the problem here

leileilin · 2025-05-07T09:57:10Z

Resolved by setting the following NCCL environment variables:
export NCCL_IB_DISABLE=0
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=eth
export NCCL_IB_HCA=mlx5
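
For anyone reusing this fix, here is a minimal sketch of how the variables might be set and checked before launching. The function name `set_nccl_ib_env` is hypothetical, and `NCCL_DEBUG=INFO` is an addition for diagnosis that was not part of the original fix. The values `eth`, `mlx5`, and GID index `3` come from the comment above and are specific to that cluster; NCCL treats the interface and HCA values as name prefixes, so they would match devices such as `eth0` or `mlx5_0`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: groups the exports from the comment above.
# All values are cluster-specific and may differ on other systems.
set_nccl_ib_env() {
  export NCCL_IB_DISABLE=0       # 0 leaves the InfiniBand/RoCE transport enabled
  export NCCL_IB_GID_INDEX=3     # GID index for RoCE; depends on the fabric setup
  export NCCL_SOCKET_IFNAME=eth  # prefix match on network interface names
  export NCCL_IB_HCA=mlx5        # prefix match on HCA device names
  export NCCL_DEBUG=INFO         # assumption: extra logging, not in the original fix
}

set_nccl_ib_env

# Show the NCCL settings that the training processes would inherit.
env | grep '^NCCL_' | sort
```

With `NCCL_DEBUG=INFO`, the NCCL initialization log should indicate which network transport was selected, which helps confirm whether the settings took effect on each node.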

leileilin closed this as completed May 7, 2025
