Qwen3-8B-Base SFT full-parameter fine-tuning hangs after saving the first checkpoint #4053

leileilin · 2025-04-30T08:50:22Z

Describe the bug
What the bug is and how to reproduce it, preferably with screenshots.

Full-parameter SFT of Qwen3-8B-Base hangs after saving the first checkpoint, and the model output directory contains two checkpoint directories.

The command I ran:
deepspeed --hostfile=/etc/mpi/hostfile \
    swift/cli/sft.py \
    --model $PRETRAIN_MODEL \
    --torch_dtype bfloat16 \
    --train_type full \
    --use_chat_template \
    --dataset $data_path \
    --packing True \
    --num_train_epochs 3 \
    --per_device_train_batch_size $per_node_bsz \
    --data_seed 42 \
    --weight_decay 0.1 \
    --learning_rate 1e-5 \
    --attn_impl flash_attn \
    --deepspeed zero1 \
    --gradient_accumulation_steps $gradient_accumulation_steps \
    --warmup_ratio 0.01 \
    --dataset_num_proc 8 \
    --system "You are Qwen, created by Alibaba Cloud. You are a helpful assistant." \
    --save_total_limit 5 \
    --save_strategy epoch \
    --max_length 8192 \
    --truncation_strategy delete \
    --split_dataset_ratio 0 \
    --eval_strategy no \
    --output_dir $output_dir \
    --use_liger_kernel true \
    --lazy_tokenize true \
    --use_hf

Screenshot of the bug:

Your hardware and system info
Write your system info here, such as CUDA version, OS, GPU model, and torch version.

GPU model: H800
CUDA version: 12.4
torch version: 2.6.0
OS version:
NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Additional context
Add any other context about the problem here.

Jintao-Huang · 2025-05-01T10:28:41Z

At this point it is most likely still saving the model; the save is just slow.

Jintao-Huang · 2025-05-01T10:29:10Z

You could try --save_only_model true
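
For a rough sense of why this flag could matter, the sketch below estimates checkpoint size with and without optimizer state. It assumes about 8.2 billion parameters, bf16 weights (2 bytes per parameter), and an Adam-style optimizer that keeps an fp32 master copy plus two fp32 moment buffers (12 bytes per parameter). These figures are illustrative assumptions, not measurements from this run, and `estimate_checkpoint_gb` is a hypothetical helper name.

```python
def estimate_checkpoint_gb(num_params, weight_bytes=2, optim_bytes_per_param=12):
    """Return (weights_gb, optimizer_gb) as rough estimates in decimal GB.

    Assumptions (not taken from the issue): bf16 weights at 2 bytes per
    parameter, and fp32 master weights plus two fp32 Adam moments at
    12 bytes per parameter for the optimizer state.
    """
    weights_gb = num_params * weight_bytes / 1e9
    optimizer_gb = num_params * optim_bytes_per_param / 1e9
    return weights_gb, optimizer_gb


# Assumption: roughly 8.2e9 parameters for an 8B-class model.
weights_gb, optimizer_gb = estimate_checkpoint_gb(8.2e9)
print(f"weights only:        ~{weights_gb:.1f} GB")
print(f"weights + optimizer: ~{weights_gb + optimizer_gb:.1f} GB")
```

Under these assumptions the optimizer state is several times larger than the weights themselves, so skipping it would shrink each save considerably. The trade-off is that a checkpoint saved this way cannot be used to resume training with the optimizer state intact.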

leileilin · 2025-05-06T07:02:14Z

> At this point it is most likely still saving the model; the save is just slow.

OK. Oddly enough, the problem did not seem to come back afterwards, but at the time the save really was absurdly slow...
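
If this comes up again, one heuristic for telling a slow save from a true hang is to check whether the checkpoint directory is still growing. The sketch below is only an illustration: `dir_size_bytes` and `is_still_growing` are hypothetical helper names, and the 30-second interval is an arbitrary assumption. A directory that has stopped growing does not prove a hang, since writes can be buffered, but steady growth suggests the save is still making progress.

```python
import os
import time


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file may be renamed or removed while a save is in progress.
                pass
    return total


def is_still_growing(path, interval_s=30.0):
    """Return True if the directory grew during the sampling interval."""
    before = dir_size_bytes(path)
    time.sleep(interval_s)
    return dir_size_bytes(path) > before


# Example call (the path is a placeholder, not taken from the issue):
# is_still_growing("/path/to/output_dir")
```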

leileilin closed this as completed May 6, 2025
