Question about resuming training from a checkpoint (resume_from_checkpoint) #3951

Open
shifop opened this issue Apr 21, 2025 · 1 comment
Comments

shifop commented Apr 21, 2025

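On a large dataset, resuming training does not work properly: the run appears to hang at 0% progress.

On a small dataset, resuming works fine.

Am I passing the wrong arguments?

Training script for the large dataset: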

CUDA_VISIBLE_DEVICES=0,1,2,3 \
HF_ENDPOINT=https://hf-mirror.com \
NPROC_PER_NODE=4 \
nohup swift sft \
    --model /HDD/train/output/Qwen2.5-0.5B-Instruct/v1-20250401-234003/checkpoint-68086 \
    --resume_from_checkpoint /HDD/train/output/checkpoint-68086/v9-20250420-214017/checkpoint-2000 \
    --resume_only_model false \
    --custom_register_path /HDD/train/convert/convert.py \
    --train_type full \
    --torch_dtype bfloat16 \
    --system "" \
    --dataset "/HDD/train/data/dataset/lemon/all.27946228.jsonl" \
    --val_dataset "/HDD/train/data/dataset/ecspell/test.jsonl" \
    --streaming true \
    --truncation_strategy delete \
    --num_train_epochs 1 \
    --learning_rate 5.89e-4 \
    --lr_scheduler_type 'cosine_with_min_lr' \
    --lr_scheduler_kwargs '{"min_lr":0.00001}' \
    --eval_steps 500 \
    --max_steps 10713 \
    --save_steps 500 \
    --max_length 220 \
    --warmup_ratio 0 \
    --gradient_accumulation_steps 66 \
    --per_device_train_batch_size 12 \
    --per_device_eval_batch_size 12 \
    --save_total_limit 5 > ./log/lemon.log &
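
One possible reading of the symptom, offered as an assumption and not confirmed anywhere in this thread: with `--streaming true` the dataset cannot be indexed, so a resume that restores the data position (`--resume_only_model false`) may have to read and discard every sample consumed before `checkpoint-2000` before the progress bar moves. Hugging Face `TrainingArguments` has an `ignore_data_skip` option for this situation; whether it applies to `swift sft` here is not verified. The sketch below only does the arithmetic from the flags in the command above and does not call ms-swift.

```python
# Values copied from the training command above; nothing here is measured.
resume_step = 2000        # step number of the checkpoint being resumed (checkpoint-2000)
max_steps = 10713         # --max_steps
grad_accum = 66           # --gradient_accumulation_steps
per_device_batch = 12     # --per_device_train_batch_size
num_processes = 4         # NPROC_PER_NODE

# Samples consumed by one optimizer step across all processes.
samples_per_step = grad_accum * per_device_batch * num_processes

# Samples a streaming loader would need to replay to reach the resume point,
# assuming (hypothetically) that skipped data is read sequentially.
samples_to_skip = resume_step * samples_per_step

print(f"samples per optimizer step: {samples_per_step:,}")
print(f"samples to replay before step {resume_step}: {samples_to_skip:,}")
print(f"share of the planned run already consumed: {resume_step / max_steps:.1%}")
```

If that reading is right, about 6.3 million samples would be streamed before the first new step. That would look like a hang at 0% on the large dataset, while a small dataset would get through the skip quickly.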

GMFTBY commented

I'm running into the same problem and urgently need a solution!
